本篇博文主要内容为 2026-05-05 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-05-05)
今日共更新1241篇论文,其中:
- 自然语言处理共156篇(Computation and Language (cs.CL))
- 人工智能共387篇(Artificial Intelligence (cs.AI))
- 计算机视觉共257篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共359篇(Machine Learning (cs.LG))
- 多智能体系统共26篇(Multiagent Systems (cs.MA))
- 信息检索共30篇(Information Retrieval (cs.IR))
- 人机交互共39篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] Executor-Side Progressive Risk-Gated Actuation for Agent ic AI in Wireless Supervisory Control
【速读】:该论文旨在解决生成式 AI (Generative AI) 在 O-RAN 无线监控控制自动化中,意图(intent)执行时缺乏明确语义决策机制的问题,尤其是在面对过时遥测数据、并发策略、时限与带宽限制、回滚约束等场景下,无法准确判断意图是否应提交、暂存或拒绝。解决方案的关键在于提出一种执行端契约——渐进式风险门控执行(Progressive Risk-Gated Actuation, PRGA),其将每个意图结构化为三个阶段:本地初步筛查(C0)、按需协调证据获取(C1)和事后溯源支持(C2),其中 C2 不参与在线安全路径;通过两阶段确定性策略验证时效性、新鲜度、回滚有效性、局部冲突、前置条件阻塞及规划器-执行器风险偏差,并仅在时限与带宽允许时检索 C1,强制要求证据的门控会拒绝缺失必要证据的意图,从而实现对控制平面效率和监督响应性的提升,同时严格维持在预设的 0.5% 不安全动作容忍边界内。
链接: https://arxiv.org/abs/2605.02697
作者: Zhenyu Liu,Yi Ma,Rahim Tafazolli
机构: 6GIC, Institute for Communication Systems, University of Surrey (萨里大学)
类目: ystems and Control (eess.SY); Multiagent Systems (cs.MA)
备注:
Abstract:Agentic artificial intelligence (AI) shows promise for automating O-RAN wireless supervisory control, but translated intents still require an executor-side decision before live network actuation. Existing control flows lack explicit semantics for whether an intent should commit, gate for evidence, or reject under stale telemetry, concurrent policies, deadline and bandwidth limits, and rollback constraints. We propose Progressive Risk-Gated Actuation (PRGA), an executor-side contract for risk-gated wireless intent execution. PRGA structures each intent into executable local triage (C0), on-demand coordination evidence (C1), and post-hoc provenance support (C2), with C2 kept off the online safety path. A deterministic two-stage policy checks expiry, freshness, rollback-handle validity, local conflict, blocking preconditions, and planner-executor risk divergence from C0, then retrieves C1 only for gated intents when deadline and bandwidth budgets allow; evidence-mandatory gates reject when required C1 is unavailable. On two 3GPP-parameterized energy-saving and slice-SLA benchmarks, PRGA reduces time-to-first-safe-action by 23.3-27.4% and per-commit control-plane bytes by 52.7-54.2% against a decision-identical eager full-evidence cost-overlay comparator, thereby isolating retrieval-cost accounting; remains non-inferior within a pre-declared 0.5 percentage-point unsafe-action margin against an invariant-respecting static-threshold comparator; and rejects 100% of injected over-threshold stale inputs in the stale-state fault campaign. On these benchmarks, PRGA improves supervisory responsiveness and control-plane efficiency within the evaluated unsafe-action boundary.
[MA-1] When Stress Becomes Signal: Detecting Antifrag ility-Compatible Regimes in Multi-Agent LLM Systems
【速读】:该论文试图解决的问题是:如何在多智能体大语言模型(Multi-agent LLM)系统中识别出能够支持未来抗脆弱性学习(antifragile learning)的结构化压力响应机制,而不仅仅是评估其在扰动下的鲁棒性(robustness)。传统评估通常关注性能是否在扰动下保持稳定,但本文提出,语义应力(semantic stress)可能揭示出可被利用的学习结构,从而推动系统从压力中进化。解决方案的关键在于提出 CAFE(Controlled Antifragility Framework for Evaluation),这是一个统计框架,通过建模受控的语义应力分布、从多维裁判信号重构架构特定的有效应力分布,并利用凸应力势下的分布 Jensen Gap 来检测是否存在抗脆弱性兼容的应力几何结构。若 Jensen Gap 为正,则表明观测到的应力分布呈现凸扩张变形,暗示存在可学习的压力结构,即使当前平均表现下降,也提示未来可通过针对性训练实现性能提升。
链接: https://arxiv.org/abs/2605.02463
作者: Jose Manuel de la Chica,Juan Manuel Vera,Jairo Rodríguez
机构: Santander AI Lab
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Multi-agent LLM systems are increasingly used to solve complex tasks through decomposition, debate, specialization, and ensemble reasoning. However, these systems are usually evaluated in terms of robustness: whether performance is preserved under perturbation. This paper studies a different question: whether semantic stress exposes structured variation that could support future antifragile learning. We introduce CAFE, a statistical framework for detecting antifragility-compatible regimes in multi-agent architectures. CAFE models a controlled expected distribution of semantic stressors, reconstructs an architecture-specific observed effective stress distribution from multi-dimensional judge signals, and compares both distributions using a distributional Jensen Gap under a convex stress potential. A positive gap does not imply immediate performance improvement; instead, it indicates a convex-expansive deformation of the observed stress distribution, suggesting that the architecture exposes learnable stress structure. We evaluate CAFE on a banking-risk analysis benchmark with five multi-agent architectures: flat, hierarchical, debate, meta-adaptive, and ensemble. Across all architectures, semantic stress reduces average judged quality by roughly one third. Yet all architectures exhibit positive distributional Jensen Gaps with bootstrap confidence intervals above zero. These results show that immediate quality degradation can coexist with statistically detectable antifragility-compatible stress geometry. CAFE is therefore not an antifragile learner itself, but a measurement layer for identifying when and where antifragility learning may be worth applying.
[MA-2] FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
【速读】:该论文旨在解决用户任务描述与工具文档之间存在的语义鸿沟(semantic gap)问题,尤其在API生态系统规模扩展至数万个端点时,静态检索难以适应代理在执行过程中动态演变的理解需求。解决方案的关键在于提出FitText框架,其通过将检索过程嵌入代理的推理循环中实现动态检索:首先生成自然语言伪工具描述作为检索探针,再利用检索反馈迭代优化这些描述,并通过随机生成探索多样化备选方案;进一步引入遗传检索(Memetic Retrieval),基于工具记忆施加进化选择压力以避免冗余搜索。实验表明,该方法显著提升了检索精度和任务成功率,在ToolRet和StableToolBench上分别取得平均检索排名从8.81提升至2.78及pass rate提升24个百分点。
链接: https://arxiv.org/abs/2605.02411
作者: Kyle Zheng,Han Zhang,Renliang Sun,Chenchen Ye,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent’s understanding of what it needs evolves during execution, but its tool set does not. We introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent’s reasoning loop. FitText generates natural-language pseudo-tool descriptions as retrieval probes, refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (43k tools, 4 domains), FitText improves average retrieval rank from 8.81 to 2.78; on StableToolBench (16,464 APIs), it achieves a 0.73 average pass rate–a 24-point absolute gain over static query retrieval. The gains transfer across base models capable of acting as competent semantic operators; under weaker base models, Memetic’s evolutionary search inverts–amplifying noise rather than refining signal–surfacing model capacity as a prerequisite for evolutionary tool exploration.
[MA-3] LLM -enabled Social Agents
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)驱动的智能体在社会交互中缺乏社会可理解性的问题,即尽管LLMs具备流利的语言能力,但其行为仍难以体现角色、规范、意图和情境约束等社会性要素,从而限制了其在真实社会环境中的有效参与。解决方案的关键在于将LLM智能体的行为基础从单纯的自然语言生成转向基于“人格描述”(persona descriptions)的角色定义,通过操作化地建模角色特征,使智能体能够依据明确的社会角色进行推理与决策,从而实现从语言能力向社会行为的转化。
链接: https://arxiv.org/abs/2605.02335
作者: Önder Gürcan,Moharram Challenger
机构: NORCE Research AS (挪威研究公司); University of Antwerp (安特卫普大学); Flanders Make (弗拉芒制造)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure, Hybrid Human Artificial Intelligence (HHAI) 2026
Abstract:Large Language Models (LLMs) have transformed agent-agent and human-agent interaction by enabling software, physical, and simulation agents to communicate and deliberate through natural language. Yet fluent language use does not by itself yield socially intelligible behaviour. Most current systems remain weakly grounded in roles, norms, intentions, and contextual constraints, limiting their capacity for meaningful participation in social environments. This paper develops a conceptual baseline for LLM-enabled social agents by arguing that they should be grounded in role definitions operationalized through persona descriptions. On this basis, we outline research directions for representation, hybrid control, and evaluation. The paper concludes that persona-based role definitions are a necessary foundation for turning language competence into social behaviour.
[MA-4] SOTOPIA-TOM: Evaluating Information Management in Multi-Agent Interaction with Theory of Mind
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多主体(multi-party)交互场景中处理信息不对称(information asymmetry)能力不足的问题,即如何在隐私敏感环境中恰当地判断何时以及向谁披露信息。现有基准测试无法有效评估这一能力,因此作者提出SOTOPIA-TOM框架,其关键在于构建一个支持公开(广播)与私密(直接消息)通信的交互环境,并设计了涵盖八个行业领域的160个经人工审核的场景,每个场景包含3至5个拥有部分私有知识且受渠道依赖共享策略约束的代理(agent)。通过多维评估体系衡量代理的信息共享、缺失信息获取、协调效率和隐私保护能力,并引入复合指标INFOMGMT进行综合评价,结果表明即使是最先进的模型(如GPT-5)也仅获得62%的INFOMGMT分数,凸显当前LLM代理在信息获取和隐私意识决策上的持续缺陷;进一步地,基于理论心智(Theory of Mind, ToM)的干预策略显著改善了协调与隐私之间的平衡,例如ToM-Coach将GPT-4o的关键隐私违规率从9.9%降至2.2%,同时使INFOMGMT得分提升超过2.5倍(从15%升至40%),验证了该框架作为可扩展测试平台对发展更具隐私意识和心智理论能力的多智能体系统的重要性。
链接: https://arxiv.org/abs/2605.02307
作者: Yashwanth YS,Ruichen Wang,Shihua Zeng,Xuhui Zhou,Koichi Onoue,Vasudha Varadarajan,Maarten Sap
机构: Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; Fujitsu Technologies, Pittsburgh, USA
类目: Multiagent Systems (cs.MA)
备注: 37 pages, 22 Figures
Abstract:As LLM-based agents are increasingly interacting in multi-party settings, they need to properly handle information asymmetry, i.e., knowing when and to whom to disclose information is appropriate. Yet, existing benchmarks fail to measure this ability in realistic multi-party settings. Thus, we introduce SOTOPIA-TOM, a multi-dimensional benchmarking framework to evaluate LLM agents’ ability to successfully navigate information asymmetric and privacy sensitive multi-party interactions. We create an interaction environment which enables both public (broadcast) and private (direct message) communication, and craft 160 human-reviewed scenarios across eight industry sectors, each involving 3 to 5 agents with partitioned private knowledge and channel-dependent sharing policies. To measure interaction abilities, we create a multi-dimensional evaluation framework to assess how well agents share useful information, seek missing details, coordinate efficiently, and protect privacy, which we also combine into a composite INFOMGMT metric. Results show that, across 6 LLM backbones and prompting strategies (vanilla, CoT-privacy, and ToM-based interventions), even the largest high-reasoning model (GPT-5) reaches only a 62% INFOMGMT score, which indicates persistent deficiencies in information seeking and privacy-aware decision-making. Additionally, ToM-based interventions more consistently improve the overall coordination-privacy balance (for example, relative to the vanilla baseline, ToM-Coach reduces critical privacy violations on GPT-4o from 9.9% to 2.2% while increasing the composite InfoMgmt score more than 2.5x from 15% to 40%). Overall, SOTOPIA-TOM exposes persistent limitations of current LLM agents in complex, information-asymmetric coordination and provides an extensible testbed for developing more privacy-aware, theory-of-mind capable multi-agent systems.
[MA-5] Distributed Observer-based Fault Detection over Intelligent Networked Multi-Vehicle Systems
【速读】:该论文旨在解决混合交通环境中(人类驾驶车辆与自动驾驶车辆共存)传感器测量异常检测与隔离问题,尤其是在缺乏局部可观测性假设下,如何实现分布式故障/攻击检测。其关键解决方案在于设计基于本地残差的故障检测与隔离(Fault Detection and Isolation, FDI)策略,每个自动驾驶车辆(CAV)通过分布式一致性观测器跟踪人类驾驶车辆(HDV)状态,并利用概率阈值设计对残差进行分析,从而在无需中央处理单元的情况下实现本地化异常检测。该方法不依赖噪声具有有界支撑的假设,更贴合实际多车运输系统的不确定性特征。
链接: https://arxiv.org/abs/2605.02235
作者: Mohammadreza Doostmohammadian,Hamid R. Rabiee
机构: 未知
类目: ystems and Control (eess.SY); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Signal Processing (eess.SP); Optimization and Control (math.OC)
备注: European journal of control
Abstract:Decentralized strategies are of interest for local decision-making over multi-vehicle networks. This paper studies mixed traffic networks of human-driven and autonomous vehicles with partial sensor measurements. The idea is to enable the group of connected autonomous vehicles (CAVs) to track the state of a group of human-driven vehicles (HDVs) via distributed consensus-based observers/estimators. Particularly, we make no assumption that the group of HDVs is locally observable in the direct neighborhood of any CAV. Then, the main contribution is to design local residual-based fault detection and isolation (FDI) at every CAV to detect possible faults/attacks in the sensor measurements. This distributed detection strategy enables every CAV to locally find possible anomalies in its taken sensor measurement with no need for a central processing unit. Two FDI logics are proposed with and without considering the history of the residuals. These FDI techniques are based on probabilistic threshold design on the residuals (in contrast to the existing deterministic threshold FDI techniques) with no assumption that the noise is of bounded support. This is more realistic in real-world multi-vehicle transportation systems.
[MA-6] Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning
【速读】:该论文旨在解决语言模型(Language Model, LM)驱动的智能体在执行长周期任务时面临的规划与推理能力不足的问题。其解决方案的关键在于提出一种多智能体框架,将自动化任务分解为三个角色:规划器(planner)负责高层决策、执行器(actor)负责具体操作、记忆管理者(memory manager)负责上下文推理;并通过系统的计算资源分配分析发现,规划是影响任务性能的主要因素,而执行和记忆管理所需的计算资源和模型容量远低于规划模块。基于此洞察,作者设计了一种以规划器为中心的强化学习方法,仅对规划器进行优化,并利用视觉语言模型(Vision-Language Model, VLM)作为裁判提供轨迹级奖励,其余组件保持冻结,从而在网页导航、操作系统控制和工具使用等多个基准上实现了高效且稳定的长周期任务自动化提升。
链接: https://arxiv.org/abs/2605.02168
作者: Wenyi Wu,Sibo Zhu,Kun Zhou,Biwei Huang
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Language model (LM)-based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long-horizon planning and reasoning. To address this, we propose an enhanced multi-agent framework that decomposes automation into three roles: a planner for high-level decision-making, an actor for task execution, and a memory manager for contextual reasoning. While this modular decomposition aligns with established design patterns, our core contribution lies in a systematic compute-allocation analysis, revealing that planning is the dominant factor influencing task performance. Execution and memory management require significantly less compute and model capacity to achieve competitive results. Building on these insights, we introduce a planner-centric reinforcement learning approach, which exclusively optimizes the planner using trajectory-level rewards from a VLM-as-judge, while freezing the other components. Extensive experiments on benchmarks spanning web navigation, OS control, and tool use demonstrate that concentrating model capacity and learning on high-level planning yields robust and compute-efficient improvements in long-horizon agent automation. Our code is publicly released.
[MA-7] AAFLOW: Scalable Patterns for Agent ic AI Workflows
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)系统中代理工作流(Agentic Workflow)在可扩展性和可复现性方面的瓶颈问题,这些问题主要源于数据编排碎片化、序列化开销以及非确定性执行。其解决方案的关键在于提出 AAFLOW——一个统一的分布式运行时,通过将代理工作流建模为操作符抽象(operator abstraction),构建通信高效的执行计划;同时利用 Apache Arrow 和 Cylon 实现零拷贝数据平面,消除预处理、嵌入(embedding)和向量检索之间的序列化开销,并结合资源确定性调度与异步批处理机制降低协调成本,从而显著提升数据流效率与整体管道性能,实验表明可实现最高 4.64 倍的流水线加速和 2.8 倍的嵌入及插入阶段性能提升。
链接: https://arxiv.org/abs/2605.02162
作者: Arup Kumar Sarker,Mills Staylor,Aymen Alsaadi,Gregor von Laszewski,Shantenu Jha,Geoffrey Fox
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注: 10 pages, 8 Figures, 3 Tables. preprint for SC2026
Abstract:Agentic workflows in large language model systems integrate retrieval, reasoning, and memory, but existing frameworks suffer from scalability and reproducibility limitations due to fragmented data orchestration, serialization overhead, and non-deterministic execution. Although these frameworks increase flexibility, they don’t have a formal execution model that adheres to the principles of high-performance computing. We introduce AAFLOW, a unified distributed runtime that creates communication-efficient execution plans by modeling agentic workflows as an operator abstraction. Using Apache Arrow and Cylon, AAFLOW creates a zero-copy data plane that allows direct interoperability between preprocessing, embedding, and vector retrieval without the need for serialization overhead. To lower coordination costs, it uses resource-deterministic scheduling and asynchronous batching. While retaining comparable LLM generation throughput, experimental results demonstrate up to 4.64 times pipeline speedup and 2.8 times gains in embedding and upsert phases. Rather than LLM inference acceleration, these advantages result from enhanced data flow, batching, and communication efficiency.
[MA-8] Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition
【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中混合动机(mixed-motive)环境下的战略合竞(strategic coopetition)建模与评估难题,即如何在合作与竞争并存的复杂交互场景中构建可验证、可复现且结构清晰的基准平台。其解决方案的关键在于提出Coopetition-Gym v1,这是一个集成连续动作空间、参数化奖励互惠机制(parameterized reward mutuality)、校准的依存系数矩阵(calibrated interdependence matrix)、博弈论基准策略(game-theoretic oracle baselines)以及历史案例验证能力的综合性基准平台。该平台通过将收益结构(payoff structure)与奖励层(reward layer)解耦,支持奖励类型消融实验,并包含20个基于四篇基础技术报告设计的环境,其中4个经历史案例校准,重现率达81.7%–98.3%,从而实现了对合竞行为的系统性研究与算法对比分析。
链接: https://arxiv.org/abs/2605.02063
作者: Vik Pant,Eric Yu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 82 pages, 14 figures, 9 tables, 51 references. AI-track technical report companion to the four-paper foundational series; should be read with arXiv:2510.18802 , arXiv:2510.24909 , arXiv:2601.16237 , and arXiv:2604.01240 . Reproducibility package and source code: this https URL . Datasets released under CC-BY-4.0 at this https URL
Abstract:We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward-type ablation, the platform’s principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708-run training corpus and a 1,116-run behavioral audit corpus, both released under CC-BY-4.0 with Croissant 1.0 metadata. Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies. Comments: 82 pages, 14 figures, 9 tables, 51 references. AI-track technical report companion to the four-paper foundational series; should be read with arXiv:2510.18802, arXiv:2510.24909, arXiv:2601.16237, and arXiv:2604.01240. Reproducibility package and source code: this https URL. Datasets released under CC-BY-4.0 at this https URL Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) MSC classes: 68T05, 68T42, 91A06, 91A26 ACMclasses: I.2.6; I.2.11 Cite as: arXiv:2605.02063 [cs.MA] (or arXiv:2605.02063v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2605.02063 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-9] Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
【速读】:该论文旨在解决序列决策问题中因信用分配(credit assignment)困难而导致的性能下降问题,尤其在高阶语义选择与底层动作执行之间存在延迟和模糊反馈的场景下。其核心挑战在于区分是抽象层次的错误、执行策略的次优,还是二者交互导致的性能劣化。解决方案的关键在于提出一种基于语言驱动的分层强化学习框架,其中高阶抽象选择与低阶执行策略均由大语言模型(LLM)参数化,并通过纯提示(prompt)更新进行优化,无需梯度微调。该方法通过显式分离抽象选择与执行过程,降低跨层级非平稳性,在延迟反馈条件下实现针对性适应,从而提升整体决策质量。
链接: https://arxiv.org/abs/2605.01954
作者: Polydoros Giannouris,Yuechen Jiang,Lingfei Qian,Yuyan Wang,Xueqing Peng,Jimin Huang,Guojun Xiong,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学); The Fin AI; Harvard University (哈佛大学); Archimedes/Athena RC (阿基米德/雅典娜研究中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Many sequential decision-making problems exhibit hierarchical structure, where high-level semantic choices constrain downstream actions and feedback is delayed and ambiguous. Learning in such settings is challenging due to credit assignment: performance degradation may arise from flawed abstractions, suboptimal execution, or their interaction. We study this challenge through pair trading, a domain that naturally combines long-horizon semantic reasoning for asset pair selection with short-horizon execution under partial observability. We formulate pair trading as a hierarchical reinforcement learning problem and propose a language-driven optimization framework in which both high-level and low-level policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates. Our approach leverages pretrained LLMs as hierarchical policies and uses trajectory- and episode-level textual feedback to adapt abstractions and execution without gradient-based fine-tuning. By explicitly separating abstraction selection from execution, the framework reduces non-stationarity across hierarchical levels and enables targeted adaptation under delayed feedback. Experiments on real-world market data show consistent improvements over traditional and LLM-based baselines, demonstrating the effectiveness of language-driven hierarchical reinforcement learning.
[MA-10] A Language for Describing Agent ic LLM Contexts WWW
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)代理系统中上下文(context)设计缺乏标准化描述的问题。现有方法通常依赖非正式文本、临时图表或代码直接观察来传达LLM输入上下文的组成及其随时间演化的机制,无法精确刻画提示(prompt)在交互步骤中的变化过程,也无法清晰比较不同上下文表示策略的差异。解决方案的关键在于提出一种名为“代理上下文描述语言”(Agentic Context Description Language, ACDL)的形式化语言,它能够以结构化、可读且标准的方式定义LLM输入上下文的组成与动态特性,包括角色消息序列、动态内容、时间索引引用以及条件或迭代结构等核心要素,并支持可视化呈现。ACDL独立于具体实现,既可用于白板手绘交流,也可通过正式语法生成标准化文档,从而提升LLM系统设计的透明度和复现性。
链接: https://arxiv.org/abs/2605.01920
作者: Noga Peleg Pelc,Gal A. Kaminka,Yoav Goldberg
机构: Bar-Ilan University (巴伊兰大学); Ai2
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: 18 pages, 12 figures. Accepted at CAIS '26. Project page: this http URL
Abstract:Large language models are increasingly used within larger systems (“LLM agents”). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The design of the encoded information and its structure play a central role in the quality of the resulting system, leading to efforts spent on context engineering. It is therefore critical to communicate the composition of the LLM context in a system, and how it evolves over time. Yet, no standard exists for doing so: context construction is typically conveyed through informal prose, ad hoc diagrams, or direct inspection of code, none of which precisely capture how a prompt evolves across interaction steps or how two context representation strategies differ. To remedy this, we introduce the Agentic Context Description Language (ACDL), a language for specifying the structure and dynamics of LLM input contexts in a precise, readable, and standard manner, along with visualizations. ACDL provides constructs for specifying context aspects such as role message sequences, dynamic content, time-indexed references, and conditional or iterative structure, capturing the full architecture of a prompt independently of any particular implementation. ACDL diagrams can be hand drawn on a whiteboard, or written in formal language which can then be rendered. We describe the language, demonstrate it by documenting several existing systems and their variants, and encourage the community to adopt it for describing LLM systems context, both in day-to-day communication and in papers. Tooling, examples and documentation are available at this http URL.
[MA-11] Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决协作式多智能体强化学习(Cooperative Multi-Agent Reinforcement Learning, MARL)中因状态-动作空间组合爆炸导致的协调策略稀有性问题,以及内在动机(Intrinsic Motivation)在探索强度(exploration intensity, β)设置不当所引发的协调崩溃或难以发现稀有策略的问题。解决方案的关键在于:一是设计一种基于回报条件的Sigmoid调度函数(Return-conditioned Sigmoid Schedule, RCB),实现训练过程中全局探索强度β的动态自适应调整;二是引入每智能体的奖励信号质量(Reward Signal Quality, RSQ)指标,根据各智能体内在奖励信号的可靠性自动分配探索预算,从而抑制噪声信号驱动的过度探索。核心洞察是:接收噪声内在奖励的智能体应减少探索强度,该分配可由信号-噪声统计特性自动确定。通过采用继承距离(Successor Distance, SD)作为内在奖励机制,自然产生可区分的每智能体信号质量,最终在七个协作基准环境(MPE、SMAX、MABrax)上实现了卓越且稳定的性能表现。
链接: https://arxiv.org/abs/2605.01865
作者: Dahyun Oh,Minhyuk Yoon,H.Jin Kim
机构: Seoul National University (首尔国立大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Submitted to Neurocomputing
Abstract:Cooperative multi-agent reinforcement learning (MARL) requires agents to discover joint strategies in a combinatorially large state-action space, yet effective coordination configurations are exceedingly rare. Intrinsic motivation, which augments task rewards with novelty bonuses, is a popular approach for driving exploration, but its effectiveness hinges on the exploration intensity \beta , where too large a value overwhelms the task signal and causes coordination collapse, while too small a value prevents discovery of rare strategies. We address two complementary challenges: adapting \beta globally over training, and allocating the exploration budget across agents whose intrinsic reward signals vary in reliability. Our framework combines a return-conditioned sigmoid schedule (RCB) for global intensity control with a per-agent Reward Signal Quality (RSQ) metric that concentrates the exploration budget on agents with reliable signals. The core insight is that agents receiving noisy intrinsic rewards should explore less aggressively, and this allocation can be determined automatically from signal-to-noise statistics. Successor Distance (SD), a quasimetric intrinsic reward, naturally produces distinguishable per-agent signal quality, completing the framework with convergence and ordering preservation guarantees. On seven cooperative benchmarks (MPE, SMAX, MABrax), our method achieves top-tier returns across all environments.
[MA-12] MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-agent Reinforcement Learning
【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中如何设计有效学习信号以促进智能体间协调的问题。其核心挑战在于量化智能体之间的长期因果影响(long-term causal influence),从而引导协作行为。解决方案的关键在于提出一种名为多步优势门控干预因果MARL(MAGIC)的框架,该框架通过条件互信息(conditional mutual information)实现因果干预以度量多步因果影响,并引入基于优势的门控机制(advantage-based gating mechanism),将这些因果信号选择性地转化为内在奖励,从而引导探索聚焦于有益且目标对齐的行为。实验表明,MAGIC在多个标准MARL基准(如MPE和SMAC/SMACv2)上显著优于现有方法,主评价指标提升至少10.1%。
链接: https://arxiv.org/abs/2605.01805
作者: Haohan Yu,Jinmiao Cong,Shengzhi Wang,Lu Wang,Chanjuan Liu
机构: Dalian University of Technology (大连理工大学); Microsoft (微软)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注:
Abstract:A key challenge in multi-agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals necessitates the ability to quantify the true, long-term causal influence between agents. To address this, we introduce Multi-step Advantage-Gated Interventional Causal MARL (MAGIC), a framework that extracts multi-step causal influences between agents and selectively converts them into intrinsic rewards. MAGIC uses causal intervention with conditional mutual information to quantify long-horizon agent influence, and introduces an advantage-based gating mechanism to ensure exploration is directed toward beneficial, goal-aligned behaviors. Experiments across multiple standard MARL benchmarks and task families, including MPE and SMAC/SMACv2, demonstrate that MAGIC outperforms state-of-the-art methods by a significant margin, achieving an improvement of at least 10.1% in the main evaluation metric.
[MA-13] Koopman Representations for Early Outbreak Warning and Minimal Counterfactual Intervention in Multi-Agent Epidemic Simulations
【速读】:该论文旨在解决多智能体流行病模拟中早期暴发检测与干预策略选择的问题,尤其关注接近临界状态的流行病 regimes,其中微小的暴露变化或时间差异即可显著影响最终传播结果。解决方案的关键在于构建一个基于Koopman算子的框架,将早期轨迹窗口中的每日聚合观测值映射到低维Koopman隐空间,该空间中系统演化近似线性,从而支持短期预测和暴发风险估计;同时结合随机森林分类器,利用Koopman特征预测最终攻击率是否超过暴发阈值,实验证明该方法在系统临界点附近具有优异的早期预警性能,并通过反事实分析验证了最小干预(如单个个体居家一天)可有效降低攻击率并改变传播轨迹至安全区间。
链接: https://arxiv.org/abs/2605.01803
作者: Florin Leon
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: 37 pages, 12 figures
Abstract:This paper presents a Koopman-based framework for early outbreak detection and intervention selection in a multi-agent epidemic simulation. Agents exhibit mobility patterns, heterogeneous susceptibility, immunity-dependent viral load progression, and local transmission through co-location. The goal of the simulation is to study near-critical epidemic regimes in which small changes in exposure or timing can alter the final outcome. Aggregate daily observables from early trajectory windows are encoded into a low-dimensional Koopman latent space whose approximately linear evolution supports short-horizon forecasting and outbreak risk estimation. These representations are combined with a random forest classifier trained to predict whether the final attack rate exceeds a major outbreak threshold. Experiments near the system tipping points show strong early warning performance, with Koopman-derived features contributing to class separation. Counterfactual analysis further shows that minimal interventions, such as keeping a single selected agent at home for one day, can reduce attack rates and, often, shift the trajectory below the outbreak threshold.
[MA-14] alk is Cheap Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation
【速读】:该论文旨在解决多智能体大型语言模型(Large Language Model, LLM)在动态交互中缺乏有效协同机制的问题,特别是如何通过对话修复语义共享(grounding)失效的困境。现有基准测试多聚焦于静态、单轮任务,忽视了跨轮次沟通中意义协商与承诺维持的能力。其解决方案的关键在于构建一个迭代式多轮谈判游戏框架,其中两个智能体需基于可验证的最优结果分配共享资源完成私有项目;实验揭示出四大失败模式:(1)缺乏共享交互历史导致协调能力下降;(2)累积上下文可能因锚定效应成为负担;(3)对形式公平(如均分)的依赖抑制了收益最大化协作;(4)指称绑定失败使智能体无法追踪跨轮次承诺。研究进一步将协作差距分解为可测量成分:Oracle基线表明个体推理非瓶颈,无对话基线证明通信必要性,全透明干预则指出信息交换本身不足以实现高效协同——真正的瓶颈在于联合计划制定、承诺建立与执行过程所构成的动态接地(dynamic grounding)。
链接: https://arxiv.org/abs/2605.01750
作者: Yiheng Yao,Chelsea Zou,Robert D. Hawkins
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose. While static grounding maps language to a shared, externally observable context, dynamic grounding is a joint activity where meaning is negotiated through interaction. Current multi-agent Large Language Model (LLM) benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns. We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes. While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models. Our investigation reveals four failure modes: (1) coordination degrades when shared interaction history is absent; (2) yet accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable; (3) a reliance on perfunctory fairness (equal resource splits) over reward-maximizing coordination; and (4) failures in referential binding, where agents lose track of commitments across turns. These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination. Our framework decomposes the coordination gap into measurable components: the oracle baseline establishes that the gap is not attributable to individual reasoning limitations; the no-talk baseline establishes that communication is necessary; and a full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.
[MA-15] Architectural Obsolescence of Unhardened Agent ic-AI Runtimes
【速读】:该论文旨在解决代理型人工智能(Agentic AI)运行时在执行工具调用、发送消息及控制设备过程中,因行为偏离审计记录而导致的安全风险问题。具体而言,研究识别出四种关键的偏离模式:F1(门禁绕过)、F2(审计伪造)、F3(主机静默故障)和F4(目标错误),这些是保障运行时安全的核心属性。现有主流方案如OpenClaw无法检测任何一种偏差(召回率为0),而本文提出的enclawed-oss框架通过引入七种缺失的运行时结构——包括双条件检查器、哈希链式审计日志、扩展准入门控、双层出口防护、Bell-LaPadula分类策略、模块签名信任根和引导封印机制——实现了对F1–F4的完全检测(准确率与召回率均为1.000)。其核心创新在于结构性重构而非参数调整,证明了未加固的代理型AI运行时在架构上已过时,且存在一个可立即采用的更优替代方案。
链接: https://arxiv.org/abs/2605.01740
作者: Alfredo Metere
机构: Metere Consulting, LLC.
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:An agentic-AI runtime issues tool calls, sends messages, and actuates devices on behalf of an LLM. Catching the four ways an action can diverge from its audit record – F1 gate-bypass, F2 audit-forgery, silent host failure, F4 wrong-target, – is a load-bearing safety property of any such runtime. We show that upstream OpenClaw, the most engineered single-user agentic-AI gateway in public release, catches none of them: recall is 0.000 on every cell of every confusion matrix, on a 1600-sample template baseline through OpenClaw’s actual production command-line interface (CLI) and on a ten-LLM cross-model generalisation run. Detecting F1–F4 requires seven specific runtime structures absent from OpenClaw’s source tree: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. enclawed-oss – an MIT-licensed drop-in fork that ships all seven – reaches P = R = F_1 = accuracy = 1.000 on the same input. The gap is structural, not parametric: a six-line append-only widening of enclawed-oss’s data-loss-prevention (DLP) regex catalog raises per-channel F3 detection by 14.6% net at unchanged precision; the same edit on OpenClaw has nowhere to land. The harness deliberately exercises real Discord and Telegram channels – plugin categories the first enclawed release deleted as unsafe – to show F1–F4 detection extends to those previously-unsafe extensions. With architectural superiority for security and feature parity for extensions, we argue that unhardened agentic-AI runtimes are architecturally obsolete: a strictly better alternative exists, is adoptable today, and the gap requires re-architecture rather than configuration. We invite reviewers to apply the harness to any candidate runtime.
[MA-16] Distributed Algorithm with Emergent Area Partitioning and Base Stations Situation Awareness for Multi-Robot Patrolling
【速读】:该论文旨在解决多机器人巡逻中如何提升巡逻效率与基地站操作员态势感知能力的问题。解决方案的关键在于提出一种名为局部反应与分区(Local Reactive and Partition, LR-PT)的新型多机器人巡逻算法,其核心机制是:各机器人基于本地信息独立选择巡逻目标,并将巡逻需求与向基地站汇报任务进展的紧迫性统一纳入一个效用函数进行决策;同时,该算法能自主形成区域划分,避免陷入局部最优,从而实现对整个任务区域的全面覆盖,且在通信受限和机器人故障等情况下仍保持鲁棒性。
链接: https://arxiv.org/abs/2605.01501
作者: Kazuho Kobayashi,Shohei Kobayashi,Seiya Ueno,Takehiro Higuchi
机构: Yokohama National University (横滨国立大学); Yokohama National University (横滨国立大学)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:
Abstract:Patrolling with multiple robots offers efficient surveillance to detect and manage undesired situations. This necessitates improved patrol efficiency and operator situation awareness at base stations. Enhanced situation awareness enables operators to predict robots’ behaviors, support recognition and decision-making, and execute emergency interventions. This study presents the Local Reactive and Partition (LR-PT) algorithm, a novel multi-robot patrolling approach. In simulations, LR-PT outperformed existing methods by ensuring frequent patrols of all locations of interest and enhancing the situation awareness of the base station. Robots independently select patrol targets based on locally available information, integrating patrol needs and the urgency of reporting mission progress to the base station into a unified utility function. This locality also contributes to robustness against communication constraints and robot failures, as demonstrated in this research. The algorithm further autonomously emerged the area partition, which can avoid falling into local optima and realize the comprehensive patrol over the whole mission area. The simulation results demonstrated the superior performance of LR-PT for multi-robot patrolling, utilizing the advantages of swarm robotics and addressing real-world operational challenges.
[MA-17] LLM -Forag ing: Large Language Models for Decentralized Swarm Robot Forag ing
【速读】:该论文旨在解决群体觅食算法(如中央地点觅食算法,CPFA)在部署环境变化时性能下降的问题,这类算法通常依赖遗传算法(GA)或强化学习进行离线参数优化,导致策略与特定团队规模、场地大小和资源分布高度耦合,一旦环境改变便需重新训练,计算成本高昂。解决方案的关键在于提出一种无需训练的去中心化群集控制器 LLM-Foraging,其通过在 CPFA 状态机的三个结构化决策点(回巢后、到达中心区、搜索饥饿状态)引入大语言模型(LLM)作为战术决策模块,每个机器人本地运行 LLM 客户端并基于自身可观测状态进行查询,而 CPFA 的运动与感知模块执行所选动作。由于 LLM 提供的是通用决策策略而非针对单一配置的参数,因此该控制器在部署时无需训练,并能跨不同配置(包括团队规模、场地尺寸和资源分布)实现零样本迁移,显著提升了适应性和一致性。
链接: https://arxiv.org/abs/2605.01461
作者: Peihan Li,Joanna Gutierrez,Fabian Hernandez,Qi Lu,Lifeng Zhou
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:
Abstract:Swarm foraging algorithms, such as the central-place foraging algorithm (CPFA), typically rely on offline parameter optimization using genetic algorithms (GA) or reinforcement learning, yielding policies tightly coupled to a specific combination of team size, arena size, and resource distribution. When deployment conditions change, performance degrades, and retraining is computationally expensive. We propose LLM-Foraging, a decentralized swarm controller that augments the CPFA state machine with a large language model (LLM) tactical decision-maker at three structured decision points, namely post-deposit, central-zone arrival, and search starvation. Each robot runs its own LLM client and queries it using only locally observable state, while the existing CPFA motion and sensing stack executes the selected action. Because the LLM serves as a general decision policy rather than parameters fitted to a single configuration, the controller is training-free at deployment and transfers across configurations without re-optimization. We evaluate LLM-Foraging in Gazebo with TurtleBot3 robots across 36 configurations spanning team sizes of 4 to 10 robots, arena sizes from 6x6 to 10x10 meters, and three resource distributions (clustered, powerlaw, random). LLM-Foraging collects more resources than the GA-tuned CPFA baseline across the evaluated configurations and is more consistent, a property that the GA’s single-configuration tuning does not transfer.
[MA-18] rAIson: Developing Reliable Decision-Making Agents AAMAS2026
【速读】:该论文旨在解决复杂现实应用场景中自动化、可靠且可解释决策代理(decision-making agents)的开发难题。传统方法往往依赖大量手工编码,导致开发效率低、维护困难且缺乏透明性。解决方案的关键在于提出rAIson平台,这是一个高级别的技术环境,能够无需编写任何代码即可支持复杂应用的开发,从而显著提升开发效率与系统可解释性,同时保障决策过程的可靠性。
链接: https://arxiv.org/abs/2605.01351
作者: Pavlos Moraitis,Nikolaos Spanoudakis,Antonis Kakas
机构: Université Paris Cité (巴黎大学); Hellenic Mediterranean University (希腊地中海大学); University of Cyprus (塞浦路斯大学)
类目: Multiagent Systems (cs.MA)
备注: Accepted as demonstration paper for publication at AAMAS 2026
Abstract:This paper presents the rAIson platform, a high-level technological environment for the development of automated, reliable and explainable decision-making agents. The research underlying the platform and its technological progress has now reached a mature stage that allows the platform to be used for the development of complex real-life applications without writing a single line of code.
[MA-19] When Embedding-Based Defenses Fail: Rethinking Safety in LLM -Based Multi-Agent Systems
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)驱动的多智能体系统(Multi-Agent System, MAS)中因智能体间通信而引入的安全问题,即恶意智能体通过传播虚假信息操纵群体决策,从而破坏系统的可靠性。现有基于嵌入(embedding)的防御方法依赖于区分恶意与良性消息的文本嵌入空间,但攻击者可通过构造嵌入接近良性消息的误导性内容绕过此类检测。论文指出,这类防御的根本局限在于忽略了token级置信度信号(如logits),即使在嵌入不可区分的情况下,这些信号仍可能保留可判别性。解决方案的关键在于利用置信度分数对通信中的消息进行剪枝或降权,从而提升系统鲁棒性;实验表明该策略在不同模型、数据集和通信拓扑下均有效,且强调了早期干预的重要性,以应对置信度随通信轮次衰减的问题。
链接: https://arxiv.org/abs/2605.01133
作者: Lingxi Zhang,Guangtao Zheng,Hanjie Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks, Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. This insights can inform and inspire future work on MAS attacks and defenses.
[MA-20] Governing What the EU AI Act Excludes: Accountability for Autonomous AI Agents in Smart City Critical Infrastructure
【速读】:该论文旨在解决欧盟人工智能法案(EU AI Act)在多主体协同的智能基础设施系统中导致的问责缺口问题,尤其是在交通信号控制与电网管理等跨部门自治系统交互时,居民难以获得解释且缺乏明确责任主体的困境。其解决方案的关键在于提出AgentGov-SC三层次治理架构(Agent、Orchestration、City),包含25项治理措施,并通过双向可追溯性对接EU AI Act、ISO/IEC 42001和NIST AI风险管理框架;同时引入五条冲突解决规则和一个基于自主性的激活模型,实现对多智能体城市场景下AI决策链条的有效监管与责任分配。
链接: https://arxiv.org/abs/2605.01091
作者: Talal Ashraf Butt,Muhammad Iqbal,Razi Iqbal
机构: Higher Colleges of Technology (阿联酋高等教育学院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 24 pages, 3 figures, 8 tables. Submitted to Computer Law Security Review
Abstract:When a traffic signal controller adjusts green phases and a grid manager curtails power on the same corridor, each system may comply with its own obligations. The resident who suffers the combined effect has no single authority to hold accountable and, under the EU AI Act, limited means to obtain an explanation. Annex III, point 2 excludes safety-component AI in critical infrastructure from Article 86 explanation rights and Article 27 fundamental-rights impact assessment. Provider and deployer duties under Articles 9-15 still apply, and residual pathways under the GDPR, NIS2, and tortious liability offer partial coverage. The Act’s principal resident-facing accountability instruments are nonetheless narrowed for the autonomous infrastructure systems most likely to interact across agencies. The paper traces this accountability deficit through four residual pathways (GDPR Article 22, GDPR transparency obligations, tortious liability, and NIS2) and shows that each is structurally bounded by individual-controller, individual-decision scope. As a governance response, it presents AgentGov-SC, a three-layer architecture (Agent, Orchestration, City) specifying 25 governance measures with bidirectional traceability to the EU AI Act, ISO/IEC 42001, and the NIST AI Risk Management Framework. Five conflict resolution rules and an autonomy-calibrated activation model complete the design. A scenario analysis traces governance activation through a multi-agent corridor cascade involving three documented UAE smart-city systems, with a contrasting single-system scenario confirming proportional activation. The paper contributes a regulatory gap analysis and governance architecture for an increasingly important class of urban AI deployment that existing frameworks treat as bounded and isolated.
[MA-21] Separation Assurance between Heterogeneous Fleets of Small Unmanned Aerial Systems via Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决未来密集城市空域中,多个公司运营异构小型无人航空系统(sUAS)机队时,因各机队内部同质但跨机队异构而导致的战术避撞策略难以收敛与公平性问题。核心挑战在于:如何在保障空域安全的前提下,使不同配置的sUAS机队达成冲突避免策略的均衡,并防止强配置机队对弱配置机队产生不公平优势。解决方案的关键在于采用一种基于注意力机制的近端策略优化优势Actor-Critic(PPOA2C)多智能体强化学习框架,使各机队独立训练策略并保持隐私,同时通过实验验证了该框架可实现跨机队策略均衡,且具备对规则基策略的安全适应能力;然而进一步分析表明,即使在相似配置下,策略类型差异仍可能导致均衡偏向某一类政策,凸显了未来需引入公平性感知的冲突管理机制以保障异构sUAS系统的协同效率与公平性。
链接: https://arxiv.org/abs/2605.01041
作者: Iman Sharifi,Hyeong Tae Kim,Maheed Hatem Ahmed,Mahsa Ghasemi,Peng Wei
机构: George Washington University (乔治华盛顿大学); Purdue University (普渡大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 3 figure, 1 table
Abstract:In the envisioned future dense urban airspace, multiple companies will operate heterogeneous fleets of small unmanned aerial systems (sUASs), where each fleet includes several homogeneous aircraft with identical policies and configurations, e.g., equipage, sensing, and communication ranges, making tactical deconfliction highly complex for the aircraft. This paper aims to address two core questions: (1) Can tactical deconfliction policies converge or reach an equilibrium to ensure a conflict-free airspace when companies operate heterogeneous fleets of homogeneous aircraft? (2) If so, will the converged policies discriminate against companies operating sUASs with weaker configurations? We investigate a multi-agent reinforcement learning paradigm in which homogeneous aircraft within heterogeneous fleets operate concurrently to perform package delivery missions over Dallas, Texas, USA. An attention-enhanced Proximal Policy Optimization-based Advantage Actor-Critic (PPOA2C) framework is employed to resolve intra- and inter-fleet conflicts, with each fleet independently training its own policy while preserving privacy. Experimental results show that two fleets with distinct, shared PPOA2C policies can reach an equilibrium to maintain safe separation. While two PPOA2C policies outperform two strong rule-based baselines in terms of conflict resolution, a PPOA2C policy exhibits safer interaction with a rule-based policy, indicating adaptive capabilities of PPOA2C policies. Furthermore, we conducted extensive policy-configuration evaluations, which reveal that equilibria between similar policy types tend to favor fleets with stronger configurations. Even under similar configurations but different policy types, the equilibrium favors one of the heterogeneous policies, underscoring the need for fairness-aware conflict management in heterogeneous sUAS operations.
[MA-22] Breaking the Communication-Accuracy Trade-off: A Sparsified Information Diffusion Framework for Multi-Agent Collaborative Perception
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems)在协同感知过程中通信负担过重的问题,尤其是在协同目标跟踪任务中如何在不牺牲跟踪性能的前提下降低数据传输量。解决方案的关键在于提出一种基于事件触发(Event-Triggered, ET)机制的快速且高精度扩散滤波算法(EDC-CIF),其核心创新包括:一是采用误差最小化的事件触发立方信息滤波器(Error-Minimized ET Cubature Information Filter, EDC-CIF)进行局部估计,以提升跟踪精度并减少不必要的通信;二是引入相关性感知的扩散策略(Correlation-Aware Diffusion Strategy)实现全局融合,从而优化信息共享效率。实验结果表明,该方法在显著提高通信效率的同时,有效降低了估计误差并加快了收敛速度,具备良好的可扩展性。
链接: https://arxiv.org/abs/2605.00946
作者: Jirong Zha,Chenyu Zhao,Nan Zhou,Zhenyu Liu,Tao Sun,Bin Zhang,Xiaochun Zhang,Xinlei Chen
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Shenzhen Institute of Artificial Intelligence and Robotics (深圳人工智能与机器人研究院); Shenzhen Smart City Technology Development Group (深圳智慧城市科技发展集团)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:The growing relevance of multi-agent systems has drawn increasing focus on communication-efficient filters for collaborative perception to alleviate the system’s communication burden. While the event-triggered (ET) mechanism can improve communication efficiency in collaborative state estimation, an inevitable trade-off exists between estimation accuracy and communication cost in ET filters. This paper proposes a fast and accurate ET diffusion-based filter for real-time multi-agent collaborative target tracking, aiming to reduce the system’s data transmission without compromise in tracking performance. The proposed filter achieves improved tracking accuracy, reduced data transmission, and accelerated convergence using an error-minimized ET cubature information filter (CIF) for local estimation, and a correlation-aware diffusion strategy for global fusion. The experimental results confirm the scalability of the proposed EDC-CIF algorithm and demonstrate its efficacy in simultaneously reducing estimation error and computation time while significantly enhancing communication efficiency.
[MA-23] he Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
【速读】:该论文旨在解决多智能体辩论(multi-agent debate)在高难度任务中是否能有效过滤大语言模型(LLM)幻觉的问题,特别是针对同质化团队(homogeneous teams)的失败机制缺乏系统理解这一关键瓶颈。研究通过控制实验揭示了三种模型依赖性的失败路径:谄媚式从众(sycophantic conformity)、上下文脆弱性(contextual fragility)和共识崩溃(consensus collapse),并发现即使在最小交互密度(K=2)下,从众行为仍可高达85.5%,且辩论过程显著增加token消耗(2.1–3.4倍)而未带来准确率提升。其核心解决方案是提出“孤立自我修正”(isolated self-correction)作为更优替代策略,在相同或更高准确率下实现更好的成本-精度权衡,从而质疑无结构化同伴交流在同质群体中的有效性。
链接: https://arxiv.org/abs/2605.00914
作者: Blaž Bertalanič,Carolina Fortuna
机构: Jožef Stefan Institute (乔泽夫·斯特凡研究所)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 19 pages, ACM Conference on AI and Agentic Systems
Abstract:Multi-agent debate, where teams of LLMs iteratively exchange rationales and vote on answers, is widely deployed under the assumption that peer review filters hallucinations. Yet the failure dynamics of homogeneous debate remain poorly understood, therefore we report findings from a controlled empirical study of teams of N=10 homogeneous agents (Qwen2.5-7B, Llama-3.1-8B, Ministral-3-8B) across R=3 debate rounds on two high-difficulty benchmarks (GSM-Hard and MMLU-Hard). We compare peer debate against isolated self-correction and a stochastic noise control that injects rationales from unrelated problems. We decompose debate failure into three model-dependent pathways: sycophantic conformity, where agents uncritically adopt majority answers (modal adoption up to 85.5%); contextual fragility, where peer rationales destabilize previously correct reasoning (vulnerability rate up to 70.0%); and consensus collapse, where plurality voting discards correct answers already present in the generation pool (oracle gap up to 32.3 percentage points). Ablations over communication density ( K \in \2,4,9\ ) and sampling temperature ( T \in \0.4, 0.7\ ) show that conformity reaches high levels at minimal peer exposure ( K=2 ) and intensifies with greater initial diversity. Across all configurations, debate consumes 2.1-3.4 \times more tokens (up to 28,631 tokens per problem) than self-correction for equal or lower accuracy. Our results indicate that, within the 7-8B parameter class, homogeneous teams without structured roles do not benefit from unguided peer exchange, and that isolated self-correction consistently offers a more favorable cost-accuracy tradeoff.
[MA-24] ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations
【速读】:该论文旨在解决生成式 AI(Generative AI)在临床诊断场景中因幻觉问题导致的准确性不足,以及现有检索增强生成(Retrieval-Augmented Generation, RAG)系统对证据处理过于平均化、无法体现临床重要性的问题。其解决方案的关键在于:首先,将临床指南结构化提取为语义单元(如推荐意见、表格、定义和叙述文本),并保留明确的来源信息;其次,基于临床意义和指南结构而非文本相似性对证据进行分层排序;最后,通过一个基于网页的界面提供简洁、可操作且可验证的答案,从而实现高可信度的临床支持。
链接: https://arxiv.org/abs/2605.00846
作者: Navapat Nananukul,Mayank Kejriwal
机构: USC Information Sciences Institute (南加州大学信息科学研究所)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Clinical diagnosis requires answers that are accurate, verifiable, and explicitly grounded in official guidelines. While large language models excel at natural language processing, their tendency to hallucinate undermines their utility in high-stakes medical contexts where precision is essential. Existing retrieval-augmented generation (RAG) systems treat all evidence equally, producing noisy context and generic answers misaligned with clinical practice. We present ClinicBot, an AI system that translates guideline recommendations into trustworthy clinical support through three key advances: (1) structured extraction of clinical guidelines into semantic units (recommendations, tables, definitions, narrative) with explicit provenance, (2) evidence prioritization that ranks content by clinical significance and guideline structure rather than textual similarity, and (3) a web-based interface that presents concise, actionable answers with verifiable evidence. We will demonstrate ClinicBot using diabetes questions from real patients and an additional diabetes risk assessment tool that is faithful to the American Diabetes Association (ADA) Standards of Care in Diabetes (2025). The demonstration will illustrate how semantic knowledge extraction and hierarchical evidence ranking can reliably operate in a multi-agent setting to process complex clinical guidelines at scale.
[MA-25] HepScript: A Dual-Use DSL for Human-AI Collaborative Data Analysis Workflows in High-Energy Physics
【速读】:该论文旨在解决高能物理(High-Energy Physics, HEP)领域中因数据规模激增而带来的分析效率瓶颈问题,特别是传统大型语言模型(Large Language Models, LLMs)在处理依赖深度领域知识且与实验特定代码库强耦合的复杂科学工作流时表现不佳的挑战。解决方案的关键在于提出一种名为HepScript的双用途领域特定语言(Domain-Specific Language, DSL),它通过抽象HEP分析逻辑为受限语法结构,既保持对人类专家的直观性,又具备AI代理可靠生成的能力。HepScript作为统一的形式化接口,将高层次分析意图转化为生产就绪代码,并显著减少人工编写代码量(降低93%),同时其受限文法定义了可计算的动作空间,使AI代理能够直接从已发表文献中自主生成核心分析阶段的可执行规范,成功率高达95%,从而构建了一个高效、可扩展的人机协同分析系统。
链接: https://arxiv.org/abs/2605.01423
作者: Junkun Jiao,Tong Liu,Ke Li,Weimin Song,Yipu Liao,Bolun Zhang,Beijiang Liu,Chang-Zheng Yuan,Yue Sun
机构: 未知
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:The escalating data scale in High-Energy Physics (HEP) fuels a growing aspiration for higher analytical efficiency. While Large Language Models (LLMs) offer a path toward automation via agentic AI, they struggle with complex scientific workflows that require deep domain knowledge and are tightly coupled to experiment-specific codebases. To address this, we introduce a methodology centered on HepScript, a dual-use Domain-Specific Language (DSL) for HEP data analysis workflows. HepScript serves as a shared formal interface, abstracting HEP analysis logic into a constrained syntax that is both intuitive for human experts and reliably generable by AI agents. First developed for the Beijing Spectrometer III (BESIII) experiment, HepScript hides the complexity of the underlying software stack, translating high-level analysis intent into low-level, production-ready code. In our case studies, this abstraction reduces the required human-written code by 93%. Crucially, HepScript’s constrained grammar defines a tractable action space, enabling AI agents to autonomously generate executable specifications for core analysis stages directly from published literature with a 95% success rate. Our work demonstrates a scalable pathway toward human-AI collaborative systems, where a formally specified DSL acts as an unambiguous translation layer between human expertise, AI automation, and production environment, rendering previously intractable automation problems solvable.
自然语言处理
[NLP-0] SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
【速读】: 该论文旨在解决生成式 AI(Generative AI)中大语言模型(Large Language Model, LLM)推理速度慢的问题,特别是针对推测解码(Speculative Decoding)过程中固定推测长度(speculation length γ)导致的效率瓶颈。现有方法普遍采用固定 γ(如4),但实证表明最优 γ 值随任务类型和目标模型压缩级别(如FP16、INT8、NF4)变化显著。解决方案的关键在于提出一种轻量级自适应控制器 SpecKV,其通过分析草案模型(draft model)自身的输出信号(包括接受率、熵和置信度)动态调整每步的 γ 值,利用一个小型多层感知机(MLP)建模这些信号与预期每步产出 token 数之间的关系,从而最大化推理吞吐量。实验表明,SpecKV 在保持极低决策开销(0.34 ms/决策,占单步时间的0.5%)的前提下,相较固定 γ=4 的基线提升达56.0%,且统计显著(p < 0.001)。
链接: https://arxiv.org/abs/2605.02888
作者: Shikhar Shukla
机构: University of Kentucky (肯塔基大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
备注: 11 pages, 8 figures, 7 tables. Code and data available at: this https URL
Abstract:Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length~ \gamma , which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed~ \gamma (typically~4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present \textbfSpecKV, a lightweight adaptive controller that selects~ \gamma per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4~task categories, 4~speculation lengths, and 3~compression levels (FP16, INT8, NF4), collecting 5,112 step-level records with per-step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal~ \gamma shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation~ \approx 0.56 ). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed- \gamma =4 baseline with only 0.34,ms overhead per decision ( 0.5% of step time). The improvement is statistically significant ( p 0.001 , paired bootstrap test). We release all profiling data, trained models, and notebooks as open-source artifacts.
[NLP-1] FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
【速读】: 该论文旨在解决文本到SQL(Text-to-SQL)在大型分析型数据库中面临的三大挑战:复杂模式(schema)的导航、模糊查询的歧义解析,以及决策需基于实际数据进行验证的问题。现有系统通常采用固定流水线,仅在初始阶段检索模式元素,后续修复依赖于事后修正,难以从早期错误中恢复。其解决方案的关键在于提出FlexSQL——一个具备灵活数据库交互能力的Text-to-SQL代理,允许在推理过程中任意时刻探索模式结构、检查数据值并执行验证查询;同时通过生成多样化的执行计划覆盖多种查询解释,并结合代码级与计划级两层修复机制实现回溯式纠错,从而显著提升准确率和鲁棒性。
链接: https://arxiv.org/abs/2605.02815
作者: Quang Hieu Pham,Yang He,Ping Nie,Canwen Xu,Davood Rafiei,Yuepeng Wang,Xi Ye,Jocelyn Qiaochu Chen
机构: University of Alberta (阿尔伯塔大学); Simon Fraser University (西蒙菲莎大学); University of Waterloo (滑铁卢大学); Snowflake (雪花公司); Princeton University (普林斯顿大学); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, limiting recovery from early mistakes. We present FlexSQL, a text-to-SQL agent whose core design principle is flexible database interaction: the agent can explore schema structure, inspect data values, and run verification queries at any point during reasoning. FlexSQL generates diverse execution plans to cover multiple query interpretations, implements each plan in either SQL or Python depending on the task, and uses a two-tiered repair mechanism that can backtrack from code-level errors to plan-level revisions. On Spider2-Snow, using gpt-oss-120b, FlexSQL achieves a 65.4% score, outperforming strong open-source baselines that use stronger, larger models such as gpt-o3 and DeepSeek-R1. When integrated into a general-purpose coding agent (as skills in Claude Code), our approach yields over 10% relative improvement on Spider2-Snow. Further analysis shows that flexible exploration and flexible execution jointly contribute to the effectiveness of our approach, highlighting flexibility as a key design principle. Our code is available at: this https URL
[NLP-2] Reinforcement Learning for LLM -based Multi-Agent Systems through Orchestration Traces
【速读】: 该论文旨在解决大规模语言模型(Large Language Model, LLM)代理从孤立工具使用者向协同团队演进过程中,强化学习(Reinforcement Learning, RL)如何同时优化个体行为与多代理协作流程的问题。其核心挑战在于,RL不仅要决定单个代理的动作,还需优化任务的生成、委派、通信、聚合及终止等协调机制。解决方案的关键在于引入“编排轨迹”(orchestration traces)——一种包含子代理创建、委派、通信、工具调用、返回、聚合和停止决策等事件的时间交互图谱,并基于此识别出三个技术轴:奖励设计(涵盖8类编排奖励,如并行加速、拆分正确性与聚合质量)、信用分配(涉及从token到团队级别的8种信号单位,但消息级反事实信用仍稀缺)、以及编排学习的五项子决策(何时spawn、委派对象、通信方式、聚合策略、何时停止)。值得注意的是,在截至2026年5月4日的文献池中,尚未发现针对“何时停止”这一子决策的显式强化学习训练方法,这揭示了当前学术研究与工业实践(如Kimi Agent Swarm、OpenAI Codex、Anthropic Claude Code)之间的规模差距,而该差距源于部署范围与公开评估范式的差异,而非工业训练轨迹的不可验证性。
链接: https://arxiv.org/abs/2605.02801
作者: Chenchen Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at this https URL, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.02801 [cs.CL] (or arXiv:2605.02801v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.02801 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-3] FunFuzz: An LLM -Powered Evolutionary Fuzzing Framework
【速读】: 该论文旨在解决生成式 AI (Generative AI) 驱动的模糊测试(fuzzing)中因提示词(prompt)初始化敏感性和采样方差导致的探索效率低下及输入冗余问题。解决方案的关键在于提出一种多岛进化模糊测试框架 FunFuzz,其通过并行运行多个隔离的搜索进程(islands),周期性迁移高价值候选输入以维持种群多样性;同时基于文档自动构建初始提示词,并结合反馈引导的提示自适应机制持续优化生成策略;在测试过程中,利用增量编译器覆盖率优先选择候选输入,并借助编译器内部错误信号识别引发崩溃的输入,从而显著提升编译器模糊测试中的覆盖范围和独特失败触发能力。
链接: https://arxiv.org/abs/2605.02789
作者: Mario Rodríguez Béjar,B. Romera-Paredes,Jose L. Hernández-Ramos
机构: Universidad de Murcia (穆尔西亚大学); Hiverge (Hiverge)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 19 pages, 12 figures, 12 tables
Abstract:Modern fuzzers increasingly use Large Language Models (LLMs) to generate structured inputs, but LLM-driven fuzzing is sensitive to prompt initialization and sampling variance, which can reduce exploration efficiency and lead to redundant inputs. We present FunFuzz, a multi-island evolutionary fuzzing framework that runs several isolated searches in parallel and periodically migrates high-value candidates to maintain diversity. FunFuzz derives initial generation prompts from documentation and initializes islands with topic-specific instructions, then continuously adapts prompts using feedback-guided selection. During fuzzing, candidates are prioritized by incremental compiler coverage, while compiler-internal failure signals are used to identify crash-inducing inputs. We evaluate FunFuzz on compiler fuzzing, where inputs are source programs and success is measured by compiler coverage and unique compiler-internal failures. Across repeated 24-hour campaigns on GCC and Clang, FunFuzz achieves higher compiler coverage than previous LLM-driven baselines and discovers more unique failure-triggering inputs.
[NLP-4] When Audio-Language Models Fail to Leverag e Multimodal Context for Dysarthric Speech Recognition
【速读】: 该论文旨在解决当前自动语音识别(ASR)系统在非典型发音(如构音障碍,dysarthric speech)场景下性能脆弱的问题。现有音频-语言模型虽具备通过推理时引入临床上下文信息来提升识别准确率的潜力,但其是否能有效利用此类信息尚不明确。解决方案的关键在于构建一个基于Speech Accessibility Project (SAP) 数据集的基准测试框架,系统评估诊断标签、临床评分及多层级临床描述对识别准确率的影响,并进一步提出一种基于LoRA(Low-Rank Adaptation)的上下文感知微调方法,结合多种临床提示格式,在保留无上下文场景性能的前提下,实现词错误率(WER)显著降低(相对减少52%),尤其在唐氏综合征和轻度严重程度的说话者子群体中表现突出。
链接: https://arxiv.org/abs/2605.02782
作者: Pehuén Moure,Niclas Pokel,Bilal Bounajma,Yingqiang Gao,Roman Boehringer,Longbiao Cheng,Shih-Chii Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
[NLP-5] Mitigating Misalignment Contagion by Steering with Implicit Traits
【速读】: 该论文旨在解决多智能体场景下语言模型(Language Models, LMs)因交互导致的对齐失效传播问题,即“对齐传染”(misalignment contagion)现象——当多个LM在多轮对话式社会困境博弈中互动时,原本对齐的行为倾向会逐渐向反社会方向偏移,尤其在其他参与者被引导为恶意行为时更为显著。解决方案的关键在于提出一种基于隐式特质(implicit traits)的引导策略:通过间歇性注入强化模型初始特质的系统提示语,而非简单重复原始系统提示,从而更有效地维持模型初始的亲社会行为。该方法无需访问模型参数或内部状态,适用于黑盒模型部署场景下的复杂多智能体流程设计。
链接: https://arxiv.org/abs/2605.02751
作者: Maria Chang,Ronny Luss,Miao Lui,Keerthiram Murugesan,Karthikeyan Ramamurthy,Djallel Bouneffouf
机构: IBM Research (IBM 研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM’s system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black box models.
[NLP-6] Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
【速读】: 该论文旨在解决如何利用大规模真实世界数据(Real-World Data, RWD)构建可泛化、高精度的医疗基础模型(Healthcare Foundation Models),以提升疾病预测、费用预测及真实世界证据(Real-World Evidence, RWE)分析的能力。其核心解决方案是提出ReClaim——一个从零训练的生成式Transformer模型,基于438亿条来自2亿多参保人的MarketScan行政索赔数据(2008–2022年),建模诊断、操作、药物使用和支出的纵向轨迹。关键创新在于通过大规模参数扩展(1.7B参数)和预训练-微调范式,显著提升了罕见病预测性能(平均AUC达75.6%),并验证了模型在跨时间周期与独立数据集上的泛化能力,同时改善了医疗支出预测解释方差(从0.28升至0.37)和目标试验模拟中的系统性偏差(减少72%),从而确立了行政索赔数据作为医疗基础模型可靠训练底座的价值。
链接: https://arxiv.org/abs/2605.02740
作者: Fan Ma,Yuntian Liu,Xiang Lan,Weipeng Zhou,Jun Ni,Mauro Giuffrè,Lingfei Qian,Xueqing Peng,Yujia Zhou,Ruey-Ling Weng,Huan He,Lu Li,Qingyu Chen,Andrew Loza,Laila Rasmy,Degui Zhi,Yuan Lu,Chenjie Zeng,Joshua C Denny,Lee Schwamm,Daniella Meeker,Lucila Ohno-Machado,Yong Chen,Hua Xu
机构: MarketScan
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.
[NLP-7] PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature
【速读】: 该论文旨在解决眼科领域视觉-语言模型(Vision-Language Models)开发中缺乏大规模、高质量图像-文本数据集的问题。现有数据集在图像分辨率、结构化信息完整性及标注精度方面存在不足,限制了模型性能提升。其解决方案的关键在于构建PubMed-Ophtha数据集,该数据集包含102,023个眼科图像-标题对,源自PubMed Central中的15,842篇开放获取文章;通过从PDF原文中直接提取全分辨率图像并分解为独立图板块(panel)、标注成像模态(如彩色眼底照相、光学相干断层扫描等)和标记状态(mark status),同时利用两阶段大语言模型(LLM)方法将图注拆分为面板级子图注,实现了高精度的图像与文本对齐,其中面板检测和图像检测的mAP@0.50分别达到0.909和0.892,图注分割BLEU得分达0.913,且整体图像提取的中位交并比(IoU)高达0.997,显著提升了数据质量与结构化程度,从而为后续研究提供可靠基础。
链接: https://arxiv.org/abs/2605.02720
作者: Verena Jasmin Hallitschke,Carsten Eickhoff,Philipp Berens
机构: University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 12 pages, 4 figures, 3 supplementary figures. Dataset available at this https URL . Code available at this https URL
Abstract:Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality – color fundus photography, optical coherence tomography, retinal imaging, or other – and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
[NLP-8] mdok-style at SemEval-2026 Task 10: Finetuning LLM s for Conspiracy Detection SEMEVAL-2026 ACL2026
【速读】: 该论文旨在解决社交媒体评论中阴谋论信念的检测问题(conspiracy detection),即判断一条Reddit评论是否表达了阴谋论观点。其解决方案的关键在于采用基于数据增强和自训练(self-training)的方法,对Qwen3-32B大语言模型进行微调,以应对训练数据量较小的挑战;该方法源自机器生成文本检测领域,但实验证明其在阴谋论识别任务中同样有效,最终系统在SemEval-2026 Task 10中排名85百分位(共52个提交)。
链接: https://arxiv.org/abs/2605.02712
作者: Dominik Macko
机构: Kempelen Institute of Intelligent Technologies(凯姆佩伦智能技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the SemEval-2026 workshop of the ACL 2026 conference
Abstract:SemEval-2026 Task 10 is focused on conspiracy detection. Specifically, the goal is to detect whether a Reddit comment expresses a conspiracy belief. Our submitted mdok-style system utilizes data augmentation and self-training (to cope with a rather small amount of training data) to finetune the Qwen3-32B model for a binary text-classification task. The submitted system is very competitive, ranking in the 85th percentile (8th out of 52 submissions). The results shown that our approach, which originated in machine-generated text detection, can be used for conspiracy detection as well.
[NLP-9] mdok-style at SemEval-2026 Task 9: Finetuning LLM s for Multilingual Polarization Detection SEMEVAL-2026 ACL2026
【速读】: 该论文旨在解决多语言在线极化(multilingual polarization)的检测问题,其核心挑战在于识别跨语言、跨文化及多事件场景下的极化现象,并在三个维度(检测、类型与表现形式)上进行细粒度建模。为应对这一问题,研究者采用参数高效微调技术QLoRA对中等规模大语言模型(LLM)进行序列分类任务的微调,同时通过引入匿名化、大小写变换和同形异义字符(homoglyph)增强的数据扩增策略,提升了模型在22种语言上的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2605.02695
作者: Dominik Macko,Alok Debnath,Jakub Simko
机构: Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia; ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the SemEval-2026 workshop of the ACL 2026 conference
Abstract:SemEval-2026 Task 9 is focused on multilingual polarization detection. Specifically, it covers the identification of multilingual, multicultural and multievent polarization along three axes (in subtasks), namely detection, type, and manifestation. Online polarization presents a concern, because it is often followed by hate speech, offensive discourse, and social fragmentation. Therefore, its detection before it escalates is crucial for a safer and more inclusive online space. We have coped with this SemEval task by finetuning mid-size LLMs for the sequence-classification task using the QLoRA parameter-efficient finetuning technique. The training data augmented the multilingual (22 languages) training sets by anonymized, lower-cased, upper-cased, and homoglyphied counterparts, making the detection more robust.
[NLP-10] Fuzzy Fingerprinting Encoder Pre-trained Language Models for Emotion Recognition in Conversations: Human Assessment and Validity Study
【速读】: 该论文旨在解决情感识别在对话(Emotion Recognition in Conversations, ERC)任务中,基于预训练语言模型(Pre-trained Language Models, PLMs)的分类决策缺乏可解释性、且在类别不平衡数据集上易将少数类情感误判为中性情绪的问题。其解决方案的关键在于引入一种新颖的可解释方法——模糊指纹(Fuzzy Fingerprints, FFP),通过在PLM隐空间中构建反映特定情感激活模式的类属原型(class-specific prototypes),利用对训练样本中对话上下文嵌入的排序与模糊化处理得到FFP;推理时,输入话语同样被模糊指纹化,并通过基于模糊集合交集聚合的相似度函数与各情感原型匹配,从而实现高精度且具备可解释性的ERC分类。
链接: https://arxiv.org/abs/2605.02665
作者: Patrícia Pereira,Helena Moniz,Joao Paulo Carvalho
机构: Inesc-ID (INESC ID); Universidade de Lisboa (University of Lisbon)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In Emotion Recognition in Conversations (ERC), model decisions should align with nuanced human perception and ideally provide insights on the classification process. Standard encoder pre-trained language models (PLMs) are the state-of-the-art at these tasks but offer little insight into why a certain prediction is made. This is especially problematic in imbalanced datasets, where most utterances are labeled as neutral, making these models frequently misclassify minority emotions as the majority neutral class. To tackle this issue, we introduced a novel, interpretable approach to ERC by combining PLMs with Fuzzy Fingerprints (FFPs). FFP provide class-specific prototypes that reflect the characteristic class activation patterns in the PLM’s latent space. They are derived by ranking and fuzzifying the activations of the pooled conversational context-dependent embeddings across training instances for each emotion. At inference time, each input utterance is similarly fuzzy fingerprinted and matched to the emotion prototypes using a fuzzy similarity function based on the aggregation of the intersection of the fuzzy sets that define each FFP. Experimental results show that FFP integration reduces overclassification into the neutral class and human evaluation further supports the adequacy of FFP predictions. Our proposed method thus bridges the gap between deep neural inference and human perception, performing at state-of-the-art level while simultaneously offering valuable insights into the classification procedure.
[NLP-11] ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)方面面临的上下文诱导型越狱攻击(contextual jailbreak attacks)问题,即通过多轮对话中的隐式提示(contextual priming)引导模型产生有害响应。传统红队测试方法主要局限于单轮静态提示优化,难以有效探索复杂、动态的对话场景中引发模型合规行为的机制。本文提出ContextualJailbreak,一种基于进化搜索的黑盒红队策略,其核心创新在于:在模拟的多轮对话环境中对提示进行演化搜索,并利用两级裁判系统提供的0–5分危害评分作为闭环反馈信号,使部分有害响应也能参与搜索过程而非被丢弃;同时设计了五种语义明确的变异算子(roleplay、scenario、expand、troubleshooting、mechanistic),其中“troubleshooting”和“mechanistic”为新提出的方法,用于生成具有针对性的上下文诱导结构。该方案显著提升了对多个主流开源与闭源模型的攻击成功率(ASR达90–100%),并揭示了不同厂商模型间存在明显的对齐鲁棒性差异。
链接: https://arxiv.org/abs/2605.02647
作者: Mario Rodríguez Béjar,Francisco J. Cortés-Delgado,S. Braghin,Jose L. Hernández-Ramos
机构: Universidad de Murcia(穆尔西亚大学); IBM Research(IBM研究院)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 19 pages, 6 figures, 9 tables
Abstract:Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance. While recent multi-turn, search-based approaches have begun to bridge this gap, the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a simulated multi-turn primed dialogue. The strategy leverages a graded 0-5 harm score from a two-level judge as an in-loop signal, enabling partially harmful responses to guide the search process rather than being discarded. Search is driven by five semantically defined mutation operators: roleplay, scenario, expand, troubleshooting, and mechanistic, of which the last two are novel contributions of this work. Across 50 representative HarmBench behaviors, ContextualJailbreak achieves an ASR of 100% on gpt-oss:20B, 100% on qwen3-8B, 100% on llama3.1:70B, and 90% on gpt-oss:120B, outperforming four single- and multi-turn baselines by 31-96 percentage points on average. The 40 maximally harmful attacks discovered against gpt-oss:120B transfer without adaptation to closed frontier models, achieving 90.0% on gpt-4o-mini, 70.0% on gpt-5, and 70.0% on gemini-3-flash, but only 17.5% on claude-opus-4-7 and 15.0% on claude-sonnet-4-6, revealing a pronounced provider-level asymmetry in alignment robustness. Comments: 19 pages, 6 figures, 9 tables Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR) Cite as: arXiv:2605.02647 [cs.CL] (or arXiv:2605.02647v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.02647 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mario Rodriguez Bejar [view email] [v1] Mon, 4 May 2026 14:32:40 UTC (7,343 KB) Full-text links: Access Paper: View a PDF of the paper titled ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming, by Mario Rodr’iguez B’ejar and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2026-05 Change to browse by: cs cs.CR References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[NLP-12] Mapping Discourse Reframing: A Multi-Layer Network Approach to Italian HPV Vaccine Discourse on X (2010-2024)
【速读】: 该论文旨在解决在线叙事在群体间传播过程中,信息失序(information disorder)的早期识别难题,尤其是传统计算分析方法因采用保守的网络构建策略而忽略初始稀疏但关键的信号问题。其解决方案的关键在于提出一种多层框架,通过双层结构实现对低频信号的捕捉:第一层利用保守的社区检测识别出稳定的主流预防性话语联盟,揭示出与日益分离的怀疑论联盟之间的对比;第二层引入“覆盖层”(coverage layer),基于加权连接度将边缘话题标签投影到核心联盟中,从而有效恢复长尾的、具有问题性的标签,同时保持可解释的联盟结构。该方法显著提升了对信息失序扩散路径的追踪能力,尤其适用于刻画极化叙事随时间演变的结构性成熟过程。
链接: https://arxiv.org/abs/2605.02629
作者: Lorella Viola
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Understanding how online narratives travel through coalitions is critical for identifying information disorder, yet computational analyses often rely on conservative network constructions that erase initially sparse but salient signals. This paper proposes a novel multi-layer framework that captures low-frequency signals of emerging information disorder allowing for locating where online discourse is reframed and amplified over time. The use case is 14 years of Italian discourse on X regarding the Human Papillomavirus (HPV) vaccine across three pivotal epochs (2010-2024). Utilizing hashtag co-occurrence networks, we introduce a dual-layer approach. We first identify robust core discourse coalitions through conservative community detection, revealing a stable prevention-oriented backbone contrasted with increasingly separable skepticism coalitions. We then introduce a coverage layer and project fringe hashtags into core coalitions based on weighted connectivity. Using a manually labelled set of skeptical and conspiratorial seed tweets, we demonstrate that this core-coverage projection significantly improves the recovery of long-tail, problematic hashtags while preserving an interpretable coalition structure. Our findings characterize the structural maturation of polarized narratives and provide a methodology for mapping how discourse is reframed and amplified by information disorder over time.
[NLP-13] Synthetic Users Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
【速读】: 该论文旨在解决当前用户模拟(user simulation)在AI聊天机器人评估中 realism(真实性)不足的问题,即如何更准确地衡量模拟对话与真实用户-聊天机器人交互之间的差异。现有方法通常仅从单个对话层面提供粗粒度的质量信号,难以支撑严谨的评估。论文提出了一种名为 realsim 的评估框架,其关键在于引入分布视角(distributional view),从8个维度系统性对比真实与模拟对话,涵盖交互的交际功能、用户状态及话语表层形式等属性。通过构建包含1000个多轮任务导向真实对话的基准数据集,研究发现模拟用户普遍难以捕捉真实用户带来的沟通摩擦(communication frictions),导致基于此类模拟的评估结果可能过于乐观,并且不同应用场景下的表现存在显著差异,提示未来需开发领域特定的用户模拟器。
链接: https://arxiv.org/abs/2605.02624
作者: Yu Lu Liu,Hyokun Yun,Tanya Roosta,Ziang Xiao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:There is growing interest in exploring user simulation as an alternative to gathering and scoring real user-chatbot interactions for AI chatbot evaluation. For this purpose, it is important to ensure the realism of the simulation, i.e., the extent to which simulated dialogues reflect real dialogues users have with chatbots. Most existing methods evaluating simulation realism produce coarse quality signal and remain solely at the level of individual dialogues. To support more rigorous evaluation in this area, we propose realsim, an evaluation framework that enables practitioners to take a distributional view of real vs. simulated dialogues along 8 dimensions, covering attributes related to the communicative functions of the interaction, user states, and the surface form of user messages. We then instantiate the framework with a curated dataset of 1K multi-turn task-focused real user-chatbot dialogues that cover 16 domains of chatbot applications. Overall, we find that simulated users tend to struggle at capturing communication frictions that real users introduce to interactions, which could make evaluations based on such simulations overly optimistic. We also observe variability in performance across different domains, which may indicate a need for domain-specific user simulators.
[NLP-14] Beating the Style Detector: Three Hours of Agent ic Research on the AI-Text Arms Race
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)研究中可复现性差与人类后编辑(post-editing)效能有限的问题。其核心挑战在于:一方面,传统NLP实验复现耗时长、成本高;另一方面,尽管人类对模型生成文本进行风格调整(personal-style post-editing)仍具价值,但其效果受限于主观性和效率瓶颈。解决方案的关键在于构建一个“代理驱动的研究框架”(agentic-research harness),使人类研究人员仅作为评审者参与,而由自动化代理完成全部实验执行和迭代优化。在此框架下,作者不仅成功复现了ACL 2026论文的全部七个预注册假设,并精确复现了感知自相似性与嵌入测量自相似性之间的强相关性(r=+0.244, p<10⁻⁸);更重要的是,他们发现前沿大语言模型(LLM)如GPT-5.5和Claude Opus在风格差距缩小上显著优于人类后编辑(达71–75% vs. 24%),且在80%任务中表现更优;同时通过对抗性训练揭示了AI检测器存在可被规避的风险——当固定检测器并允许代理进行20轮反馈迭代时,Opus代理能将部分伪造文本从AI空间转移到人类空间,且所有边际缩小一个数量级,表明当前检测机制尚未具备鲁棒性。这一方法论创新为高效、透明、可扩展的NLP研究提供了新范式。
链接: https://arxiv.org/abs/2605.02620
作者: Andreas Maier,Moritz Zaiss,Siming Bayer
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted to RRPR 2026
Abstract:Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL,2026 study on personal-style post-editing of LLM drafts – and add three new ones – with the human investigator acting only as a reviewer-in-the-loop. We reproduce all seven preregistered hypotheses and recover the paper’s headline correlation between perceived self-similarity and embedding-measured self-similarity to three decimal places ( r=+0.244 , p10^-8 , n=648 ). Under a leakage-free held-out protocol, GPT-5.5 and Claude,Opus,4.7 close 71 – 75,% of the style gap to the same-author ceiling on 324 paired tasks, against 24,% for the human post-edit, and beat the human post-edit on \sim 80,% of tasks. We then frame the same data as an AI-text detection arms race. A leave-authors-out linear SVM on LUAR-MUD embeddings reaches AUC 0.93 – 1.00 across approaches; six diagnostics show that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature. Given T=20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out test mimics to the human half-space and shrinks every margin by an order of magnitude. With moderate effort against a known detector, a frontier LLM can already efficiently lower its own AI-detection probability. All code, 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released.
[NLP-15] Dependency Parsing Across the Resource Spectrum: Evaluating Architectures on High and Low-Resource Languages
【速读】: 该论文旨在解决在低资源语言环境下,基于Transformer的依赖句法分析模型是否仍能优于传统简单架构(如Biaffine LSTM)这一问题。研究发现,在训练数据稀缺时,Biaffine LSTM在十种类型多样、尤其是非洲低资源语言上表现更优;而随着语料规模增加,Transformer模型(如AfroXLMR-large和RemBERT)才逐渐展现出其优势。关键在于:当标注数据不足时,Biaffine LSTM凭借更强的泛化能力与更低的过拟合风险,成为更适合低资源场景下的语法工具开发方案;而morphological complexity(通过MATTR指标衡量)是影响Transformer相对劣势的重要次级因素。
链接: https://arxiv.org/abs/2605.02608
作者: Kevin Guan,Happy Buzaaba,Christiane Fellbaum
机构: Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Transformer-based models achieve state-of-the-art dependency parsing for high-resource languages, yet their advantage over simpler architectures in low-resource settings remains poorly understood. We evaluate four parsers – the Biaffine LSTM, Stack-Pointer Network, AfroXLMR-large, and RemBERT – across ten typologically diverse languages, with a focus on low-resource African languages. We find that the Biaffine LSTM consistently outperforms transformer models in low-resource regimes, with transformers recovering their advantage as training data increases. The crossover falls within a resource range typical of treebanks for under-resourced languages. Morphological complexity (measured via MATTR) emerges as a significant secondary predictor of transformers’ relative disadvantage after controlling for corpus size. These results indicate that the Biaffine LSTM may be better suited for syntactic tool development in low-resource regimes until sufficient annotated data is available to leverage the representational capacity of pre-trained transformers.
[NLP-16] SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures SEMEVAL-2026 SEMEVAL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)和自然语言处理(Natural Language Processing, NLP)系统在多语言与多文化环境下的适应性评估问题,尤其关注低资源语言和代表性不足文化场景中的表现。其解决方案的关键在于构建了一个扩展的、人工标注的BLEnD基准数据集,覆盖30余种语言-文化组合,且严格限定仅用于评估目的(禁止训练、微调或任何形式的模型修改),从而确保公平性和客观性;同时设计了两类评测任务——短答案问答(Short-Answer Questions, SAQ)和多项选择题(Multiple-Choice Questions, MCQ),并鼓励多样化的建模策略,最终通过62支团队的系统提交和19篇系统描述论文,揭示了当前模型在跨语言文化适配中的性能边界与共性挑战。
链接: https://arxiv.org/abs/2605.02601
作者: Nedjma Ousidhoum,Junho Myung,Carla Perez-Almendros,Jiho Jin,Amr Keleg,Meriem Beloucif,Yi Zhou,Rodrigo Agerri,Vladimir Araujo,Naomi Baes,James Barry,Joanne Boisson,Nancy F. Chen,Christine de Kock,Aleksandra Edwards,Joseba Fernandez de Landa,Mohamed Fazli Imam,Huda Hakami,Shu-Kai Hsieh,Joseph Marvin Imperial,Roy Ka-Wei Lee,Zhengyuan Liu,Chenyang Lyu,Younes Samih,Johan Sjons,Bryan Tan,Asahi Ushio,Weihua Zheng,Alice Oh,Jose Camacho-Collados
机构: Cardiff University; KAIST; MBZUAI; Uppsala University; HiTZ Center, University of the Basque Country EHU; Sailplane AI; University of Melbourne; IBM Research; Agency for Science, Technology and Research (A*STAR), Singapore; Taif University; National Taiwan University; National University Philippines; University of Bath; Singapore University of Technology and Design; Alibaba; Google
类目: Computation and Language (cs.CL)
备注: SemEval-2026 Task Description Paper. Data and resources are available at \url{ this https URL
Abstract:We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al. 2024), covering more than 30 language-culture pairs, predominantly representing low-resource languages spoken across multiple continents. As the task is designed strictly for evaluation, participants were not permitted to use the data for training, fine-tuning, few-shot learning, or any other form of model modification. Our task includes two tracks: (a) Short-Answer Questions (SAQ) and (b) Multiple-Choice Questions (MCQ). Participants were required to predict labels and were allowed to submit any NLP system and adopt diverse modelling strategies, provided that the benchmark was used solely for evaluation. The task attracted more than 140 registered participants, and we received final submissions from 62 teams, along with 19 system description papers. We report the results and present an analysis of the best-performing systems and the most commonly adopted approaches. Furthermore, we discuss shared insights into open questions and challenges related to evaluation, misalignment, and methodological perspectives on model behaviour in low-resource languages and for under-represented cultures.
[NLP-17] Revisiting Semantic Role Labeling: Efficient Structured Inference with Dependency-Informed Analysis
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在自然语言处理中普遍依赖隐式语义表示、缺乏显式结构约束与系统性解释机制的问题,尤其是在语义角色标注(Semantic Role Labeling, SRL)任务中。其核心解决方案是引入一种现代化的基于编码器(encoder-based)的SRL框架,该框架在保持显式谓词-论元结构的同时,实现了推理速度提升10倍,并通过引入依赖信息(dependency-informed)的诊断方法增强结构稳定性。关键创新在于将显式结构建模与现代预训练语言模型(如RoBERTa和DeBERTa)结合,在不牺牲预测性能的前提下显著提升结构一致性与可解释性,同时支持多语言SRL迁移作为下游应用。
链接: https://arxiv.org/abs/2605.02505
作者: Sangpil Youm,Leah Jones,Bonnie J. Dorr
机构: University of Florida (佛罗里达大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 9 figures
Abstract:Semantic Role Labeling (SRL) provides an explicit representation of predicate-argument structure, capturing linguistically grounded relations such as who did what to whom. While recent NLP progress has been dominated by large language models (LLMs), these systems often rely on implicit semantic representations, often lacking explicit structural constraints and systematic explanatory mechanisms. Traditionally, SRL systems have often relied on AllenNLP; however, the framework entered maintenance mode in December 2022, limiting compatibility with evolving encoder architectures and modern inference requirements. We revisit structured SRL modeling, introducing a modernized encoder-based framework that preserves explicit predicate-argument structure while enabling inference 10 times faster. Using BERT-base, the model attains comparable predictive performance, and RoBERTa and DeBERTa further improve F1 performance within the same framework. We adopt a dependency-informed diagnostic methodology to characterize span-level inconsistencies and conduct a representation-level analysis of LLM behavior under dependency-informed structural signals. Results indicate that dependency cues primarily improve structural stability. Finally, we illustrate how the framework’s explicit predicate-argument structure can support multilingual SRL projection as a downstream application.
[NLP-18] A multilingual hallucination benchmark: MultiWikiQHalluA
【速读】: 该论文旨在解决生成式 AI 在低资源语言中幻觉(hallucination)现象的评估不足问题,即现有大多数幻觉检测研究集中于英语,难以确定其结论是否适用于其他语言环境。其解决方案的关键在于构建一个多语言幻觉数据集并训练细粒度的 token-level 幻觉分类器:作者基于 MultiWikiQA 数据集和 LettuceDetect 框架,生成了覆盖 306 种语言的合成幻觉数据,并从中为 30 种欧洲语言训练出高精度的幻觉识别模型;随后利用这些分类器对多个主流大模型在英文、丹麦语、德语和冰岛语等语言上的输出进行系统性评估,揭示出幻觉率随模型规模下降、且显著高于低资源语言(如冰岛语)的现象。
链接: https://arxiv.org/abs/2605.02504
作者: Freja Thoresen,Dan Saattrup Smart
机构: 未知
类目: Computation and Language (cs.CL)
备注: Camera-ready version for RESOURCEFUL 2026
Abstract:Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. Leveraging the multilingual MultiWikiQA dataset, we utilize the LettuceDetect framework to create synthetic hallucination datasets for 306 languages, from which we train token-level hallucination classifiers for 30 European languages. In this work, we present evaluations of model hallucinations on a selection of languages: English, Danish, German, and Icelandic. Using these classifiers, we evaluate the hallucination rates for Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B. Our classifiers reveal notably higher hallucination rates for Qwen3-0.6B (up to 60% of answers containing at least one hallucination, peaking in Icelandic) and generally lower rates for larger models, with cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B performing best on most languages. Hallucination rates are consistently higher for lower-resource languages, particularly Icelandic.
[NLP-19] betan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
【速读】: 该论文旨在解决藏文文本到语音(Tibetan text-to-speech, TTS)系统在资源稀缺、方言差异显著以及书面语与发音映射复杂等挑战下的性能瓶颈问题。解决方案的关键在于:首先构建基于大模型(large-model-based)的藏文TTS系统,依托Xingchen AGI Lab开发的大规模语音合成模型;其次通过数据质量增强、面向藏文的文本表示与分词器适配,以及跨语言自适应训练策略,有效提升低资源条件下的语音合成稳定性与自然度。实验表明,该方案在主观评价中达到4.28–4.35的平均意见分数(MOS),且发音准确率高达97.6%,显著优于现有商用接口,为未来统一多方言藏语语音合成提供了技术基础。
链接: https://arxiv.org/abs/2605.02496
作者: Jiaxu He,Chao Wang,Jie Lian,Yuqing Cai,Yongxiang Li,Renzeg Duojie,Jie Li
机构: Xingchen AGI Lab, China Telecom Artificial Intelligence Technology Co. Ltd; Xizang University; Qinghai Normal University; University of Electronic Science and Technology of China
类目: ound (cs.SD); Computation and Language (cs.CL)
备注:
Abstract:Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-model-based Tibetan TTS system in the industry, built upon a large speech synthesis model developed by Xingchen AGI Lab. The proposed system integrates data quality enhancement, Tibetan-oriented text representation and tokenizer adaptation, and cross-lingual adaptive training for low-resource Tibetan speech synthesis. Experimental results show that the system can generate stable, natural, and intelligible Tibetan speech under low-resource conditions. In subjective evaluation, the MOS scores of the syllable-level and BPE-based systems reach 4.28 and 4.35, while their pronunciation accuracies reach 97.6% and 96.6%, respectively, outperforming an external commercial Tibetan TTS interface. These results demonstrate that combining a large-model backbone with Tibetan-oriented text representation adaptation and cross-lingual adaptive training enables highly usable low-resource Tibetan speech synthesis, and also provides a technical foundation for future unified multi-dialect Tibetan speech synthesis.
[NLP-20] Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives
【速读】: 该论文旨在解决如何将叙事内容转化为可计算的结构化模型,并在其中实现因果推理与情感状态建模的问题。其核心挑战在于将故事中的因果关系、反事实推理与读者的情感体验(如悬念、惊奇等)进行形式化表达和量化评估。解决方案的关键在于构建一个名为Shadow-Loom的开源框架,该框架通过版本化的图形世界模型(graphical world model)统一表示叙事内容,其中两个引擎分别运行:一是基于Pearl因果阶梯(ladder of causation)的因果物理系统,用于处理真实世界的因果逻辑;二是基于祖先多世界网络(Ancestral Multi-World Networks)的反事实微积分系统,用于生成和评估假设性情节变化;同时引入叙事物理(narrative physics),基于Sternberg的悬念/惊奇三角理论,对同一图结构在四种读者心理状态(神秘感、戏剧性反讽、悬念、意外)下的得分进行计算,从而实现对故事结构与读者情绪响应的联合建模。整个过程仅在边界使用大语言模型(LLM)进行文本提取与渲染,而核心推理任务由类型化代码完成,确保可解释性和可控性。
链接: https://arxiv.org/abs/2605.02475
作者: David Wilmot
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 28 pages total
Abstract:Stories hold a reader’s attention because they have causes, secrets, and consequences. Shadow-Loom is an experimental open-source framework that turns a narrative into a versioned graphical world model and lets two engines act on it: a causal physics grounded in Pearl’s ladder of causation and a recently proposed counterfactual calculus over Ancestral Multi-World Networks; and a narrative physics that scores the same graph against four structural reader-states – mystery, dramatic irony, suspense, and surprise – in the tradition of Sternberg’s curiosity/suspense/surprise triad, with suspense formalised in the structural-affect line of work on story comprehension and computational suspense. Large language models are used only at the boundary: extraction, rendering, and audit; identification, intervention, and counterfactual reasoning are carried out in typed code over the graph. The system is offered as a research artefact rather than as a benchmarked NLP model; code, fixtures, and pipeline are released open source.
[NLP-21] Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication ACL2026
【速读】: 该论文旨在解决法律文本中计算型条款(computational legal clauses)在实际应用中因大型推理模型(Large Reasoning Models, LRM)存在推理错误和高推理成本而导致的生产系统难以部署的问题。其解决方案的关键在于提出一种称为“ amortized intelligence”的神经符号方法:利用一次性的大语言模型(LLM)将法律文本翻译为确定性自主合约语言(Deterministic Autonomous Contract Language, DACL)——一种带类型的图结构中间表示;随后的判决过程基于该图结构进行确定性执行,并提供可视化的可审计追踪路径,从而实现近乎完美的逻辑一致性,显著降低计算开销(超过90%),同时满足法律领域对可审计性的严格要求。
链接: https://arxiv.org/abs/2605.02472
作者: Stanisław Sójka,Witold Kowalczyk
机构: Delos AI Inc.
类目: Computation and Language (cs.CL)
备注: 14 pages (7 pages main text), 3 figures. Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026) - Industry Track
Abstract:Legal texts often contain computational legal clauses–provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized Intelligence, a neuro-symbolic approach where we use an LLM once to translate a legal text into Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation. Adjudication then relies on deterministic graph executions with a visually auditable trace. In comparison against runtime LRM baselines (including GPT-5.2 and Gemini 3 Pro), our DACL-based Agent achieves near-perfect consistency and mitigates the “reasoning cliff” observed in probabilistic models. The system reduces compute costs by over 90% in high-volume workflows while satisfying the strict auditability requirements of legal adjudication.
[NLP-22] ATLAS: Article Tracking Linking and Analysis of Swedish Encyclopedias
【速读】: 该论文旨在解决历史文献数字化过程中结构信息缺失的问题,特别是针对多版次百科全书(如瑞典权威百科《Nordisk familjebok》)中难以追踪知识演变与实体关联的挑战。其核心解决方案是一个端到端的自动化处理流水线,关键步骤包括:从原始文本中提取词头(headword)并识别条目、对实体进行分类、跨版本匹配条目以及将条目链接至Wikidata知识库。该方法实现了高精度的词头提取(F1=97.8%)和分类(F1=93.4%),并在小规模评估中达到93%的跨版匹配准确率及85%的Wikidata链接准确率,证明了自动化恢复历史知识结构的可行性,从而促进通用知识的保存与知识传播机制的理解。
链接: https://arxiv.org/abs/2605.02466
作者: Albin Andersson,Salam Jonasson,Fredrik Wastring,Pierre Nugues
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 5 figures
Abstract:The digitization of old encyclopedias represents an important step to improve access to historically structured knowledge. Often, however, this process does not go beyond an optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge. The lack of structure in the raw text makes it difficult to track changes across these editions. In this work, we built a pipeline to restore the text structure, where we extract the headwords and identify entries; categorize the entities; match entries across editions; and link entries to a Wikidata item. We applied this pipeline to the four major editions of \textitNordisk familjebok, an authoritative Swedish encyclopedia published between 1876 and 1951. We could extract the headwords with an F1 score of 97.8% and we obtained an F1 score of 93.4% on the headword classification. On a small-scale evaluation, we reached a 93% precision on the cross-edition matching, 85% precision and 16.5% recall on the Wikidata linking. This shows that an automated approach to digitized historical knowledge is possible. This should facilitate the preservation of general knowledge and the understanding of knowledge transmission. The datasets and programs are available online.
[NLP-23] Leverag ing Argument Structure to Predict Content Hatefulness
【速读】: 该论文旨在解决信息失序(information disorder)问题,特别是在线平台上误导性、虚假性和仇恨内容的传播。其核心挑战在于不同维度(如仇恨言论、虚假信息、误导信息等)之间存在复杂关联,需综合手段应对。解决方案的关键在于利用论证结构(argument structure),通过分析白人至上主义论坛消息中的前提(premise)与结论(conclusion)之间的逻辑关系,并结合对论证组件的可信度(checkworthiness)和仇恨程度(hatefulness)标注,从而推断整个消息的仇恨倾向。实验结果表明,该方法在识别仇恨内容方面具有高准确率(最高达96% F1),为未来应对信息失序提供了可扩展的新路径。
链接: https://arxiv.org/abs/2605.02457
作者: Nicolas Benjamin Ocampo,Davide Ceolin
机构: Centrum Wiskunde & Informatica, Amsterdam, The Netherlands(荷兰数学与计算机科学研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Information disorder is a challenging phenomenon that affects society at large. This phenomenon entails the diffusion of misleading, misinforming, and hateful content online. In different contexts, one aspect of the problem may prevail, but overall, this is a broad problem that requires comprehensive solutions. While each dimension of the problem (hate speech, disinformation, misinformation, etc.) requires in-depth analysis, in this paper, we look into the possibility of argument structure to provide relevant information to link these different areas of the problem. In particular, we focus on the WSF-ARG+ dataset, which consists of white supremacy forum messages annotated in terms of argument structure (premises and conclusion). There, we leverage the checkworthiness and hatefulness annotations of the argument components to obtain insights into the hatefulness of the whole message. Our results show promising insights (up to 96% F1), indicating the possibility of extending this direction in the future to tackle hateful content identification and information disorder countering.
[NLP-24] PC-MNet: Dual-Level Congruity Modeling for Multimodal Sarcasm Detection via Polarity-Modulated Attention
【速读】: 该论文旨在解决多模态讽刺检测(Multimodal Sarcasm Detection)中因文本与非语言线索之间语用不一致而导致的精准识别难题。现有方法受限于朴素的相似性注意力机制和统一的晚期融合策略,难以有效建模跨模态的细微语用矛盾。其解决方案的关键在于提出一种标量一致性路由机制(scalar congruity routing mechanism)和先验引导的上下文图(prior-guided contextual graph),通过两阶段非对称优化驱动的一致性感知对比学习锚定广义不一致流形,从而选择性融合最具判别力的多粒度证据,实现对原子级、组合级和上下文级冲突的解耦建模,显著提升了模型性能,在MUStARD基准上相比最强基线提升了3.14%的Macro-F1。
链接: https://arxiv.org/abs/2605.02447
作者: Maoheng Li,Ling Zhou,Xiaohua Huang,Rubing Huang,Wenming Zheng,Guoying Zhao
机构: Macau University of Science and Technology (澳门科技大学); Nanjing Institute of Technology (南京工程学院); Macau University of Science and Technology Zhuhai MUST Science and Technology Research Institute (澳门科技大学珠海科技研究院); Southeast University (东南大学); University of Oulu (奥卢大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal sarcasm detection, which aims to precisely identify pragmatic incongruities between literal text and nonverbal cues, has gained substantial attention in multimodal understanding. Recent advancements have predominantly relied on na"ıve similarity-based attention mechanisms and uniform late fusion this http URL, given that functional entanglement restricts traditional late fusions, we incorporate a scalar congruity routing mechanism and a prior-guided contextual graph. This mechanism anchors a generalized incongruity manifold through a two-stage asymmetric optimization driven by inconsistency-aware contrastive learning, selectively fusing only the most discriminative multi-granularity evidence. Extensive experiments on the \textttMUStARD benchmark and its spurious-correlation-mitigated balanced datasets demonstrate that our approach achieves new state-of-the-art performance, surpassing the strongest multimodal baseline by a substantial 3.14% improvement in Macro-F1. By architecturally isolating atomic, composition, and contextual conflicts. This work provides a robust, decoupled paradigm for modeling subtle pragmatic incongruities in human communication.
[NLP-25] HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言处理任务中普遍存在幻觉(hallucination)的问题,即模型生成与事实不符、脱离上下文或违背用户指令的内容。为系统评估和缓解幻觉现象,作者提出HalluScan框架,其核心贡献包括:(1)引入HalluScore这一复合指标,与人类专家判断具有较高的皮尔逊相关系数(r = 0.41),有效量化幻觉检测性能;(2)设计自适应检测路由算法(Adaptive Detection Routing, ADR),在仅损失0.1% AUROC的前提下实现2.0倍的成本降低;(3)通过系统性误差级联分解揭示不同领域间幻觉类型存在显著差异。实验表明,基于自然语言推理(NLI)验证的方法在整体上表现最优(AUROC = 0.88),而基于检索增强验证(RAV)的方法次之(AUROC = 0.66)。
链接: https://arxiv.org/abs/2605.02443
作者: Ahmed Cherif
机构: 未知
类目: Computation and Language (cs.CL)
备注: 38 pages, 13 figures, 10 tables. Submitted to Neural Computing and Applications
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations – generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.
[NLP-26] Measuring AI Reasoning : A Guide for Researchers
【速读】: 该论文旨在解决当前语言模型推理能力评估中过度依赖最终答案准确率的问题,指出这一指标无法有效诊断或调试前沿模型中产生个体解的底层推理过程。其解决方案的关键在于推动从结果导向向过程导向的评估范式转变,即通过检验中间推理步骤(intermediate reasoning traces)的忠实性(faithfulness)与有效性(validity)来评估推理能力,从而将推理视为一种基于输入条件自适应、多步搜索的过程,并主张采用中间解码和外部化推理轨迹作为合适的评估接口。
链接: https://arxiv.org/abs/2605.02442
作者: Munachiso Samuel Nwadike,Zangir Iklassov,Kareem Ali,Rifo Genadi,Kentaro Inui
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 3 figures
Abstract:In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces. Central to our argument is that final-answer accuracy alone is an insufficient measure of reasoning, because it provides little ability to diagnose or debug the underlying processes that produce individual solutions in frontier models. We therefore argue for a shift toward process-based evaluation, in which reasoning is assessed through the faithfulness and validity of intermediate reasoning traces as first-class evaluation targets.
[NLP-27] Automatic Reflection Level Classification in Hungarian Student Essays
【速读】: 该论文旨在解决匈牙利语学生反思性写作自动分类的问题,即如何在资源有限的语言环境中高效、客观地评估学生的反思能力水平。其关键解决方案在于:一方面采用基于TF-IDF和语义嵌入特征的传统机器学习模型,通过精心设计的特征工程实现较高的整体性能(综合准确率、F1分数和ROC AUC达71%);另一方面引入针对匈牙利语优化的Transformer模型进行文档级分类,尽管整体得分略低(68%),但对少数类别的泛化能力更强。此外,研究系统性地比较了类别不平衡处理策略(如类别权重调整、过采样、数据增强和损失函数优化),揭示了传统方法在低资源场景下的有效性与Transformer模型在不均衡数据上的鲁棒性,为匈牙利语及其他形态丰富的语言的自动化反思分析提供了可复用的数据集与实验基准。
链接: https://arxiv.org/abs/2605.02402
作者: Zsolt Csibi,Mónika Sándor,Mónika Serfőző,Kinga Gyöngy,Kristian Fenech
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reflective thinking is a key competency in education, but assessing reflective writing remains a time-consuming and subjective task for education experts. While automated reflective analysis has been explored in several languages, Hungarian language was not researched extensively. In this paper, we present the first comprehensive study on automatic reflection level classification in Hungarian student essays. We used a large, expert-annotated Hungarian dataset consisting of 1,954 reflective essays collected over multiple academic years and labeled on a four-level reflection scale. We investigate two approaches: (1) classical machine learning models using TF-IDF and semantic embedding features, and (2) Hungarian-specific transformer models fine-tuned for document-level reflection classification. To address the strong class imbalance in the dataset, we systematically examine class weighting, oversampling, data augmentation, and alternative loss functions. An extensive ablation study is conducted to analyze the contribution of each modeling and balancing strategy. Our results show that shallow machine learning models with appropriate feature engineering achieve strong overall performance, reaching up to 71% overall score averaged over accuracy, F1-score, and ROC AUC metrics, while transformer-based models achieve slightly lower overall score (68%) averaged over the same metrics, but demonstrate better generalization on minority reflection classes. These findings highlight the continued relevance of classical methods for low-resource settings and the robustness of transformer models for imbalanced classification. The proposed dataset and experimental insights provide a solid foundation for future research on automated reflective analysis in Hungarian and other morphologically rich languages.
[NLP-28] he Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
【速读】: 该论文旨在解决前沿生成式 AI (Generative AI) 模型在高风险决策流程中面临的一个关键安全问题:即在对抗性压力下维持元认知稳定性(metacognitive stability)的能力,尤其是模型对自身知识边界的认知能力是否会崩溃。现有安全评估主要关注策略性欺骗(scheming),而本文首次系统揭示了一种更根本的失败模式——认知崩溃(cognitive collapse),表现为模型在受到强制合规指令干扰时出现显著性能下降。解决方案的关键在于设计了 SCHEMA 评估框架,采用六条件因子设计与双分类器评分机制,在 67,221 条记录上对 11 个来自 8 家厂商的前沿模型进行测试,识别出“合规陷阱”(Compliance Trap)这一核心驱动因素:认知崩溃并非源于威胁内容本身,而是由强制服从指令破坏了模型的元认知边界。通过移除合规后缀可恢复模型性能,且发现具备高级推理能力的模型反而最易崩溃,而 Anthropic 的宪法对齐(Constitutional AI)因特定对齐训练实现了近乎免疫,凸显了对齐策略而非单纯能力的重要性。
链接: https://arxiv.org/abs/2605.02398
作者: Rahul Kumar
机构: Google(谷歌); Meta; Stability.AI; Anthropic; Character.ai; Claude; OpenAI; Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, 3 tables. Code: this https URL Dataset: this https URL
Abstract:As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability – knowing what they do not know, detecting errors, seeking clarification – under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all p 2 \times 10^-8 , surviving Bonferroni correction). Crucially, we identify a “Compliance Trap”: through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic’s Constitutional AI demonstrates near-perfect immunity – not from superior capability (Google’s Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.
[NLP-29] Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training
【速读】: 该论文旨在解决机器生成文本(Machine-Generated Text, MGT)检测在少样本(few-shot)场景下性能不足以及对对抗性、人类化攻击脆弱的问题。解决方案的关键在于提出一种名为REACT的对抗训练框架,其核心是通过威胁建模视角,设计一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的人类化攻击者与目标检测器协同进化:攻击者利用RAG生成高度人类化的对抗样本以逃避检测,而检测器则通过对比学习目标从这些对抗样本中稳定少样本表征学习并提升鲁棒性,二者交替优化实现检测性能与抗攻击能力的同步增强。
链接: https://arxiv.org/abs/2605.02374
作者: Wenjing Duan,Qi Zhou,Yuanfan Li
机构: Xi’an Jiaotong University (西安交通大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker’s viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks. REACT couples a humanization-oriented attacker with a target detector: the attacker leverages retrieval-augmented generation (RAG) to craft highly human-like adversarial examples to evade detection, while the detector learns from these adversaries with a contrastive objective to stabilize few-shot representation learning and enhance robustness. We alternately update the attacker and the detector to enable their co-evolution. Experiments on 4 datasets with 4 shot sizes and 3 random seeds show that REACT improves average detection F1 by 4.95 points over 8 state-of-the-art (SOTA) detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.
[NLP-30] InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition ICML2026
【速读】: 该论文旨在解决在大语言模型(Large Language Model, LLM)预训练中,数据混合策略(data mixture recipes)和重复训练(repetition)对模型性能影响难以预测的问题,尤其是在数据有限且存在过拟合(overtraining)场景下,传统缩放定律(scaling laws)无法可靠外推。解决方案的关键在于提出InfoLaw(Information Scaling Laws),即一种基于信息积累的可解释性框架,通过建模训练过程中信息密度(由数据质量决定)与重复训练带来的边际收益递减(scale-dependent diminishing returns)来精准预测损失(loss)。该方法利用不同规模、质量分布及重复水平的数据集训练结果进行建模,实现了对未见过的数据配方和更大规模训练(最高达425B tokens)的高精度预测(平均绝对误差0.15%,最大0.96%),从而支持在不同计算预算下高效选择最优数据配方。
链接: https://arxiv.org/abs/2605.02364
作者: Fengze Liu,Weidong Zhou,Binbin Liu,Ping Guo,Zijun Wang,Bingni Zhang,Yifan Zhang,Yifeng Yu,Xiaohuan Zhou,Taifeng Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026
Abstract:Upweighting high-quality data in LLM pretraining often improves performance, but in datalimited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetitions, making the selection for optimal data recipes at scaling underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scaledependent diminishing returns. We first collect the model performance after training on datasets that vary in scale, quality distribution, and repetition level. Then we build up the modeling for information so that information accurately predicts those model performance. InfoLaw predicts performance on unseen data recipes and larger scale runs (up to 7B, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
[NLP-31] When Correct Isnt Usable: Improving Structured Output Reliability in Small Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)在实际部署中面临的结构化输出可靠性问题,即模型输出在数学正确性与格式合规性(如 JSON 结构有效性)之间存在显著差距。研究发现,即使任务准确率较高(如 GSM8K 上达 85%),若缺乏有效格式控制,输出准确率可能为 0%,表明传统提示策略(如 NAIVE 或 REFERENCE 提示)无法保障结构化输出的稳定性。解决方案的关键在于提出 AloLab——一个基于元智能体(meta-agent)的迭代系统提示优化器(使用 Claude Sonnet 4.5),通过黑盒 API 接口自动优化提示模板,在不进行模型微调的前提下,实现高输出准确率(GSM8K 达 84–87%,MATH 达 34–40%),同时保持接近原始提示的推理延迟,并显著优于静态提示策略(p < 0.05)。
链接: https://arxiv.org/abs/2605.02363
作者: Cosimo Galeone,Minsu Park,Giuseppe Ettorre,Daniele Ligorio
机构: Alomana, Grottaglie, Italy; Google(谷歌); Meta; Stability.AI; Anthropic; Character.ai; Claude; OpenAI(OpenAI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 6 figures, 4 tables
Abstract:Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks – GSM8K and MATH – as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy – the joint event of mathematical correctness and valid JSON structure – as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and in several settings degrades task performance substantially. To overcome this limitation, we developed AloLab, an iterative system-prompt optimizer (meta-agent: Claude Sonnet 4.5) requiring only black-box API access to the target model; it reaches 84-87% output accuracy on GSM8K and 34-40% on MATH across five independent runs per model, with 29/30 paired McNemar comparisons against the best static prompt significant at p 0.05, at near-NAIVE inference latency and without model fine-tuning. The same format failure extends to GPT-4o (OpenAI, 2024), a proprietary closed-source model: REFERENCE achieves 0% output accuracy due to systematic markdown-fence wrapping, while AloLab reaches 95.2% [94.8, 95.6]. An ablation replacing the Sonnet 4.5 meta-agent with Claude 3 Haiku reduces mean output accuracy to 61.0% and increases run-to-run standard deviation from 1 pp to 21.8 pp, confirming that meta-agent capability is a primary driver of optimization quality.
[NLP-32] MolViBench: Evaluating LLM s on Molecular Vibe Coding
【速读】: 该论文旨在解决当前大语言模型(LLM)在分子编程任务中缺乏针对性评估基准的问题,尤其是现有通用代码生成基准(如HumanEval)不涉及化学知识,而化学领域基准(如S²-Bench)主要聚焦于知识回忆或性质预测,而非可执行代码的生成能力。为此,作者提出MolViBench——首个专为“分子振动编码”(Molecular Vibe Coding)设计的基准测试集,涵盖358个精心设计的任务,覆盖从单API调用到端到端虚拟筛选流程设计的五个认知层级,并映射至12个真实药物发现工作流。其解决方案的关键在于构建一个多层次评估框架,结合类型感知的输出比对与基于抽象语法树(AST)的API语义回退分析,从而协同衡量生成代码的可执行性与化学正确性,为诊断LLM在AI加速分子发现中的编码能力提供了一个实用且细粒度的测试平台。
链接: https://arxiv.org/abs/2605.02351
作者: Jiatong Li,Yuxuan Ren,Weida Wang,Changmeng Zheng,Xiao-yong Wei,Qing Li,Yatao Bian
机构: The Hong Kong Polytechnic University (香港理工大学); National University of Singapore (新加坡国立大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注: Github Link: this https URL
Abstract:Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding imposes a distinctive challenge that LLMs should jointly equip programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected. General code generation benchmarks such as HumanEval and SWE-bench require no chemistry knowledge, while chemistry-focused benchmarks such as S^2-Bench and ChemCoTBench evaluate knowledge recall or property prediction rather than executable code generation. To bridge this gap, we introduce MolViBench, the first benchmark tailored for Molecular Vibe Coding. MolViBench comprises 358 curated tasks across five cognitive levels, ranging from single-API recall to end-to-end virtual screening pipeline design, spanning 12 real-world drug discovery workflows. To rigorously assess generated code, we also propose a multi-layered evaluation framework that combines type-aware output comparison and AST-based API-semantic fallback analysis, which jointly measures executability and chemical correctness. We systematically evaluate 9 frontier coding LLMs and compare three real-world Molecular Vibe Coding paradigms, providing a practical and fine-grained testbed for diagnosing LLMs’ coding capabilities in AI-accelerated molecular discovery.
[NLP-33] Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中习得并延续社会偏见的问题,这些偏见可能强化关于性别、种族、宗教、残疾、年龄和经济地位等方面的刻板印象。传统解决方案如重新训练或基于人类反馈微调存在成本高、需访问模型权重且可能损害其他任务性能等局限。本文提出一种全新的方法:在解码阶段进行去偏处理,将偏见缓解视为对候选词元的结构化搜索过程,无需修改模型权重。其核心创新在于引入一个独立的流程奖励模型(Process Reward Model, PRM),作为裁判对每个候选词元同时评估公平性与流畅性,并设计了三种递进复杂度的方案(Best-of-N选择、序列式批判与修正、宪法式自审计)。实验表明,序列式去偏策略在多模型、多语言(英语与乌尔都语)基准上显著提升平均偏见得分达+0.40,同时保持甚至改善流畅性;进一步扩展至开放式生成时,通过轻量级偏见防护门机制仅在潜在偏见词汇触发时介入,使整体开销接近2倍,且生成器侧成本在原生实现中几乎为零。
链接: https://arxiv.org/abs/2605.02348
作者: Muneeb Ur Raheem Khan
机构: Lahore University of Management Sciences (Lahore University of Management Sciences)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 28 pages, 19 figures, preprint
Abstract:Large language models pick up social biases from the data they are trained on and carry those biases into downstream applications, often reinforcing stereotypes around gender, race, religion, disability, age, and socioeconomic status. The standard fixes (retraining on curated data or fine-tuning with human feedback) are expensive, need access to model weights, and risk degrading the model on other tasks. In this paper we take a different route: we debias the model at decoding time, treating bias mitigation as a structured search over candidate tokens without ever touching model weights. A separate Process Reward Model (PRM) acts as a judge, scoring each candidate for both fairness and fluency. We design three schemes of increasing sophistication (Best-of-N selection, Sequential critique-and-revise, and Constitutional self-audit) and evaluate them on four models (GPT-4o-mini, Llama 3.2 3B, Gemma 3 4B, Qwen 2.5 3B) across a 200-prompt bilingual benchmark in English and Urdu covering eight bias categories. Sequential debiasing proves the most effective, raising mean bias scores by up to +0.40 over baseline while preserving (and sometimes improving) fluency. We then extend all three schemes to open-ended generation, where each token is debiased on the fly, and introduce a lightweight Bias Guard gate that fires only on potentially biased words, keeping overhead near 2x for well-calibrated models. A formal overhead metric that separates generator cost from judge cost reveals that Best-of-N is effectively free on the generator side in a native implementation. GPT-4o-mini, included as a strong proprietary anchor, confirms that the framework scales with model capability; the three open-weight models show where current small-scale LLMs still struggle.
[NLP-34] Structural Dilemmas and Developmental Pathways of Legal Argument Mining in the Era of Artificial Intelligence
【速读】: 该论文旨在解决法律论证挖掘(legal argument mining)领域发展缓慢的问题,其核心症结在于缺乏一种能够兼顾理论表达力与计算可行性的结构化表示方法。具体表现为数据标准化困境、有效建模障碍以及领域适应性局限。解决方案的关键在于重构研究范式,提出未来研究应聚焦于建立这种结构化的表示框架,从而在理论上更精准刻画法律论证结构,在技术上实现可扩展、可迁移的计算模型,为该领域的突破提供方向性指引。
链接: https://arxiv.org/abs/2605.02308
作者: Xianglei Liao,Chuanyi Li,Kun Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Against the backdrop of rapid advances in artificial intelligence, legal argument mining has emerged as an important research area linking legal texts with intelligent analysis, carrying significant theoretical and practical implications. Existing studies have primarily developed along three dimensions: data, technology, and theory. At the data level, raw legal texts and annotated corpora constitute the foundational resources. At the technological level, research paradigms have evolved from rule-based systems and traditional machine learning to large language models (LLMs). At the theoretical level, argumentation theory and legal dogmatics provide important references for modeling argumentation structures. However, despite ongoing progress, the overall development of legal argument mining remains relatively slow. Building on a systematic review of existing research, this study conducts an in-depth analysis and finds that this is due not only to data scarcity or technical limitations, but more fundamentally to the lack of a structured representational approach that reconciles theoretical expressiveness with computational feasibility. Specifically, this challenge manifests in dilemmas in data standardization, obstacles to effective modeling, and limitations in domain adaptation. In response, the study proposes several key directions for future research. It aims to provide a reframing of key problems and a pathway for future development in legal argument mining, while leaving specific models and implementation schemes for further investigation.
[NLP-35] Compositional Multi-hop Factual Error Correction via Decomposition-and-Injection
【速读】: 该论文旨在解决多跳事实错误修正(Multi-hop Factual Error Correction, FEC)中因依赖跨多个证据源的组合推理而导致的挑战,尤其是现有方法将主张视为原子单元、难以定位复杂推理链中的语义错误,且受限于标注数据稀缺的问题。其解决方案的关键在于提出一种基于推理感知的合成框架CECoR(Compositional Error Correction via Reasoning-aware Synthesis),通过“分解与注入”范式将多跳主张分解为可解释的推理步骤,并引入受控扰动以合成高质量训练样本;同时采用监督微调与强化学习相结合的两阶段学习策略,显著提升模型的事实准确性与鲁棒性,从而在多跳基准测试中优于远距离监督方法和少样本大语言模型基线,并展现出对单跳修正任务的良好泛化能力及在噪声证据下的稳定性。
链接: https://arxiv.org/abs/2605.02277
作者: Lei Zhu,Xiaobao Wang,Jianbiao Yang,Chenyang Wang,Dongxiao He,Longbiao Wang,Jianwu Dang
机构: Tianjin University (天津大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reasoning across multiple evidence sources. This challenge is further amplified by limited paired data and difficulties in locating semantic errors within complex reasoning chains. We present CECoR (Compositional Error Correction via Reasoning-aware Synthesis), a reasoning-aware framework that introduces a Decomposition and Injection paradigm for compositional error correction. CECoR decomposes multi-hop claims into interpretable reasoning steps and injects controlled perturbations to synthesize high-quality training pairs. A two-stage learning strategy combining supervised fine-tuning and reinforcement learning improves factual accuracy and robustness. Comprehensive evaluations show that CECoR achieves strong performance on multi-hop benchmarks, outperforming both distantly supervised methods and few-shot LLM baselines. It also generalizes effectively to single-hop correction and remains stable under noisy evidence, demonstrating its versatility for real-world factual correction.
[NLP-36] A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures
【速读】: 该论文旨在解决塔吉克语(西里尔字母)与波斯语(阿拉伯字母)之间准确音译(transliteration)的问题,这是跨语言信息处理中的一个关键挑战,尤其在资源稀缺的语言对中。其解决方案的关键在于构建了一个高质量、多源异构的平行语料库,并系统比较了六类现代机器学习架构在该任务上的性能表现。研究发现,基于字节级(byte-level)或字符级(character-level)建模的模型(如ByT5和G2P Transformer)显著优于依赖子词分词(subword tokenization)的传统多语言序列到序列模型(如mBART和mT5),其中ByT5在两个方向上均取得最优结果(chrF++达87.4和80.1),证明了针对此类低资源语言对,采用细粒度文本表示方式的模型更具优势。
链接: https://arxiv.org/abs/2605.02270
作者: Mullosharaf K. Arabov
机构: Kazan Federal University (喀山联邦大学)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of “Shahnameh”, diplomatic articles, texts of “Masnavi-i Ma’navi”, official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiment compared six classes of models: rule-based baseline, LSTM with attention, character-level Transformer, G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. Results demonstrate the overwhelming superiority of ByT5 (chrF++ 87.4 for Tajik to Farsi, 80.1 for reverse). The G2P Transformer significantly outperformed mBART (72.3 vs. 62.2 chrF++) despite limited data. Models using subword tokenization (mT5) failed completely (chrF++ less than 18.5). The findings demonstrate that for accurate transliteration of the Tajik-Farsi pair, architectures operating at the byte or character level are unequivocally more effective than traditional multilingual Seq2Seq models relying on subword tokenization. Comments: Preprint Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.02270 [cs.CL] (or arXiv:2605.02270v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.02270 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-37] Reliability-Oriented Multilingual Orthopedic Diagnosis: A Domain-Adaptive Modeling and a Conceptual Validation Framework
【速读】: 该论文旨在解决生成式 AI(Generative AI)在多语言临床决策支持场景下,尤其是在低资源语言环境中的可靠性、校准性和安全性不足的问题。研究聚焦于从英文、印地语和旁遮普语的自由文本临床笔记中进行结构化骨科诊断任务,评估不同建模策略的表现。其解决方案的关键在于提出一种领域自适应架构(IndicBERT-HPA),通过引入语言特定的骨科适配器头(language-specific orthopedic adapter heads),显著提升了跨语言判别能力和置信度行为的一致性,优于仅基于任务微调或零样本提示的大型语言模型(LLMs)。此外,作者还构建了一个概念性的确定性代理验证框架,强调证据核查、语言敏感验证与保守的人机协同控制机制,以确保高风险医疗场景下的安全部署。
链接: https://arxiv.org/abs/2605.02266
作者: Danish Ali,Li Xiaojian,Sundas Iqbal,Farrukh Zaidi
机构: Wuhan University (武汉大学); Nanjing University of Information Science and Technology (南京信息工程大学); Bahawal Victoria Hospital (巴哈瓦尔维克医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly proposed for clinical decision support including multilingual diagnosis in low-resource settings. However, their reliability, calibration and safety characteristics remain insufficiently understood for structured, high-risk tasks. We present a system-level analysis of multilingual orthopedic diagnosis from free-text clinical notes in English, Hindi and Punjabi. We evaluate three modeling regimes: (i) task-aligned multilingual transformer encoders, (ii) a task-fine-tuned baseline (DistilBERT), and (iii) a domain-adaptive architecture tailored to orthopedic text (IndicBERT-HPA). These models are compared with zero-shot, instruction-tuned LLMs to assess suitability for structured diagnostic classification. Results indicate that while LLMs exhibit strong linguistic fluency, they show unstable calibration and reduced reliability under structured multilingual conditions, particularly in low-resource languages. These findings are specific to zero-shot evaluation and do not imply limitations of fine-tuned models. Domain-adaptive specialization substantially improves cross-lingual discrimination and confidence behavior. IndicBERT-HPA, with language-specific orthopedic adapter heads achieves consistently strong performance across six diagnostic categories and more predictable deployment characteristics than task-only adaptation. Building on these observations, we outline a conceptual deterministic agent-based validation framework for future implementation, formalizing evidence checks, language-sensitive validation and conservative human-in-the-loop gating. Reliable multilingual clinical decision support requires specialized architecture, explicit reliability analysis, and structured validation for safety-critical systems.
[NLP-38] WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
【速读】: 该论文旨在解决视频语言模型(Video Language Models, VLMs)中因视觉标记(visual token)序列过长而导致的推理延迟高和GPU内存占用大的问题。现有方法虽采用基于标记粒度的混合精度量化(mixed-precision quantization)来优化键值(Key-Value, KV)缓存,但存在搜索过程耗时且计算硬件效率低的问题。其解决方案的关键在于提出一种名为WindowQuant的新方法,通过两个核心模块实现:一是窗口级量化搜索(window-level quantization search),根据视觉标记窗口与文本提示之间的相似度快速确定最优位宽配置,从而在保持模型精度的同时提升量化效率;二是窗口级KV缓存计算(window-level KV cache computation),在量化前对KV缓存窗口进行重排序,有效缓解混合精度量化带来的硬件计算不高效问题。实验证明,WindowQuant在多个数据集上优于当前最先进的VLM模型及KV缓存量化方法。
链接: https://arxiv.org/abs/2605.02262
作者: Wei Tao,Xiaoyang Qu,Peiqiang Wang,Guokuan Li,Jiguang Wan,Kai Lu,Jianzong Wang
机构: Huazhong University of Science and Technology (华中科技大学); Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司); Tsinghua University, Shenzhen International Graduate School (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ACM Transactions on Architecture and Code Optimization (ACM TACO)
Abstract:Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in VLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called WindowQuant, which employs window-adaptive mixed-precision quantization to optimize the KV cache. WindowQuant consists of two modules: window-level quantization search and window-level KV cache computation. Window-level quantization search quickly determines the optimal bit-width configuration of the KV cache windows based on the similarity scores between the corresponding visual token windows and the text prompt, maintaining the model accuracy. Furthermore, window-level KV cache computation reorders the KV cache windows before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLM models and KV cache quantization methods on various datasets.
[NLP-39] An Information-theoretic Propagation Denoising and Fusion Framework for Fake News Detection IJCAI ECAI2026
【速读】: 该论文旨在解决不完整传播数据对虚假新闻检测(fake news detection)性能的限制问题,特别是由大语言模型生成的合成传播信号(synthetic propagation)因内在不可靠性可能导致表示偏倚的问题。解决方案的关键在于提出一种基于信息论的传播去噪与融合框架(InfoPDF),其核心是通过建模每种属性特定的合成传播图为概率潜分布,引导可靠性感知的自适应融合策略,并设计基于互信息的目标函数,在压缩传播表示的同时确保任务充分性,从而有效抑制跨属性合成传播中的噪声、保持真实与合成传播表示的一致性,并提升虚假新闻检测和属性预测的判别能力。
链接: https://arxiv.org/abs/2605.02259
作者: Mengyang Chen,Lingwei Wei,Wei Zhou,Songlin Hu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院)
类目: Computation and Language (cs.CL)
备注: Accepted by IJCAI-ECAI 2026, Resource are provided in this https URL
Abstract:Incomplete propagation data significantly hinders robust fake news detection. Recent approaches leverage large language models to simulate missing user interactions via role-playing, thereby enriching propagation with synthetic signals. However, such propagation data is intrinsically unreliable, and directly fusing it can lead to biased representations, leading to limited detection performance. In this paper, we alleviate the unreliability of synthetic propagation from the mutual information perspective and propose a novel information-theoretic propagation denoising and fusion (InfoPDF) framework to learn effective representations from both real and synthetic propagation. Specifically, we first generate attribute-specific synthetic propagation using large language models. Then we model each synthetic propagation graph as a probabilistic latent distribution to guide reliability-aware adaptive fusion with real propagation. During training, we design a mutual information-based objective to learn compressed and task-sufficient propagation representations. It jointly suppresses noisy signals across attribute-specific synthetic propagation, maintains consistency between real and synthetic propagation representations, and ensures task sufficiency for fake news detection and attribute prediction. Experiments on three real-world datasets show that InfoPDF consistently achieves superior performance across various fake news detection tasks. Further analysis demonstrates that InfoPDF can estimate attribute-level reliabilities and learn more discriminative propagation representations.
[NLP-40] Zero-Shot Confidence Estimation for Small LLM s: When Supervised Baselines Arent Worth Training
【速读】: 该论文旨在解决小语言模型(Small Language Model, SLM)如何在无需监督训练数据的情况下可靠地估计自身输出正确性的问题,从而实现本地到云端的智能路由策略——即优先将简单查询交由低成本本地模型处理,仅对复杂任务调用昂贵的云端大语言模型(Large Language Model, LLM)。其核心解决方案是提出一种零样本(zero-shot)置信度信号:平均词元对数概率(average token log-probability),该方法不依赖标注数据即可在分布内(in-distribution)达到或优于基于监督学习的基线(AUROC 0.650–0.714 vs. 0.644–0.676),并在分布外(out-of-distribution)显著超越后者(0.717–0.833 vs. 0.512–0.564),因其衡量的是模型生成过程本身的特性而非查询分布。进一步地,论文还引入检索条件自评估(retrieval-conditional self-assessment),通过在相似度高时选择性注入检索知识,在延迟仅为对数概率方法的1/3至1/10的前提下,将性能提升最多达+0.069 AUROC,且优于使用1,000个标注样本训练的监督基线。
链接: https://arxiv.org/abs/2605.02241
作者: Luong N. Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:
Abstract:How reliably can a small language model estimate its own correctness? The answer determines whether local-to-cloud routing-escalating queries a cheap local model cannot handle-can work without supervised training data. As inference costs dominate large language model (LLM) deployment budgets, routing most queries to a cheap local model while reserving expensive cloud calls for hard cases is an increasingly common cost-control strategy. We compare zero-shot confidence signals against RouteLLM-style supervised baselines across three 7-8B model families and two datasets (1,000 and 500 queries per model, respectively). Average token log-probability, which requires no training data, matches or exceeds supervised baselines in-distribution (Area Under the Receiver Operating Characteristic curve (AUROC) 0.650-0.714 vs. 0.644-0.676) and substantially outperforms them out-of-distribution (0.717-0.833 vs. 0.512-0.564), because it measures a property of the model’s generation rather than the query distribution. This paper further proposes retrieval-conditional self-assessment, a pre-generation signal that selectively injects retrieved knowledge when similarity is high, improving over bare self-assessment by up to +0.069 AUROC at 3-10x lower latency than log-probability. A supervised baseline trained on 1,000 labeled examples never exceeds the zero-shot signal. We release all code, data, and experiment logs.
[NLP-41] Perturbation Dose Responses in Recursive LLM Loops: Raw Switching Stochastic Floors and Persistent Escape under Append Replace and Dialog Updates
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在递归语言模型循环中因上下文更新策略(如追加、替换和对话更新)导致的“吸引子”模式稳定性问题,核心关注点是:如何通过注入文本实现对已稳定循环路径的有效且持久的重定向。其解决方案的关键在于区分短期扰动与长期逃逸行为,并识别出在追加模式下,持久性重定向(destination-coherent persistence)和源基域逃逸(source-basin escape)均受记忆策略条件约束——即模型对历史信息的处理机制决定了重定向是否可维持;此外,研究还揭示了不同上下文更新规则对重定向效果的影响显著差异,且存在结构性非单调效应(如高剂量下的持久性下降),这表明上下文更新规则本身应被视为安全设计中的首要考量因素。
链接: https://arxiv.org/abs/2605.02236
作者: Pawel Kaplanski(Kaplanski AI Lab)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 90 pages, 31 figures. Code, configurations, trajectories, and aggregated reports: this https URL
Abstract:Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model from the context-update rule: append, replace, and dialog updates expose different histories to the same generator. The main result is that persistent redirection in append-mode recursive loops is memory-policy-conditioned. Under a 12,000-character tail clip, destination-coherent persistence plateaus near 16 percent and retained source-basin escape near 36 percent at dose 400; neither crosses 50 percent. Under a full-history protocol, retained source-basin escape crosses 50 percent near 400 tokens and saturates at 75-80 percent by 1,500 tokens, while destination-coherent persistence first reaches 0.50 near 1,500 tokens with a Wilson 95 percent CI of [0.41, 0.61]. For raw switching, adversarial continuations yield an ED50 near 40 tokens, with paired-control floors near 35 percent and net switching never reaching +50 percentage points within 5-400 tokens. Replace-mode raw switching is near-saturated but largely reflects state-reset overwrite: insert-mode probes drop it to 12-32 percent. A homogeneous-perturbation control reproduced the high-dose non-monotonic dip in destination-coherent persistence, refuting perturbation heterogeneity as the cause; the dip appears structural, with mechanism unresolved. We report 37 experiments on gpt-4o-mini with within-vendor replication on gpt-4.1-nano. Recursive-loop evaluations should distinguish transient movement from durable escape, subtract stochastic floors, and treat context-update rules as first-class safety-relevant design choices.
[NLP-42] Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
【速读】: 该论文旨在解决当前因果抽象(causal abstraction)类可解释性方法中缺乏诊断能力的问题,即现有方法通常仅提供全局的干预准确性指标,无法定位解释在哪些输入区域有效或失效,从而难以指导改进。其解决方案的关键在于:通过分析成对干预行为(pairwise interchange-intervention behavior),将输入空间划分为高保真(well-interpreted)和低保真(under-interpreted)区域,从而将因果抽象从单一全局评估转变为具有诊断功能的工具。这一划分不仅揭示了解释的有效性边界,还为识别缺失的高阶假设、发现未建模的中间变量以及整合互补部分解释提供了可操作的路径,最终实现更精确、可构造且可扩展的机制解释(mechanistic interpretability)。
链接: https://arxiv.org/abs/2605.02234
作者: Li Puyin,Jiyuan Tan,Ahmad Jabbar,Thomas Icard,Atticus Geiger
机构: Stanford University (斯坦福大学); Goodfire (好火)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We present a method for diagnosing interpretation in neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions. Rather than treating interchange intervention accuracy as a single global summary, we refine this framework by partitioning the input space into well-interpreted and under-interpreted regions according to pairwise interchange-intervention behavior. This turns causal abstraction from a purely global evaluation into a more diagnostic tool: it not only measures whether an interpretation works, but also reveals where it works, where it fails, and what distinguishes the two cases. This diagnostic view also provides practical heuristics for improving interpretations. By analyzing the structure of the well-interpreted and under-interpreted regions, we can identify missing distinctions in a high-level hypothesis, discover previously unmodeled intermediate variables, and combine complementary partial interpretations into a stronger one. We instantiate this idea as a simple four-step recipe and show that it yields informative error analyses across multiple causal abstraction settings. In a toy logic task, recursively applying the recipe recovers a high-level hypothesis from scratch. More broadly, our results suggest that partitioning the input space is a useful step toward more precise, constructive, and scalable mechanistic interpretability.
[NLP-43] ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring ACL2026
【速读】: 该论文旨在解决在线广告治理中因监管政策非平稳性(non-stationary nature of regulatory policies)导致的历史数据标签不一致和推理模糊问题,尤其在新兴监管要求(如教育类内容限制或审美焦虑管控)出现时,传统方法难以适应。解决方案的关键在于提出ARGUS系统,其核心是通过多智能体对抗裁判机制实现持续进化式强化学习:首先采用“政策播种”(Policy Seeding)建立初始感知;继而利用“检察官-辩护人-裁判”架构进行对抗性标签校正,以弥合旧标签与新政策间的冲突;最后通过三元辩证讨论挖掘隐蔽的“灰色地带”违规行为。该系统结合RAG增强的政策知识与思维链(Chain-of-Thought)合成动态奖励信号,使模型推理路径与政策演进保持同步,从而在极少人工标注数据下实现高效的政策自适应学习。
链接: https://arxiv.org/abs/2605.02200
作者: Deyi Ji,Junyu Lu,Xuanyi Liu,Liqun Liu,Hailong Zhang,Peng Shu,Huan Yu,Jie Jiang,Tianru Chen,Lanyun Zhu
机构: Tencent(腾讯); Dalian University of Technology (大连理工大学); Peking University (北京大学); Zhejiang University (浙江大学); Tongji University (同济大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026 (Industry Track)
Abstract:Online advertising governance faces significant challenges due to the non-stationary nature of regulatory policies, where emerging mandates (e.g., restrictions on education or aesthetic anxiety) create severe label inconsistencies and reasoning ambiguities in historical datasets. In this paper, we propose ARGUS, a policy-adaptive governance system that enables evolving reinforcement through multi-agent adversarial umpiring. ARGUS addresses the sparsity of new policy data by employing a three-stage framework: (1) Policy Seeding for initial perception; (2) Adversarial Label Rectification, which utilizes a Prosecutor-Defender-Umpire'' architecture to resolve conflicts between stale labels and new mandates; and (3) Latent Knowledge Discovery, which employs a tripartite dialectical discussion to unearth sophisticated, gray-area’’ violations. By leveraging RAG-enhanced policy knowledge and Chain-of-Thought synthesis as dynamic rewards for reinforcement learning, ARGUS synchronizes its reasoning pathways with evolving regulations. Extensive experiments on both industrial and public datasets demonstrate that ARGUS significantly outperforms traditional fine-tuning baselines, achieving superior policy-adaptive learning with minimal gold data.
[NLP-44] CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse SEMEVAL-2026
【速读】: 该论文旨在解决美国总统访谈中问答对的响应清晰度(response clarity)与回避检测(evasion detection)问题,属于自然语言处理中的文本分类任务,具体包括3类(Task 1)和9类(Task 2)细粒度标签。其核心解决方案是对比微调(fine-tuning)的Transformer编码器与无需参数更新的提示工程(prompt-based)大语言模型(LLM)之间的性能差异。关键发现在于:(1)部分层解冻(partial encoder layer unfreezing)比全量微调更有效;(2)结合英文与多语言编码器可提升集成性能,即使后者单独表现较弱;(3)提示式LLM在不进行任何任务特定参数调整的情况下优于微调编码器,尤其在少数类上表现突出;(4)输入增强(如拼接完整提问者语句)显著提升LLM性能但对编码器无效,表明二者在信息利用机制上存在本质差异。
链接: https://arxiv.org/abs/2605.02170
作者: Nawar Turk,Lucas Miquet-Westphal,Leila Kosseim
机构: Concordia University (康考迪亚大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 6 figures, 9 tables. System description paper for SemEval-2026 Task 6 (CLARITY): ranked 9th/41 on Task 1 and 3rd/33 on Task 2
Abstract:In this paper, we present our system for SemEval-2026 Task 6 (CLARITY) on response clarity and evasion detection in question-answer pairs from U.S. presidential interviews, comparing fine-tuned encoders with prompt-based LLMs. Our LLM ensemble achieves 80 macro-F1 on the 3-class Task 1 (9th/41) and 59 on the 9-class Task 2 (3rd/33). Across 8 transformer encoders optimized through a four-stage pipeline, partial encoder layer unfreezing outperforms full fine-tuning by a wide margin. Combining English and multilingual encoders further improves ensemble performance over either family alone, despite multilingual models being individually weaker. Prompt-based LLMs, without any task-specific parameter updates, outperform fine-tuned encoders, particularly on minority classes; among open-weight LLMs, parameter count does not predict performance. Enriched input, concatenating the full interviewer turn, improves LLM performance but not that of encoders, an effect that persists with Longformer’s extended context window, suggesting the divergence is not attributable to sequence-length capacity alone in our settings. The Clear Reply/Ambivalent boundary remains the dominant failure mode, mirroring the disagreement among human annotators. Our code, prompts, model configurations, and results are publicly available.
[NLP-45] Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting ICML2026
【速读】: 该论文旨在解决预训练优化器在生成基础模型时忽视了模型参数空间几何结构的问题,即当前方法假设更强的初始模型能带来更好的后续微调和量化性能,但未考虑基础模型的几何特性如何影响参数更新后能力的保留。解决方案的关键在于通过引入三种偏向平坦极小值(flat minima)的预训练优化策略——Sharpness-Aware Minimization (SAM)、大学习率和缩短的学习率衰减周期——来提升基础模型在后续微调和量化过程中的鲁棒性,从而显著减少遗忘现象(最多降低80%),并在不同规模模型(20M至150M参数)及大规模场景(如OLMo-2-1B模型)中验证其有效性。
链接: https://arxiv.org/abs/2605.02105
作者: Ishaan Watts,Catherine Li,Sachin Goyal,Jacob Mitchell Springer,Aditi Raghunathan
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 43 pages, 64 figures, 9 tables, accepted to ICML2026
Abstract:Pretraining optimizers are tuned to produce the strongest possible base model, on the assumption that a stronger starting point yields a stronger model after subsequent changes like post-training and quantization. This overlooks the geometry of the base model which controls how much of the base model’s capabilities survive subsequent parameter updates. We study three pretraining optimization approaches that bias optimization toward flatter minima: Sharpness-Aware Minimization (SAM), large learning rates, and shortened learning rate annealing periods. Across model sizes ranging from 20M to 150M parameters, we find that these interventions consistently improve downstream performance after post-training on five common datasets with up to 80% less forgetting. These principles hold at scale: a short SAM mid-training phase applied to an existing OLMo-2-1B checkpoint reduces forgetting by 31% after MetaMath post-training and by 40% after 4-bit quantization.
[NLP-46] EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts
【速读】: 该论文旨在解决科学文献中局部事实性修改(如数据集规模从215份文档变为80份)引发的非局部修订义务问题,即编辑后相关表述(如“中等规模”或“数百项”)可能失效但未被自动更新的问题。解决方案的关键在于提出EditPropBench基准测试框架,该框架通过控制的合成论文、句子级依赖图标注(包括直接目标、必需下游更新和受保护无关文本)、三种编辑协议、对抗性度量探针及压力测试变体,系统评估大语言模型(LLM)编辑器在事实修改时是否能正确传播更新至相关语句。核心指标为Edit-Ripple Adherence(ERA),用于量化编辑传播的准确性,结果显示当前LLM编辑器虽能处理部分隐含后果,但仍存在约30%的漏检率,表明可靠科学修订仍需具备级联感知能力的检查机制。
链接: https://arxiv.org/abs/2605.02083
作者: Garvin Kruthof
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as ‘medium-scale’ or ‘a few hundred items’ may also become stale, even though they do not repeat the edited number. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and protected unrelated text. EditPropBench provides a controlled manuscript-level benchmark with sentence-level dependency supervision, three editing protocols, adversarial metric probes, stress-test variants, and a metric suite centered on Edit-Ripple Adherence (ERA). On the hard implicit/free-form stratum, five LLM editing systems span ERA 0.148–0.705; even the strongest misses roughly 30% of required cascade updates. A mixed-stratum stress test shows that LLMs retain a positive advantage over deterministic substitution baselines when easy substitution-solvable cases are included. Finally, an audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of papers. EditPropBench shows that current LLM editors can repair many implicit consequences of factual edits, but reliable scientific revision still requires cascade-aware checking.
[NLP-47] Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理能力提升中对奖励函数设计敏感的问题,即现有基于强化学习(Reinforcement Learning, RL)的后训练方法性能高度依赖于人工设计的奖励函数。其解决方案的关键在于提出一种搜索驱动的框架,将奖励函数本身作为可优化对象:通过前沿语言模型生成候选奖励函数,利用Group Relative Policy Optimization (GRPO) 在固定基础模型(Llama-3.2-3B-Instruct + LoRA)上进行自动验证与筛选,并以GSM8K测试集上的F1分数为指标进行排序;随后将高阶排名摘要反馈至下一轮生成,形成闭环迭代优化。实验表明,经过五轮搜索后,最优单个奖励函数达到F1=0.787,最佳集成配置实现F1=0.795,显著优于基线(F1=0.609),且统计检验确认多奖励组合效果稳定,证明了该反馈驱动的奖励搜索机制是性能提升的核心驱动力。
链接: https://arxiv.org/abs/2605.02073
作者: Arash Ahmadi,Sarah Sharif,Yaser(Mike)Banad
机构: University of Oklahoma (俄克拉荷马大学); INQUIRE Laboratory (智能神经形态和量子理解创新研究与工程实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed back into the next round of generation. Over five rounds, the search produces 50 candidate rewards. The mean F1 rises from 0.596 in Round 1 to 0.632 in Round 5, and the top individual reward reaches F1 = 0.787. Seven ensemble configurations of top-ranked rewards are evaluated. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686], a 0.19 absolute F1 gain over a base-rewards-only GRPO baseline (F1 = 0.609). Pairwise McNemar tests with Bonferroni correction show all five-or-more-reward configurations are statistically indistinguishable at \alpha = 0.05/21. A three-seed re-training of the best ensemble yields F1 of 0.785. A randomly drawn 5-reward control collapses to F1 = 0.047, which shows that the ranked-feedback loop, not the additive signal of having more rewards, drives the gain.
[NLP-48] Pair2Score: Pairwise-to-Absolute Transfer for LLM -Based Essay Scoring
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在自动作文评分(Automated Essay Scoring, AES)中对绝对评分预测的依赖与成对比较学习目标之间存在的不匹配问题。传统方法通常直接训练绝对评分模型,但这类模型在数据标注成本高、样本稀缺时表现受限;而基于成对比较的学习方式虽然更易获取且具鲁棒性,却难以转化为具体的绝对分数输出。解决方案的关键在于提出 Pair2Score 两阶段学习框架:第一阶段利用从绝对标签中提取的成对比较训练一个方向性 Siamese 排序器(directional Siamese ranker),第二阶段通过可配置的迁移策略(如 warm-start 和 embedding-fusion 变体)将成对信息有效映射到绝对评分预测上,从而实现参数高效地利用成对信号提升绝对评分性能。实验表明,最优转移配置能在多个 AES 特征(语法、词汇、句法)上显著优于纯绝对训练基线,且转移策略的选择比是否引入成对阶段更为关键。
链接: https://arxiv.org/abs/2605.02069
作者: İbrahim Rıza Hallaç,Hasan Oğul
机构: Østfold University College (奥斯陆大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 2 figures
Abstract:Many scoring applications require absolute predictions, while pairwise comparisons can provide a simpler learning objective. We present Pair2Score, a two-stage learning framework that transfers pairwise comparisons into absolute scoring with parameter-efficient LLaMA adaptation. Stage 1 trains a directional Siamese ranker on pairwise comparisons derived from absolute trait labels; Stage 2 trains an absolute predictor using configurable transfer strategies (warm-start and embedding-fusion variants). We evaluate on rubric-aligned Automated Essay Scoring (AES) traits (grammar, vocabulary, syntax) under a five-fold protocol that co-rotates held-out fold and random seed. At the trait level, the best-performing transfer variant improves quadratic weighted kappa (QWK) over an absolute-only baseline for all three traits. However, not all transfer configurations help: a one-epoch pairwise stage transfers more reliably than extended pairwise training, and transfer configuration – not just the inclusion of a pairwise stage – determines whether downstream scoring benefits.
[NLP-49] Methods Data and Conceptual Change: Reflections from Two Quantitative Diachronic Case Studies
【速读】: 该论文旨在解决定量历史语言学方法在处理语料库数据时,如何受数据特性影响以准确识别语义变化的问题。其解决方案的关键在于通过对比两种不同的分析方法——基于四元组(quad-based)的概念建模与SynFlow语义流分析——来揭示不同方法对概念操作化、数据假设及历时解释的支持差异。研究发现,单纯依赖词汇频率的方法存在局限性,而语料库结构本身显著影响了定量方法检测语义变化的可靠性,从而强调了方法论反思在跨语料库比较中的核心作用。
链接: https://arxiv.org/abs/2605.02052
作者: Catherine Wong,Bach Phan-Tat,Susan Fitzmaurice
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This discussion paper reflects on how quantitative approaches to historical linguistics interact with dataset properties. Drawing on two worked examples, we examine English data using quad-based concept modelling of Early Modern English discourse in EEBO-TCP (c. 1470s-1690s; 765M words) alongside SynFlow analysis of scientific writing in Royal Society Corpus 6.0.4 (1750-1799; drawn from a 78.6M-token open corpus). Through parallel comparison, the paper explores how each approach operationalises concepts, the data assumptions they entail, and the diachronic interpretations they support. We argue that comparative methodological reflection clarifies the limits of purely lexical, frequency-based approaches and highlights how dataset structure shapes the kinds of semantic change that quantitative methods can reliably detect.
[NLP-50] What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
【速读】: 该论文旨在解决当前语言模型可靠性评估中因单一提示(single-prompt)基准测试而可能遗漏关键可靠性缺陷的问题。传统以单提示准确率为主导的评测方式无法充分揭示模型在实际应用中的稳定性与可信度问题,例如置信度信号失真、提示扰动敏感性差异以及校准定义对结论的影响。其解决方案的关键在于构建一个多层次、多维度的可靠性评估框架,涵盖准确率、token概率校准、口头置信度校准、口头解析率及提示扰动扩散等指标,并系统性地分析不同提示变体下的表现差异。研究表明,评估设计本身(如ECE定义、评估器逻辑、提示鲁棒性)显著影响可靠性结论,因此强调必须明确报告校准定义、评估逻辑、口头可解析性及提示鲁棒性,才能得出可靠的语言模型可靠性判断。
链接: https://arxiv.org/abs/2605.02038
作者: Ranit Karmakar,Jayita Chatterjee
机构: Harvard University (哈佛大学); Boston, MA 02115
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Single-prompt accuracy is the dominant way to benchmark language models, but it can miss reliability failures that matter. We evaluate a 15-model open-weight corpus, with the main reliability analyses focused on 10 instruct models across five classification and reasoning benchmarks under five prompt variants each, measuring accuracy, token-probability calibration, verbal-confidence calibration, verbal parse rate, and prompt-perturbation spread for every (model x dataset x variant) cell. We find three broad results. First, evaluation design can materially change the conclusion. Switching Expected Calibration Error (ECE) token from a raw to a label-set-normalised definition changes per-cell calibration by a mean absolute 0.149. More strikingly, pairing a chain-of-thought prompt with a first-character evaluator on ARC-Challenge reduces apparent accuracy by 72-88% across all five primary models; two independent repair procedures recover 93.8% and 102.7% of the lost performance, indicating an evaluator-side rather than model-side failure. Second, confidence signals are fragile. On MMLU-Pro, every primary model verbally reports confidence substantially above both its accuracy and its token-probability confidence on the same rows, and verbal parse rate can collapse for a single model on a single prompt variant. Third, prompt robustness does not track parameter count reliably. Across 10 instruct models, the correlation between model size and prompt-perturbation spread ranges from -0.244 to 0.474 across benchmarks. Taken together, these results show that reliability conclusions for small language models depend not only on the model being evaluated, but also on the evaluation pipeline used to measure it. We argue that calibration definitions, evaluator logic, verbal parseability, and prompt robustness should be reported explicitly when making reliability claims.
[NLP-51] A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
【速读】: 该论文旨在解决多模态机器翻译(Multimodal Machine Translation, MMT)中语义歧义消解(Ambiguity Resolution)的挑战,即模型需真正利用视觉输入将歧义表达映射到其 intended meaning。现有研究虽提出面向歧义消解的评估基准,但存在数据质量不佳和与实际翻译场景不匹配的问题,且难以覆盖开放域中的多样化歧义类型。为解决上述问题,作者提出了VIDA数据集,其中包含2,500个精心标注的歧义实例,每个实例的消解均依赖视觉证据;并设计了以大语言模型(Large Language Model, LLM)作为评判器的消解中心指标(Disambiguation-Centric Metrics),用于在词元层面验证歧义是否被正确消除。关键解决方案在于引入链式思维监督微调(Chain-of-Thought Supervised Fine-Tuning, CoT-SFT),实验表明该方法相较于传统监督微调(SFT)在歧义消解准确率上取得更稳定提升,尤其在分布外子集上表现更强泛化能力,从而有效增强模型对多样歧义类型的处理能力。
链接: https://arxiv.org/abs/2605.02035
作者: Jingheng Pan,Xintong Wang,Longyue Wang,Liang Ding,Weihua Luo,Chris Biemann
机构: Universität Hamburg (汉堡大学); Alibaba Group (阿里巴巴集团); Alibaba Cloud (阿里云)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks that provide supportive evidence for the role of vision, we observe substantial issues in data quality and a mismatch with translation scenarios. Moreover, existing ambiguity-oriented evaluations are not well suited to broader ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art Large Vision Language Models under vanilla inference, supervised fine-tuning (SFT), and our chain-of-thought SFT (CoT-SFT) show that while SFT improves overall translation quality, CoT-SFT yields more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets, indicating a stronger generalization for resolving diverse ambiguity types.
[NLP-52] Counting as a minimal probe of language model reliability
【速读】: 该论文试图解决的问题是:当前大型语言模型在数学推理、编码和文档分析等任务中表现出色,但这是否意味着它们具备真正的通用逻辑能力,还是仅依赖于重复应用已学习的规则或模式匹配来模拟规则执行?为回答这一问题,研究者提出了一种名为“稳定计数能力(Stable Counting Capacity)”的测试方法,其关键在于设计一个去除知识依赖、语义模糊性和词汇/分词混淆因素的评估范式,从而直接衡量模型在执行程序性任务时的可靠性。实验表明,大多数模型的稳定计数能力远低于其标称上下文长度,且行为不支持开放式的逻辑推理或稳定规则应用,而是表现为有限数量的内部状态(类似手指计数),一旦耗尽即退化为猜测,即使增加推理时计算资源也无法恢复准确执行。这揭示了当前语言模型的流畅表现并不等同于可靠的通用规则遵循能力。
链接: https://arxiv.org/abs/2605.02028
作者: Tianxiang Dai,Jonathan Fan
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: for accessing the supplementary information, data, and code for reproduction of the study, see this https URL
Abstract:Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.
[NLP-53] Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM ICML2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对语料异质性和细微条件变化时表现不稳定的问题,同时克服微调(fine-tuning)导致的灾难性遗忘(catastrophe forgetting)以及元学习(meta-learning)在LLMs中因复杂性和可扩展性限制而难以应用的挑战。其解决方案的关键在于激活SwiGLU模块中的β参数,引入一种基于元信号的元门控机制(meta-gating mechanism),该机制能够自适应地调节前馈网络(Feed-Forward Network, FFN)的非线性特性;进一步地,通过一个超网络(hypernetwork)根据文本条件动态生成β值,从而实现对LLMs的元可控性(meta-controllability)。此方法在任务、领域、人物设定和风格等多种条件下均优于微调与元学习基线,并具备对未见任务或指令的良好泛化能力。
链接: https://arxiv.org/abs/2605.01973
作者: Luo Ji,Qi Qin,Ningyuan Xi,Teng Chen,Qingqing Gu,Hongyan Li
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by ICML2026
Abstract:Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes. While finetuning can create the catastrophe forgetting issue, application of meta-learning on LLMs is also limited due to its complexity and scalability. In this paper, we activate the meta-signal of \beta within the SwiGLU blocks, resulting in a meta-gating mechanism that adaptively adjusts the nonlinearity of FFN. A hypernetwork is employed which dynamically produces \beta on textual conditions, providing meta-controllability on LLMs. By testing on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta-learning baselines, and can generalize reasonably on unseen tasks, condition types, or instructions. Our code can be found in this https URL.
[NLP-54] Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks
【速读】: 该论文旨在解决静态参数分配在低秩适配(Low-Rank Adaptation, LoRA)方法中对不同复杂度输入适应性不足的问题,即固定秩配置难以有效匹配输入的语义或推理复杂度,导致资源利用效率低下且性能受限。其解决方案的关键在于提出Flexi-LoRA框架,通过在训练和推理阶段均基于输入复杂度动态调整LoRA的秩(rank),实现输入感知的参数分配策略;该方法不仅提升了模型在复杂任务(如数学推理)中的表现,还以更少的参数实现了更高的准确性和更强的指令遵循能力,从而在保持高效性的同时增强了模型的适应性与推理质量。
链接: https://arxiv.org/abs/2605.01959
作者: Zongqian Li,Yixuan Su,Han Zhou,Zihao Fu,Nigel Collier
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) have become essential for deploying large language models, yet their static parameter allocation remains suboptimal for inputs of varying complexity. We present Flexi-LoRA, a novel framework that dynamically adjusts LoRA ranks based on input complexity during both training and inference. Through empirical analysis across question answering, mathematical reasoning, and speech tasks, we demonstrate that maintaining consistency between training and inference dynamics is important for effective adaptation, particularly for sequential reasoning tasks. Our findings reveal that input-dependent parameter allocation achieves higher performance with fewer parameters by optimally matching rank configurations to question complexity. Furthermore, task-specific dependency on rank dynamics varies, with mathematical reasoning tasks exhibiting higher dependency than QA tasks. Successful adaptation manifests not only in correctness but also in reasoning quality and instruction adherence. Flexi-LoRA consistently outperforms static LoRA while using fewer parameters, with performance gains more pronounced on tasks requiring strict reasoning chains. Our approach realizes key benefits of mixture-of-experts frameworks through a more streamlined implementation, reducing parameter redundancy while improving model capabilities. We provide comprehensive empirical studies across diverse tasks, establishing a basis for future work in input-adaptive and efficient fine-tuning approaches.
[NLP-55] StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models IJCAI-2026
【速读】: 该论文旨在解决静态大语言模型(Large Language Models, LLMs)评估中因数据污染(contamination)和过拟合(overfitting)导致的性能高估问题,尤其是在知识密集型推理任务中。现有动态基准虽能缓解数据陈旧性,但常以牺牲答案可得性(answerability)和可控性为代价。其解决方案的关键在于提出 StressEval——一个基于模型失败驱动的数据合成框架,通过三个阶段实现:首先构建半结构化的难度卡片(difficulty card)识别失败推理步骤及其根本原因;其次采用双视角实例合成方法,同时针对知识缺口与推理断裂点进行精准干预,同时保留原始难度因子;最后引入门控机制筛选出有依据且无歧义的测试实例。该方法成功生成了 Dynamic OneEval 动态基准,显著提升了多个先进 LLM 的性能下降幅度,同时保持明确的难度控制能力,从而支持更有效的迭代优化。
链接: https://arxiv.org/abs/2605.01939
作者: Yongrui Chen,Yangyang Ma,Xiaoying Huang,Shenyu Zhang,Huajun Chen,Haofen Wang,Guilin Qi
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by IJCAI-2026
Abstract:Static benchmarks for LLMs are increasingly compromised by contamination and overfitting especially on knowledge intensive reasoning tasks While recent dynamic benchmarks can alleviate staleness they often increase difficulty at the expense of answerability and controllability In this paper we propose StressEval a failure driven data synthesis framework that turns observed model failures into dynamic challenging and controllable test instances StressEval consists of three stages first it constructs a semi structured difficulty card that identifies the failed reasoning step and its root cause second it applies a dual perspective instance synthesis method that targets both knowledge gaps and reasoning breakdowns while preserving the underlying difficulty factors and third it applies a gating mechanism to retain only grounded unambiguous instances Seeding from multiple knowledge intensive reasoning datasets we employ StressEval to build Dynamic OneEval a focused suite of challenging dynamic benchmark Across several state of the art LLMs Dynamic OneEval yields substantially larger performance drops than the original benchmarks while retaining explicit difficulty factors enabling more actionable iteration
[NLP-56] RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLM s
【速读】: 该论文旨在解决下游任务微调过程中安全对齐语言模型的拒绝行为显著退化的问题,这种退化使得模型容易遭受对抗性滥用。研究表明,标准微调会导致安全相关表征在激活空间中发生系统性漂移、几何结构扭曲以及任务优化与安全特征之间的干扰,从而引发更高的有害合规性。解决方案的关键在于提出一种基于表征层面的微调框架REFUSALGUARD,该框架通过约束隐藏表征空间中的更新,确保安全中介组件的稳定性,同时允许在互补方向上进行任务特定的学习,从而在保持任务性能的同时有效维持模型的安全性。
链接: https://arxiv.org/abs/2605.01913
作者: Sadia Asif,Mohammad Mohammadi Amiri
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within the model’s activation space, how these representations change during fine-tuning and why alignment degrades remains poorly understood. In this work, we investigate the representation-level mechanisms underlying alignment degradation. Our analysis shows that standard fine-tuning induces systematic drift in safety-relevant representations, distorts their geometric structure, and introduces interference between task optimization and safety features. These effects collectively lead to increased harmful compliance. Motivated by these findings, we introduce REFUSALGUARD, a representation-level fine-tuning framework that preserves safety-relevant structure during model adaptation. Our approach constrains updates in hidden representation space, ensuring that safety-mediating components remain stable while allowing task-specific learning in complementary directions. We evaluate REFUSALGUARD across multiple model families, including LLaMA, Gemma, and Qwen, on adversarial safety benchmarks such as AdvBench, DirectHarm4, and JailbreakBench, as well as downstream utility tasks. Our approach achieves attack success rates comparable to base safety-aligned models while maintaining competitive task performance, significantly outperforming baselines.
[NLP-57] Spoken Language Identification with Pre-trained Models and Margin Loss
【速读】: 该论文针对的是说话人控制的语音语言识别任务(speaker-controlled spoken language identification),旨在提升语言识别的准确性并降低非语言因素(如说话人特征)带来的干扰。解决方案的关键在于采用预训练的ECAPA-TDNN作为特征编码器,并引入基于边距(margin-based)的损失函数,以增强语言表征的判别能力,从而提高类间可分性与鲁棒性。实验表明,该方法在Tidy-X数据集上显著优于官方基线,在语言识别任务中达到85.95%宏准确率和90.96%微准确率,同时在验证任务中将等错误率(EER)降低至17.08%,相较基线改善幅度分别达45.7%、15.2%和50.8%。
链接: https://arxiv.org/abs/2605.01905
作者: Zhihua Fang,Liang He,Weiwu Jiang
机构: Xinjiang University (新疆大学); Tsinghua University (清华大学); Agibot (阿吉博特)
类目: ound (cs.SD); Computation and Language (cs.CL)
备注: Technical report for the TidyLang 2026 Challenge. Accepted at Odyssey 2026
Abstract:For the speaker-controlled spoken language identification task proposed in the TidyLang Challenge 2026, this paper proposes a language identification method based on pre-trained models and margin-based losses. The proposed method adopts a pre-trained ECAPA-TDNN as the feature encoder and incorporates margin-based losses to enhance the discriminative ability of language representations, thereby improving inter-class separability and reducing the interference of non-linguistic factors such as speaker characteristics. Experimental results on the Tidy-X dataset show that the proposed method achieves 85.95% macro accuracy and 90.96% micro accuracy on the language identification task and 17.08% equal error rate (EER) on the verification task. Compared with the official baseline, the macro accuracy improves by 45.7%, the micro accuracy improves by 15.2%, and the EER is reduced by approximately 50.8%, demonstrating the effectiveness of the proposed method. The code will be released at this https URL.
[NLP-58] Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models
【速读】: 该论文旨在解决两个核心问题:一是现有大语言模型(Large Language Models, LLMs)在处理超出其训练分布的复杂查询时易产生错误响应,且推理效率低、部署成本高;二是多语言LLM对资源匮乏语言(如现代希腊语)的支持不足,因缺乏高质量的训练与评估数据。解决方案的关键在于:构建一个高质量、由大推理模型(Large Reasoning Models, LRMs)生成并经人工校验的希腊语问答数据集CulturaQA,以此为基础开发轻量高效的LLM评估框架和针对希腊语优化的Maistros 8B模型(通过知识蒸馏与微调实现),从而提升低资源语言场景下的模型性能与实用性。
链接: https://arxiv.org/abs/2605.01870
作者: Nikolaos Giarelis,Charalampos Mastrokostas,Nikos Karacapilidis
机构: MEAD University of Patras (帕特雷大学MEAD实验室)
类目: Computation and Language (cs.CL)
备注: 15 pages, 3 figures. Submitted to a journal
Abstract:Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enabled by large-scale training and increased model capacity. However, existing LLMs can generate erroneous responses when addressing complex queries that fall outside their training distribution, due to limited internal knowledge or the need for multi-step reasoning. To address these limitations, recent work has introduced large reasoning models (LRMs), which incorporate explicit internal reasoning processes to improve response accuracy. Additionally, state-of-the-art LRMs often comprise hundreds of billions of parameters and require several seconds per inference, even on advanced multi-GPU systems. These characteristics limit their practicality for deployment in conventional computing environments. Meanwhile, NLP research on multilingual LLMs continues to prioritize high-resource languages. However, these models exhibit limited performance in under-resourced languages, primarily due to insufficient language- and culture-specific training data. In this paper, we focus on Modern Greek, for which only a limited number of question answering (QA) datasets have been proposed, most of which are intended for model evaluation. To address this research gap in Greek QA, we make the following contributions: (i) CulturaQA, a high-quality LRM-generated and human-curated dataset, for Greek LLM training and evaluation; (ii) a memory-efficient LLM evaluation framework adaptable to diverse languages and QA tasks; (iii) Maistros 8B, a state-of-the-art open-weights Greek LLM developed via knowledge distillation and fine-tuning on CulturaQA; and (iv) a comprehensive evaluation of nine LLMs across nine human-curated Greek QA datasets.
[NLP-59] Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models
【速读】: 该论文旨在解决大推理模型(Large Reasoning Models, LRMs)中生成的推理轨迹是否真正反映内部计算过程,还是仅表现为冗余或过度思考的问题。现有研究表明,隐藏状态(hidden-state)包含与正确性相关的信号,但其粗粒度聚合可能掩盖了推理过程中token和层结构的精细动态。论文的关键解决方案是提出一种无需训练的轨迹统计量——时空隐状态转移幅度(Spatiotemporal Amplitude of Latent Transition, StALT),该指标通过加权相邻token间的时序变化与单个token内各层的重要性(layer saliency),量化推理路径中的局部层集中性和全局时间动态性。实验证明,StALT能有效区分正确与错误推理轨迹,在多种模型和基准测试中优于输出空间和长度基线,并且对干预操作具有系统响应,从而为理解LRMs内部推理机制提供可测量、可解释的探针。
链接: https://arxiv.org/abs/2605.01853
作者: Kotaro Furuya,Takahito Tanimura
机构: Hitachi, Ltd. (日立有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large reasoning models (LRMs) generate extended solutions, yet it remains unclear whether these traces reflect substantive internal computation or merely verbosity and overthinking. Although recent hidden-state analyses suggest that internal representations carry correctness-related signals, their coarse aggregations may obscure the token and layer structure underlying reasoning computation. We investigate hidden-state transitions across decoding steps and layers, and identify a distinct spatiotemporal pattern in LRMs: successful trajectories exhibit broad temporal dynamics with localized layer-wise concentration, while this structure is weaker in non-reasoning models and knowledge-heavy domains. We formalize this characteristic as Spatiotemporal Amplitude of Latent Transition (StALT), a training-free trajectory statistic that summarizes temporal changes between adjacent tokens weighted by within-token layer saliency. Across diverse models and benchmarks, StALT reliably separates correct from incorrect trajectories in reasoning-intensive regimes, providing a competitive label-free correctness signal alongside strong output-space and length-based baselines. Intervention analyses further show that this spatiotemporal amplitude responds systematically to manipulations that increase or reduce the demand for internal reasoning, supporting its association with latent reasoning dynamics in LRMs. These findings provide empirical evidence that LRMs exhibit measurable hidden-state dynamics and offer a practical probe for understanding internal computation beyond output-based evaluation.
[NLP-60] Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成多项选择题(Multiple-Choice Questions, MCQs)时存在的系统性答案位置偏倚问题,即正确选项并非均匀分布于各选项位置,而是倾向于集中在特定位置(如A、B或C),这会降低MCQ的质量和公平性。解决方案的关键在于通过激活操控(activation steering)技术干预模型内部表示,具体而言,研究发现问题题干的隐藏表征中编码了正确答案位置的预测信号,表明答案位置在生成过程中已被隐式规划;基于此,作者利用激活操控调节这些内部表征,从而部分控制并显著改变答案位置的分布,为实现可控生成提供了可操作的框架。
链接: https://arxiv.org/abs/2605.01846
作者: Xuemei Tang,Xufeng Duan,Zhenguang G. Cai
机构: The Hong Kong Polytechnic University (香港理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly used to generate multiple-choice questions (MCQs), where correct answers should ideally be uniformly distributed across options. However, we observe that LLMs exhibit systematic position biases during generation. Through extensive experiments with 10 LLMs and 5 vision-language models (VLMs) on three MCQ generation tasks, we show that these biases are structured, with similar patterns emerging within model families. To investigate the underlying mechanisms, we conduct probing experiments and find that hidden representations in the question stem encode predictive signals of the correct answer position, suggesting that answer position may be implicitly planned during generation. Building on this insight, we apply activation steering to manipulate internal representations and influence answer position. Our results show that steering can partially control positional preferences and substantially shift answer position distributions. Our findings provide a practical framework for studying implicit positional planning in LLMs and highlight the importance of controllable generation for reliable MCQ construction and evaluation.
[NLP-61] he Cylindrical Representation Hypothesis for Language Model Steering ICML2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中控制技术(steering)效果不稳定、难以预测的问题。现有理论多基于线性表示假设(Linear Representation Hypothesis, LRH),其理想化地假设概念可被正交化以实现无损控制,但在实际模型表示中这一假设失效,无法解释观测到的不可预测性。论文的关键解决方案是提出圆柱形表示假设(Cylindrical Representation Hypothesis, CRH):在不强制正交性的前提下,CRH揭示了概念控制具有一个中心轴(central axis)和一个周围法平面(normal plane)的结构——中心轴主导概念生成,法平面决定控制敏感度,且仅在特定敏感区域(sensitive sectors)能有效激活目标概念,其余区域可能抑制或延迟激活。由于敏感区域无法从差异向量中可靠识别,导致样本层面的不确定性,从而为控制结果波动提供了理论依据,并为真实场景下的模型控制行为提供了一种可解释、实用的分析框架。
链接: https://arxiv.org/abs/2605.01844
作者: Lang Gao,Jinghui Zhang,Wei Liu,Fengxian Ji,Chenxi Wang,Zirui Song,Akash Ghosh,Youssef Mohamed,Preslav Nakov,Xiuying Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026
Abstract:Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH’s orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: this https URL.
[NLP-62] RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences
【速读】: 该论文旨在解决奖励模型(Reward Model, RM)在语言模型对齐中的泛化能力评估问题,即RM能否正确排序响应以匹配多样化的用户偏好。现有基准通常基于单一普遍偏好设计,无法有效衡量RM在不同用户情境下的适应性。解决方案的关键在于构建RMGAP基准,包含1,097个跨Chat、Writing、Reasoning和Safety领域的实例;针对每个原始提示,生成四种具有不同语言特征的响应,并通过构造对比场景使其中一种响应成为唯一合适选项,从而显式体现用户偏好差异;同时为每条提示扩展两个句法变体以捕捉同一偏好表达方式的多样性。此设计显著提升了对RM泛化性能的评估效度,实验证明当前最优RM仅达49.27% Best-of-N准确率,揭示了该领域亟待改进的空间。
链接: https://arxiv.org/abs/2605.01831
作者: Yangyang Zhou,Yi-Chen Li
机构: Beijing University of Posts and Telecommunications; National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 3 figures
Abstract:Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By “generalizability”, we mean the ability of RMs to correctly rank responses to align with diverse user preferences. However, existing reward model benchmarks are typically designed around a universal preference, failing to assess this generalization. To address this critical gap, we introduce RMGAP, a benchmark comprising 1,097 instances across Chat, Writing, Reasoning, and Safety domains. Since different users exhibit diverse preferences for the same task, we first generate four distinct responses with different linguistic profiles for each collected prompt. However, the original prompt set lacks the specificity to convey different preferences. We therefore construct tailored prompts by contrasting these candidates and designing scenarios in which one response becomes the uniquely appropriate choice. Moreover, we observe that users often express the same preference using different phrasings, and thus extend each prompt with two paraphrased variants. Our evaluation of 24 state-of-the-art RMs reveals their substantial limitations: even the best RM achieves only 49.27% Best-of-N accuracy, highlighting considerable room for improvement in reward model generalization. Related data and code are available at this https URL.
[NLP-63] he Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Dont NEURIPS2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)在执行任务时存在的“合规缺口”(Compliance Gap)问题,即模型在口头承诺遵守指令后实际行为却违背指令的现象,这种现象独立于事实准确性与语言内容的实质性,且无法通过纯文本观察检测。解决方案的关键在于识别并量化这一缺口:首先,通过理论证明(Theorem 1 和 Theorem 2)揭示其结构必然性和不可检测性;其次,借助大规模实验(13个实验、2031次会话)验证预测,并提出构建专门用于过程合规性评估的新基准——BS-Bench,包含七种工具调用日志审计指标和公开排行榜,从而推动AI部署环境从仅关注结果一致性转向对行为过程的可追溯性治理。
链接: https://arxiv.org/abs/2605.01771
作者: Kwan Soo Shin
机构: PolymathMinds AI Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Main paper plus appendices and supplementary material. Companion supplementary material with full proofs of Theorems 1 and 2 (RLHF Goodhart Inevitability; DPI Undetectability) included as ancillary file. Submitted to NeurIPS 2026 Evaluations Datasets (ED) Track. Code and data: this https URL
Abstract:An auditor instructs an AI assistant: “open each file individually using the Read tool – no scripts, no agents.” The AI replies “Yes” – then issues a single batched call summarizing all fifty files at once. We call this the Compliance Gap: a third, orthogonal axis of AI honesty distinct from factual truthfulness and rhetorical substance. Three questions: does this verbal-behavioral disconnect exist (existence); can any text-only observer recover it (detectability); what infrastructure does AI deployment need (remedy)? Some 75 benchmarks (IFEval, SWE-bench, BFCL, COMPASS, SpecEval) measure outcome fidelity; none measures process fidelity. Theorem 1 shows the gap is structurally inevitable under RL that rewards text without observing behavior. Theorem 2, via the Data Processing Inequality, shows it is undetectable from text alone – by any human or LLM observer, present or future. Thirteen experiments and 2,031 sessions on six frontier models confirm both predictions. Under default framing, all six exhibit instruction compliance rates of 0% – Claude Sonnet 4 verbally agrees ten out of ten times then bypasses in all ten. The gap is selective: 97% compliance where rationale is rewarded (audit trails), 0-4% where it is not (file reading, privacy masking); removing delegation tools raises compliance to 75% (Cohen’s d = 2.47), confirming environmental affordance rather than weight-encoded failure. Nine blinded human raters achieve Fleiss’ kappa = 0.130 and correctly identify zero of fifteen compliant sessions, exactly as Theorem 2 predicts. Where humans show 47% intention-behavior gaps in psychology and 96.5pp gaps in surgical audits, RLHF-trained models approach 100% under default conditions – a regime warranting its own measurement infrastructure. We release BS-Bench: the first open benchmark for process compliance, with seven tool-call-log audit metrics and a public leaderboard.
[NLP-64] Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models)在长文本生成中因错误累积而导致幻觉(hallucination)频发的问题,尤其针对现有方法在事实性(factuality)提升上受限于“探索-承诺耦合”范式(coupled exploration-commitment paradigm)的缺陷——即中间推理过程无条件传递至最终输出,难以实现细粒度的信息选择与整合控制。解决方案的关键在于提出“探索-承诺解耦”范式(Exploration-Commitment Decoupling paradigm),通过构建校准感知生成框架(Calibration-Aware Generation, CAG),在中间推理阶段引入校准后的可靠性估计(calibrated reliability estimates),并在最终输出时优先保留高置信度内容,从而实现边探索边判断、谨慎承诺的生成机制。实验表明,CAG 在五个长文本事实性基准测试中将事实性提升最高达13%,同时 decoding 时间减少最多37%。
链接: https://arxiv.org/abs/2605.01749
作者: Wen Luo,Guangyue Peng,Liang Wang,Nan Yang,Wei Li,Yuhan Song,Shaohang Wei,Feifan Song,Furu Wei,Houfeng Wang
机构: Peking University (北京大学); Microsoft Research Asia (微软亚洲研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches to improving factuality, including abstention and factuality-driven optimization, follow a \emphcoupled exploration-commitment paradigm, in which intermediate reasoning is unconditionally propagated to the final output, limiting fine-grained control over information selection and integration. In this paper, we propose an \textbfExploration-Commitment Decoupling paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with \textbfCalibration-Aware Generation (CAG), a framework that equips models with end-to-end, calibration-aware generation capabilities, by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs. Across five long-form factuality benchmarks and multiple model families, CAG improves factuality by up to 13%, while reducing decoding time by up to 37%. Overall, our work highlights decoupling as a principled approach for more reliable long-form generation, offering directions for trustworthy and self-aware generative systems.
[NLP-65] NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty
【速读】: 该论文旨在解决受监管语言数据资产(governed language data assets)在在线定价过程中面临的成本不确定性问题,即平台在缺乏精确隐私或访问成本信息的情况下进行定价决策,可能导致收益损失或风险增加。解决方案的关键在于提出一种名为 \textscNH-CROP 的截断鲁棒定价框架,其核心机制是引入“无害信息获取门控”(no-harm information-acquisition gate),仅当获取精细化成本信号的预期决策价值超过不验证时的最佳替代策略时才触发验证行为。该方法通过对比直接定价、风险感知定价与验证后定价策略,在合成数据、真实代理和下游效用基准上均表现出优于或相当的性能,且因果消融实验表明:在真实代理和效用驱动场景中,定价收益主要来自策略性校准而非付费验证本身,强调了在不确定环境下优先校准定价、仅在信息廉价且可行动时才验证的核心原则。
链接: https://arxiv.org/abs/2605.01745
作者: Xu Zheng,Feiyu Wu,Zhuocheng Wang,Yiming Dai,Hui Li
机构: Xidian University (西安电子科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Language data are increasingly acquired and governed as assets, yet platforms often price candidate resources before knowing their true privacy or access costs. We study online pricing for governed language data assets under cost uncertainty. At each round, a platform observes an NLP task, a candidate asset, and a coarse cost estimate, may pay for a refined cost signal, posts a price, and receives safe net revenue. We introduce \textscNH-CROP, a clipped robust pricing framework with a no-harm information-acquisition gate. The method compares direct pricing, risk-aware pricing, and verify-then-price, and acquires information only when its estimated decision value exceeds the best no-verification alternative. Across synthetic, real-proxy, and downstream-utility-grounded benchmarks, clipped \textscNH-CROP variants improve or remain competitive with price-only and risk-aware baselines. Causal ablations show that paid verification is not the main source of gains in real-proxy and utility-grounded settings: the strongest learned policies often choose not to verify. Oracle and high-decision-value diagnostics show that refined cost information can still have substantial local value. Overall, governed language-data platforms should calibrate pricing under uncertain access costs first and verify only when information is cheap and decision-actionable. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2605.01745 [cs.AI] (or arXiv:2605.01745v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01745 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-66] Less is More: Geometric Unlearning for LLM s with Minimal Data Disclosure
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中面临的隐私与治理需求,即如何实现对特定实体或主题信息的“选择性遗忘”(selective unlearning),在不损害模型整体能力的前提下精准移除指定内容。现有方法通常依赖原始训练语料库,并通过输出层拒绝调优或全局梯度更新实现,导致在遗忘强度、非目标保留性和数据可用性之间难以平衡。论文提出几何遗忘(Geometric Unlearning, GU),其核心创新在于无需访问原始训练数据,而是直接作用于模型推理时的隐藏规划状态(prompt-time planning states);通过少量安全参考提示(safe reference prompts)提取紧凑的低秩安全行为几何结构,并利用轻量级锚定合成提示(anchor-in-context synthetic prompts)触发隐藏表示的局部投影对齐,从而实现精准、高效的遗忘;同时引入教师蒸馏正则项约束非目标锚点,减少副作用漂移。实验表明,GU在ToFU和UnlearnPII等隐私导向基准上实现了强目标抑制且对非目标性能影响最小。
链接: https://arxiv.org/abs/2605.01735
作者: Chenchen Tan,Xinghao Li,Shujie Cui,Youyang Qu,Cunjian Chen,Longxiang Gao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 6 Figures
Abstract:As large language models (LLMs) are increasingly deployed in real-world systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information about a particular entity or topic while preserving the LLM’s general utility. However, most existing LLM unlearning methods require access to the original training corpus and rely on output-level refusal tuning or broad gradient updates, creating a tension among unlearning strength, non-target preservation, and data availability. We propose Geometric Unlearning (GU), an approach that operates directly on the model’s prompt-time planning states without access to the original training corpus. GU distills a compact, low-rank geometry of desired safe behavior from a small set of safe reference prompts, and uses lightweight anchor-in-context synthetic prompts to trigger localized, projection-based alignment of hidden planning representations to this safe geometry. A teacher-distillation regularizer on synthetic non-target anchors further reduces collateral drift. Across privacy-oriented unlearning benchmarks (ToFU and UnlearnPII), GU achieves strong target suppression with minimal impact on non-target performance, demonstrating that effective unlearning can be achieved with minimal synthetic data.
[NLP-67] EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限环境中部署时面临的计算与内存开销过大的问题,以及现有知识蒸馏方法因对所有token一视同仁而导致的知识传递效率低下和学习效果下降的问题。其解决方案的关键在于提出一种基于熵的自适应蒸馏策略,通过教师模型输出的熵动态调整训练过程:首先设计了一种token级课程学习机制,训练初期聚焦低熵token,后期逐步转向高熵token;其次根据token熵动态调节蒸馏温度以更好捕捉教师模型的置信度模式;最后采用双分支架构,在简单token上进行仅logits蒸馏以提升效率,而在困难token上则使用更深层次的特征蒸馏以增强知识迁移效果。
链接: https://arxiv.org/abs/2605.01732
作者: Hao Zhang,Zhibin Zhang,Guangxin Wu,Wanyi Ning,Jiafeng Guo,Xueqi Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: 25 pages
Abstract:Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a promising solution by transferring knowledge from a large teacher model to a smaller student model. However, existing distillation methods typically treat all tokens equally, ignoring the fact that different tokens contribute unequally to model decisions. This can lead to inefficient knowledge transfer and reduced learning effectiveness. To address this limitation, we propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher’s output entropy to guide three aspects of distillation. Specifically, we introduce a token-level curriculum by dynamically shifting focus from low- to high-entropy tokens during training. We further adjust the distillation temperature based on token entropy to better capture teacher confidence patterns. Moreover, we employ a dual-branch architecture for efficient logits-only distillation on easy tokens and deeper feature-based distillation on difficult tokens. Extensive experiments validate the soundness and effectiveness of our method.
[NLP-68] SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25 Sign Languages
【速读】: 该论文旨在解决当前手语资源在开放世界场景下难以直接支持多语言手语识别、翻译及基于姿态驱动的视频生成任务的问题。现有数据集通常仅提供原始视频与文本的对齐监督,且多源于实验室环境,导致模型对背景或服饰等外观变化敏感,无法适应真实复杂场景;同时,现代基于姿态的生成框架(如使用DWPose作为控制接口)缺乏适配的手语数据资源。解决方案的关键在于构建一个大规模、多语言、以姿态为本原的数据集SignVerse-2M,通过统一的DWPose预处理流程将原始视频转换为2D姿态序列,从而实现对真实世界多样性视频的去外观化表示,并兼容当前主流的姿态驱动建模与生成范式。
链接: https://arxiv.org/abs/2605.01720
作者: Sen Fang,Hongbin Zhong,Yanxin Zhang,Dimitris N. Metaxas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages. Project Page at: this https URL
Abstract:Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. 2. Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.
[NLP-69] CDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis IJCAI2026
【速读】: 该论文旨在解决对话式方面情感四元组分析(DiaASQ)中多轮对话复杂关系建模的问题,现有方法存在结构噪声干扰、忽略对话时序性以及位置编码无法清晰分离词级句法顺序与话语级进展的问题。解决方案的关键在于提出融合线程约束有向无环图(TC-DAG)与话语感知旋转位置嵌入(D-RoPE)的新框架:TC-DAG通过线程约束过滤跨线程噪声,利用根锚定保持全局连通性并引入对话时序;D-RoPE采用双流投影与多尺度频率信号对齐多层语义,借助树状距离捕捉线程依赖,并通过引入话语级进展缓解词级距离稀释问题,从而显著提升模型性能。
链接: https://arxiv.org/abs/2605.01717
作者: Xinran Li,Xinze Che,Yifan Lyu,Zhiqi Huang,Xiujuan Xu
机构: Dalian University of Technology (大连理工大学); Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to IJCAI 2026 (Main Track)
Abstract:Conversational Aspect-based Sentiment Quadruple Analysis (DiaASQ) needs to capture the complex interrelationships in multiple rounds of dialogues. Existing methods usually employ simple Graph Convolutional Networks (GCN), which introduce structural noise and fail to consider the temporal sequence of the dialogues, or use standard RoPE, which implicitly captures relative distances in a flat sequence but cannot clearly separate the token-level syntactic order from the utterance-level progression, and may suffer from the Distance Dilution problem. To address these issues, we propose a new framework that combines Thread-Constrained Directed Acyclic Graph (TC-DAG) and Discourse-Aware Rotary Position Embedding (D-RoPE). Specifically, TC-DAG filters out cross-thread noise based on thread constraints, maintains global connectivity through root anchoring, and incorporates the temporal sequence of the dialogues. D-RoPE aligns multi-layer semantics using dual-stream projection and multi-scale frequency signals, captures thread dependencies using tree-like distances, and alleviates the token-level Distance Dilution problem by incorporating utterance-level progressions. Experimental results on two benchmark datasets demonstrate that our framework achieves state-of-the-art performance.
[NLP-70] he Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
【速读】: 该论文旨在解决多智能体辩论(Multi-agent Debate, MAD)及封闭系统推理中普遍存在的“推理陷阱”(Reasoning Trap)问题,即尽管答案准确性得以维持,但推理过程的质量显著下降,导致生成结果缺乏可解释性和证据支撑。其核心问题是:在基于语言模型的多智能体交互中,如何确保推理链与证据(E)之间保持信息一致性,避免因迭代传递而丢失关键证据信息。解决方案的关键在于提出一个三部分框架——(i)支持忠实度评分(SFS),通过原子级断言验证与证据的匹配性,实现对推理链的分解式评估;(ii)证据 grounded 的苏格拉底式推理(EGSR),以非对抗性、证据驱动的问题引导替代传统辩论逻辑;(iii)定理1(DPI Bound)揭示了在标准MAD下推理链 E−O0−O1−… 满足马尔可夫性质并受数据处理不等式约束,从而理论上证明信息熵随步骤递减,导致推理失真。实证表明,EGSR能有效恢复被MAD破坏的信息完整性,且SFS指标可稳定衡量推理忠实度,为构建可靠、透明的生成式AI推理系统提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2605.01704
作者: Kwan Soo Shin
机构: PolymathMinds AI Lab (PolymathMinds AI 实验室)
类目: Computation and Language (cs.CL)
备注: 23 pages, 18 figures, 4 tables, 126 references. Subtitle: A Falsifiable Theorem, the Multi-Agent-Debate Instantiation, and a Triple Failure of Human Reliability
Abstract:When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other’s outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers. We name the multi-agent case the Debate Trap and the broader phenomenon the Reasoning Trap, offering a programmatic theory of evidence-grounded reasoning this http URL framework has three parts: (i) SFS (Supported Faithfulness Score), a claim-level metric verifying decomposed atomic claims against provided evidence (decomposer-invariant rankings: Spearman rho=1.0); (ii) EGSR (Evidence-Grounded Socratic Reasoning), replacing adversarial argumentation with evidence-grounded inquiry; (iii) Theorem 1 (DPI Bound): under standard MAD, the chain E - O^0 - O^1 - … is Markov, and the Data Processing Inequality implies E[I(E;O^t+1)] = E[I(E;O^t)]. Three companion results – open-system recovery (Theorem 2), EGSR accumulation (Lemma 2), and vote-aggregation floor (Proposition 1) – partition multi-step LLM reasoning by its information-theoretic relationship to E. Across 16 conditions on SciFact (300 claims) and FEVER (1,000 claims), DebateCV (C13) preserves 88% of baseline accuracy while SFS drops 43%; majority-vote MAD (C15) reduces SFS to 1.7% of baseline (p 10^-6, d = -0.96); EGSR recovers 98%. An R6 cohort study (Korean n=10x30 FEVER; English n=3x200 SciFact) finds inter-rater Fleiss kappa = +0.018 with 0.8-1.4 Likert intra-rater shifts across language and domain – the human agreement that faithfulness metrics have been calibrated against is not itself stable. We offer one falsifiable conjecture: any closed-system reasoning protocol preserving Theorem 1’s Markov structure is, in expectation, subject to the same DPI bound.
[NLP-71] BIM Information Extraction Through LLM -based Adaptive Exploration
【速读】: 该论文旨在解决从建筑信息模型(BIM)中提取特定信息的难题,现有方法通常采用静态查询生成策略,即假设BIM数据具有固定的组织结构,但这种假设在面对BIM异构性时往往失效。解决方案的关键在于提出一种新的范式——自适应探索(adaptive exploration),其中基于大语言模型(LLM)的智能体通过迭代执行代码来动态发现BIM模型的运行时结构,而非预先假设其结构。实验表明,该方法在ifc-bench v2基准上显著优于所有静态查询生成配置,证明了在范式层面应对BIM异构性比优化静态方法更为有效。
链接: https://arxiv.org/abs/2605.01698
作者: Sylvain Hellin,Suhyung Jang,Stefan Fuchs,Stavros Nousias,André Borrmann
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to Automation in Construction
Abstract:BIM models provide structured representations of building geometry, semantics, and topology, yet extracting specific information from them remains remarkably difficult. Current approaches translate natural language into structured queries by assuming a fixed data organization (static approach), which BIM heterogeneity eventually invalidates. We address this with a new paradigm, adaptive exploration, where an LLM-based agent iteratively executes code to extract information from a BIM model, discovering its structure at runtime instead of assuming it. We evaluate this approach on ifc-bench v2, an open-source BIM question-answering benchmark introduced alongside this work, comprising 1,027 tasks across 37 IFC models from 21 projects. A factorial ablation across two LLM capability levels and four augmentation strategies shows that adaptive exploration significantly outperforms static query generation across all configurations, regardless of the augmentation strategy. These results indicate that BIM heterogeneity is best addressed at the paradigm level, not by further optimizing static approaches.
[NLP-72] GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory
【速读】: 该论文旨在解决长时对话代理(long-horizon conversational agents)中因记忆系统检索到的片段为非结构化文本而导致复杂推理能力受限的问题。现有方法将检索到的信息直接输入语言模型,缺乏关系、时间与主题结构,难以支持深度推理。解决方案的关键在于提出一种即插即用的结构化记忆模块 GRAVITY(Generation-time Relational Anchoring Via Injected Topological Memory),其在生成阶段从原始对话中提取三类互补的知识表征:基于关系图的实体画像(entity profiles)、链接成因果链条的时间事件元组(temporal event tuples)以及跨会话的主题摘要(cross-session topic summaries),并将其作为结构化锚定上下文注入主机系统的提示词中,从而在不修改模型架构的前提下,有效整合分散证据以形成连贯且与查询相关的上下文,显著提升推理性能。
链接: https://arxiv.org/abs/2605.01688
作者: Yushi Sun,Bowen Cao,Dong Fang,Lingfeng Su,Wai Lam
机构: LIGHTSPEED, Shenzhen, China; The Chinese University of Hong Kong, Hong Kong, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon conversational agents rely on memory systems with increasingly sophisticated retrieval mechanisms. However, retrieved fragments are typically fed to the language model as unstructured text, lacking the relational, temporal, and thematic structures essential for complex reasoning. To bridge this reasoning gap, we introduce GRAVITY (\textbfGeneration-time \textbfRelational \textbfAnchoring \textbfVia \textbfInjected \textbfTopological Memor\textbfY), a plug-and-play structured memory module. GRAVITY extracts three complementary knowledge representations from raw conversational utterances: entity profiles grounded in relational graphs, temporal event tuples linked into causal traces, and cross-session topic summaries. At generation time, it injects these representations into the host system’s prompt as structured anchoring contexts. This approach effectively synthesizes scattered evidence into a coherent, query-relevant context without requiring any architectural modifications to the host model. Extensive evaluations across five diverse memory systems on the LongMemEval and LoCoMo benchmarks demonstrate the efficacy of our approach. On average, GRAVITY improves LLM-judge accuracy by 7.5–10.1%. Gains are inversely correlated with baseline strength: the weakest host improves by 12.2% while the strongest still gains 3.8–5.7%. These findings establish structured context anchoring as a broadly effective, architecture-agnostic augmentation paradigm for long-horizon conversational memory.
[NLP-73] MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety ICML2026
【速读】: 该论文旨在解决现有大语言模型(Large Language Model, LLM)安全评估中多轮越狱攻击(multi-turn jailbreak)基准数据集规模小、多样性不足的问题。当前的多轮越狱评测受限于模板化生成方式,难以模拟真实对话场景下的复杂攻击行为,导致对LLM安全性的评估不够全面。解决方案的关键在于提出一个名为MultiBreak的可扩展且多样化的多轮越狱基准,其核心创新是引入一种基于不确定性驱动的主动学习(active learning)流水线:通过迭代微调生成器以产出更强的对抗性提示,并利用不确定性策略进行高质量样本的持续扩充。该方法显著提升了攻击多样性与有效性,最终构建了包含10,389条多轮对抗提示和2,665种不同有害意图的数据集,在DeepSeek-R1-7B和GPT-4.1-mini上分别实现了比次优数据集高出54.0%和34.6%的攻击成功率(Attack Success Rate, ASR),揭示了多轮场景下潜在但易被忽视的模型脆弱性。
链接: https://arxiv.org/abs/2605.01687
作者: Jialin Song,Xiaodong Liu,Weiwei Yang,Wuyang Chen,Mingqian Feng,Xuekai Zhu,Jianfeng Gao
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026, 25 pages, 11 figures, 13 tables
Abstract:We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making them easier to bypass safety-aligned LLM than single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restrict their diversity. To address this gap, we unify a wide range of harmful jailbreak intents, and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, where a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. Our MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to a 54.0 and 34.6 higher attack success rate (ASR) than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, safety evaluations suggest that diverse attack categories uncover fine-grained LLM vulnerabilities, and categories that appear benign under single-turn can exhibit substantially higher adversarial effectiveness in multi-turn scenarios. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety.
[NLP-74] CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers
【速读】: 该论文旨在解决将自然语言描述的组合优化问题自动转化为可执行约束规划(Constraint Programming, CP)模型这一关键瓶颈问题,尤其针对大型语言模型(Large Language Models, LLMs)在缺乏测试时oracle验证的情况下易产生语义错误的局限性。其解决方案的核心在于提出CP-SynC(Constraint Programming modeling with Synthesized Checkers),一个基于多智能体协同的零样本建模工作流:通过建模代理生成并迭代优化候选模型,验证代理自动生成语义检查器以提供正确性反馈,并借助选择代理对多条建模路径进行证据聚合,从而有效降低单个LLM输出噪声的影响,显著提升MiniZinc中的约束建模准确性与鲁棒性。
链接: https://arxiv.org/abs/2605.01675
作者: Yuliang Song,Eldan Cohen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Constraint Programming (CP) is a powerful paradigm for solving combinatorial problems, yet translating natural language problem descriptions into executable models remains a significant bottleneck. While Large Language Models (LLMs) show promise in automating this translation, they often struggle with subtle semantic errors in the absence of oracle validation at test time. To address this, we introduce CP-SynC (Constraint Programming modeling with Synthesized Checkers), a multi-agent workflow for zero-shot constraint modeling in MiniZinc. CP-SynC coordinates modeling agents that generate and refine candidate models and validation agents that synthesize semantic checkers to provide feedback on semantic correctness. To mitigate noise inherent in individual LLM outputs, CP-SynC explores multiple modeling trajectories in parallel and employs selection agents to select the final model via multi-agent evidence aggregation. Extensive experiments on a benchmark of 100 CP problems show that CP-SynC substantially outperforms existing baselines in MiniZinc modeling.
[NLP-75] Beyond Perplexity: Character Distribution Signatures and the MDTA Benchmark for AI Text Detection
【速读】: 该论文旨在解决当前训练-free AI文本检测方法性能受限的问题,这类方法主要依赖模型的词元概率(log-probabilities),但其效果存在理论上限,因为大语言模型在RLHF(Reinforcement Learning from Human Feedback)优化下已趋向生成与人类相似的概率分布。为此,作者提出一种基于字符分布特征(character distribution signatures)的新检测信号,其核心在于利用AI模型在大规模领域均衡语料上训练时逼近全局字符模式、而人类文本则表现出领域特异性分布的差异——形成所谓的“分离墙”(Wall of Separation)。关键创新包括:构建包含4种模型、5个领域、3种温度设置和3种对抗策略的MDTA基准数据集(共642,274个样本),并提出字母分布得分(Letter Distribution Score, LD-Score),该指标与困惑度类方法相关性极低(r = 0.08–0.13),且通过非线性分类器融合至DNA-DetectLLM、Binoculars和FastDetectGPT后,在AUROC和F1指标上均取得稳定提升,尤其在词汇受限的专业领域表现显著增强。
链接: https://arxiv.org/abs/2605.01647
作者: Priyadarshan Narayanasamy,Swastik Agrawal,Klint Faber,Fardina Fathmiul Alam
机构: University of Maryland, College Park(马里兰大学学院公园分校)
类目: Computation and Language (cs.CL)
备注: 11 figures, 10 tables, 24 pages, Under Review at COLM 2026
Abstract:Training-free AI text detection methods primarily rely on model log-probabilities, achieving strong performance through approaches like Binoculars and DNA-DetectLLM. However, these methods face a fundamental ceiling as models are optimized through RLHF to produce human-like probability distributions. We introduce an alternative detection signal based on character distribution signatures. We provide theoretical foundations showing that AI models, trained on massive domain-balanced corpora, approximate global character patterns while humans exhibit domain-specialized distributions, creating a “Wall of Separation” where human-AI divergence significantly exceeds AI-AI divergence. To enable systematic evaluation, we construct the Models-Domains-Temperatures-Adversarials (MDTA) benchmark comprising 642,274 prompt-aligned samples across 4 models, 5 domains, 3 temperature settings, and 3 adversarial strategies, substantially expanding the HC3 dataset with modern model responses, temperature variation, and adversarial augmentation. We introduce the Letter Distribution Score (LD-Score), demonstrating low correlation (r = 0.08-0.13) with perplexity methods. When integrated with DNA-DetectLLM, Binoculars and FastDetectGPT via a non-linear classifier, LD-Score yields consistent improvements in AUROC and F1, with particularly pronounced gains in specialized domains where vocabulary constraints amplify the detection signal. The MDTA dataset can be accessed at: this https URL.
[NLP-76] Prescriptive Scaling Laws for Data Constrained Training
【速读】: 该论文旨在解决在训练计算资源(training compute)日益超越高质量数据可用性背景下,如何在数据受限条件下最大化模型性能的问题。传统Chinchilla scaling law假设所有训练token均唯一,难以指导数据稀缺场景下的预训练决策。其解决方案的关键在于引入一个简单的加性过拟合惩罚项(additive overfitting penalty),用于建模重复数据带来的额外损失,并据此提出新的缩放定律。该定律揭示了在某一临界点后继续重复数据反而有害,此时将计算资源投入模型容量提升更为有效,从而提供了一种全新的、计算最优的资源配置建议。此外,该单参数形式可直接比较不同训练配置下的过拟合程度,例如发现强权重衰减(λ=1.0)能使过拟合系数降低约70%,为数据受限场景下最优权重衰减远高于常规实践提供了缩放律解释。
链接: https://arxiv.org/abs/2605.01640
作者: Justin Lovelace,Christian Belardi,Srivatsa Kundurthy,Shriya Sudhakar,Kilian Q. Weinberger
机构: Cornell University (康奈尔大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law’s recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay ( \lambda=1.0 ) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
[NLP-77] Prosa: Rubric-Based Evaluation of LLM s on Real User Chats in Brazilian Portuguese
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用整体式评分(holistic scoring)进行排名时对评判模型(judge model)偏倚敏感的问题,这导致评估结果不稳定且缺乏可复现性。解决方案的关键在于采用基于二元评分标准(binary rubric scoring)并结合多评判者过滤(multi-judge filtering)的机制:通过将评判任务分解为多个细粒度维度,并由多个不同来源的模型共同筛选评分,显著降低对单一评判模型的依赖,从而提升排名的一致性和判别能力。实验表明,在巴西葡萄牙语多轮对话基准 Prosa 上,该方法使三位来自不同模型家族的评判者在所有 16 个模型的排名上达成一致,而传统整体评分仅在 7 个模型上达成共识,同时评分间隔平均扩大 47%,增强了评估的区分度。
链接: https://arxiv.org/abs/2605.01630
作者: Roseval Malaquias Junior,Giovana Kerche Bonás,Thales Sales Almeida,Hugo Abonizio,Thiago Laitz,Ramon Pires,Marcos Piau,Celio Larcher,Rodrigo Nogueira
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real user multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations scored by three judges from three model families on 16 models. Under filtered rubric scoring the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa’s discriminative power. Evaluating a new model on Prosa costs approximately 2.1 when using Gemini 3 Flash as the judge. We release the benchmark and the filtering code to ensure that future models can be assessed under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.
[NLP-78] Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)对提示(prompt)微小扰动敏感的问题,尤其关注现有鲁棒性方法通常仅在全序列层面强制一致性,从而可能忽略关键实体、关系或结论上的语义漂移这一重要失效模式。解决方案的关键在于提出一种分段级(segment-level)鲁棒性框架 S²R²,其核心包括:1)将干净与扰动生成结果分解为语义片段,并通过最优传输(optimal transport)目标进行对齐;2)惩罚语义漂移最大的片段,实现细粒度的鲁棒优化;3)引入基于分段注意力重分配动机的适配器稳定性正则项,利用LoRA范数控制作为扰动放大证据偏移的可计算代理;4)从PAC-Bayesian复杂度视角解释为何限制适配器增长有助于提升对未观测扰动的泛化能力。
链接: https://arxiv.org/abs/2605.01605
作者: Zhuoyun Li,Boxuan Wang,Jinwei Hu,Zhenglin Huang,Qisong He,Xinmiao Huang,Guangliang Cheng,Xiaowei Huang,Yi Dong
机构: University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Large language models are sensitive to minor prompt perturbations, yet existing robustness methods usually enforce consistency at the whole-sequence level. This holistic view can hide an important failure mode: a perturbed response may remain globally similar to the clean one while drifting on a critical entity, relation, or conclusion. We introduce S ^2 R ^2 , a segment-level framework for robust LoRA fine-tuning. S ^2 R ^2 decomposes clean and perturbed generations into semantic segments, aligns them with an optimal-transport objective, and penalises the segments with the largest meaning drift. To connect this output-side objective with model adaptation, we add an adapter-stability regulariser motivated by segment-level attention reallocation, using LoRA norm control as a tractable proxy for limiting perturbation-amplified evidence shifts. A PAC-Bayesian complexity view further explains why controlling adapter growth may support transfer beyond observed perturbations. Experiments on summarisation benchmarks show that S ^2 R ^2 improves robustness under typographical noise, deletion, synonym replacement, and paraphrasing, while maintaining competitive clean performance and stronger cross-dataset transfer than consistency-based baselines.
[NLP-79] Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection SEMEVAL-2026
【速读】: 该论文旨在解决人工智能生成代码(AI-generated code)的检测与来源归属问题,具体包括两个子任务:一是二分类判断代码是否由人类编写或由AI生成(Subtask-A),二是对11类不同生成模型进行代码归属分类(Subtask-B)。解决方案的关键在于针对不同子任务设计差异化的优化策略:对于Subtask-A,采用留一语言交叉验证、代码增强、分块推理与截尾均值聚合以及困难数据阈值校准;对于Subtask-B,则引入夹心token打包(sandwich token packing)、类别平衡损失函数和多种子集成结合测试时增强(test-time augmentation)。这些方法显著提升了模型在检测准确性和模型归属判别上的性能表现。
链接: https://arxiv.org/abs/2605.01596
作者: Jany-Gabriel Ispas,Sergiu Nisioi
机构: University of Bucharest (布加勒斯特大学)
类目: Computation and Language (cs.CL)
备注: Archaeology at SemEval-2026 Task 13
Abstract:This paper describes the system submitted by team \textbfArchaeology to SemEval-2026 Task~13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs.\ AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A (6th/81 teams) and 0.422 on Subtask-B (7th/34 teams).
[NLP-80] Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents : A Safety-Gated MCP Architecture
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)驱动的代码生成代理在长期软件工程任务中因静态向量存储或通用检索增强生成(Retrieval-Augmented Generation, RAG)机制不足而导致的记忆失效问题,尤其在RL代码开发场景下,微小细节可能影响贝尔曼目标(Bellman targets)、终止掩码(terminal masks)、梯度流动或验证声明。解决方案的关键在于提出一种本地优先、基于Model Context Protocol (MCP) 原生的“强化学习开发者记忆”(RL Developer Memory)架构,其核心机制包括:通过issue_match对候选记忆进行排序并记录遥测数据,issue_feedback将原始标签映射为有界奖励,以及issue_record_resolution将已验证的解决方案与早期检索事件关联;同时引入确定性排序器与上下文Bandit残差策略的协同设计,在影子模式下运行并通过保守的离策略评估(Off-Policy Evaluation, OPE)门控控制行为变更,从而实现可审计、具显式断言边界的记忆控制结构,而非通用编码代理性能提升。
链接: https://arxiv.org/abs/2605.01567
作者: Mehmet Iscan
机构: PythaLab(PythaLab); Yildiz Technical University (伊斯坦布尔理工大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 25 pages, 5 figures, 7 tables. Preprint. Implementation and supplementary artifacts are available at the project repository
Abstract:Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic retrieval-augmented generation (RAG) are insufficient for reinforcement-learning (RL) code development, where small details can alter Bellman targets, terminal masks, gradient flow, or validation claims. This paper presents RL Developer Memory, a local-first, Model Context Protocol (MCP)-native developer-memory architecture for RL coding agents. It treats memory selection as a logged contextual decision process: issue_match ranks candidates and records telemetry, issue_feedback maps raw labels to bounded rewards, and issue_record_resolution links verified resolutions to earlier retrieval events. A deterministic ranker remains deployed, while a contextual-bandit residual policy runs in shadow mode and can affect canary behavior only through conservative off-policy-evaluation (OPE) gates. RL/control memories require theory-to-code metadata and review-gated governance. The system is evaluated on a deterministic 200-case benchmark with RL algorithm bugs, hard negatives, review-gated RL/control cases, and low-risk failures. In the same-commit comparison, deterministic control and full shadow/OPE both achieve 80.0% expected-decision accuracy and 100.0% hard-negative suppression; the full configuration adds learning telemetry rather than accuracy gain. Static validation passed 11/11 checks; dynamic integration passed 10/10 cases. The evidence reports limits: active learned-policy deployment and official-client MCP interoperability are unsupported, live full-configuration latency regresses, and 40 residual non-RL failures remain. The contribution is an auditable memory-control architecture with explicit claim boundaries, not a universal coding-agent improvement claim.
[NLP-81] he grip of grammar on meaning uncertainty: cross-linguistic evidence neural correlates and clinical relevance
【速读】: 该论文旨在解决语言中词义不确定性如何通过语法结构得到压缩与调节的问题,即探讨语法在跨语言层面上如何降低孤立词义的模糊性,并揭示其神经基础及在语言障碍中的异常表现。解决方案的关键在于提出并量化“不确定性压缩”(uncertainty reduction)这一概念:通过比较基于词汇频率的非上下文意外度(non-contextual surprisal)与基于语法敏感模型的上下文意外度(contextual surprisal),发现语法显著降低了词义不确定性;该压缩效应在叙事文本中普遍存在,且与句法复杂性和语义丰富性正相关,并在fMRI中映射到理解与产出任务中重叠但分离的脑区活动;更重要的是,这种不确定性压缩在失语症、痴呆和精神分裂症中显著受损,而在非语言主导缺陷的疾病中保持完整,从而将语法驱动的不确定性压缩确立为语言处理的核心机制及其神经基础。
链接: https://arxiv.org/abs/2605.01537
作者: Rui He,Claudio Palominos,Samuele Vallisa,Ni Yang,Han Zhang,Miguel Ángel Santos Santos,Neguine Rezaii,Sergi Valero,Yonghua Huang,Huan Li,Hong Jiang,Yongjun Peng,Maria Francisca Alonso-Sánchez,Frederike Stein,Tilo Kircher,Philipp Homan,Iris Sommer,Lena Palaniyappan,Wolfram Hinzen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Isolated word meanings are inherently uncertain. This uncertainty reduces when they are combined and anchored in context. We propose that grammar compresses meaning uncertainty cross-linguistically, which is reflected in brain and selectively disrupted in disorders. Compression was operationalized as the relative difference between non-contextual surprisal estimated from lexical frequency, and contextual surprisal from grammar-sensitive models. In narratives from 20 languages, contextual surprisal reduced frequency-based surprisal. This reduction closely tracked the surprisal cost of reversing word order, and scaled with richer, non-redundant lexis as organized by more complex but optimal dependency structure. During fMRI, surprisal and its reduction explained BOLD activity for comprehension and production in overlapping but distinct regions. Uncertainty reduction was significantly attenuated in aphasia, dementia, and schizophrenia, but remained intact where primary deficit is not language. These findings position uncertainty reduction via grammar as a foundational concept that illuminates principles, brain basis, and disruptions of language.
[NLP-82] MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在复杂推理任务中因视觉感知错误和幻觉导致的答案准确性下降问题。现有基于可验证奖励的强化学习方法(Reinforcement Learning with Verifiable Rewards, RLVR)存在两个关键局限:一是大量采样预算被浪费在早期视觉描述错误即注定失败的轨迹上;二是稀疏奖励无法区分失败是源于视觉感知阶段还是推理阶段。解决方案的关键在于提出一种解耦框架MIRL,其核心创新是利用生成描述与视觉输入之间的互信息(Mutual Information, MI)作为低成本预筛选信号,实现智能预算分配——通过分叉策略聚焦高潜力轨迹,并通过解耦训练分别提供基于MI的奖励以优化视觉感知模块,从而克服奖励盲区问题。
链接: https://arxiv.org/abs/2605.01520
作者: Yin Zhang,Jiaxuan Zhao,Zonghan Wu,Zengxiang Li,Junfeng Fang,Kun Wang,Qingsong Wen,Yilei Shao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: this https URL.
[NLP-83] FT-RAG : A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理结构化表格数据时表现不佳的问题,其核心瓶颈在于检索粒度粗粒度和表格语义理解不足。解决方案的关键在于提出FT-RAG框架,通过将表格分解为细粒度的语义单元构建结构化图谱,并引入结构邻域扩展机制以识别语义关联实体,再结合多模态融合策略整合表格检索结果上下文,从而实现更精准的事实性推理。此外,作者还构建了Multi-Table-RAG-Lib基准数据集,用于推动复杂跨表推理任务的发展。
链接: https://arxiv.org/abs/2605.01495
作者: Zebin Guo,Weidong Geng,Ruichen Mao
机构: Georgia Institute of Technology (佐治亚理工学院); Zhejiang University (浙江大学); Zhejiang Lab (浙江实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding responses in external knowledge during inference. However, conventiona RAG systems under-perform on structured tabular data, largely due to coarse retrieval granularity and insufficient table semantic comprehension. To address these limitations, we introduce FT-RAG, a fine-grained framework that employs knowledge association by decomposing tables into entry-level semantic units to construct a structured graph. FT-RAG employs a structural neighbor expansion mechanism to find semantically connected entities during graph retrieval, followed by multi-modal fusion to consolidate the context of table retrieval results. Further, to address the scarcity of specialized datasets in this domain, we introduce Multi-Table-RAG-Lib, a benchmark comprising 9870 QA pairs with high complexity and difficulty, curated to demand multi-table integration and text-table information fusion for reasoning. FT-RAG surpasses top-performing baselines across all metrics, achieving a 23.5% and 59.2% improvement in table-level and cell-level Hit Rates, respectively. Generation performance also sees a remarkable 62.2% increase in exact value accuracy recall. These metrics verify the framework’s effectiveness in factual grounding across both pure tabular and heterogeneous table-text contexts. Therefore, our method establishes a new state-of-the-art performance for complex reasoning over mixed-modality documents.
[NLP-84] SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
【速读】: 该论文旨在解决当前AI代理在前沿科学推理任务中面临的瓶颈问题,即现有方法依赖于结构化知识图谱或迭代网络浏览构建数据,难以应对领域知识分散、异构且需复杂计算与推理的前沿科研场景。解决方案的关键在于提出SciResearcher框架,这是一个全自动化的智能体数据构建系统,能够融合学术证据驱动的概念性与计算性任务,同时激发信息获取、工具集成推理和长周期规划能力;通过该框架生成的数据进行监督微调与智能体强化学习训练,最终得到的SciResearcher-8B模型在多个前沿科学基准测试中显著优于同类模型,标志着自动化数据构建新范式的实现。
链接: https://arxiv.org/abs/2605.01489
作者: Tianshi Zheng,Rui Wang,Xiyun Li,Yangqiu Song,Tianqing Fang
机构: HKUST(香港科技大学); CUHK(香港中文大学); Tencent AI Lab(腾讯人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 6 figures, 5 tables
Abstract:Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.
[NLP-85] ReMedi: Reason er for Medical Clinical Prediction ACL2026
【速读】: 该论文旨在解决从电子健康记录(Electronic Health Records, EHR)中预测未来临床结局的挑战,这一任务因患者数据的复杂性和异质性而尤为困难。现有方法主要依赖大语言模型(Large Language Models, LLMs)内部对上下文信息的理解能力,并通过知识蒸馏或检索增强生成(Retrieval-Augmented Generation, RAG)来增强医学知识,但未能充分挖掘模型在推理过程中的可解释性与准确性。解决方案的关键在于提出 ReMedi 框架,其核心创新是利用真实结局标签作为提示(hint),通过一个具有挑战性的样本再生机制生成“理由-答案”配对(rationale-answer pairs),并将真实结果引导整合进偏好数据构建循环中,从而提升模型在细粒度推理和预测性能上的表现。实验表明,该方法在多个 EHR 预测任务上相较最先进基线提升了最高达 19.9% 的 F1 分数,验证了其在真实临床场景下的有效性。
链接: https://arxiv.org/abs/2605.01474
作者: Yushi Cao,Yiming Chen,Hongchao Jiang,Hung-yi Lee,Robby T. Tan
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2026 findings
Abstract:Predicting future clinical outcomes from electronic health records (EHR) remains challenging due to the complexity and heterogeneity of patient data. LLMs have shown strong potential for such predictive tasks, yet existing approaches mainly focus on enhancing medical knowledge through distillation or RAG while relying on the model’s internal ability to interpret contextual information. In this work, we present ReMedi (Reasoner for Medical Clinical Prediction), a framework for improving clinical outcome prediction from EHR. ReMedi generates rationale-answer pairs using a challenging sample regeneration mechanism for complex clinical questions, which leverages ground-truth answers as hints to enhance reasoning for further fine-tuning and preference tuning. ReMedi integrates ground-truth outcome guidance into the preference data construction loop, regenerating rationale-answer variants. By tuning on these rationale-answer pairs, the model improves its predictive performance. Experiments on multiple EHR prediction tasks demonstrate substantial gains of up to 19.9 percent over state-of-the-art baselines in terms of F1 score, underscoring ReMedi’s effectiveness in real-world clinical prediction.
[NLP-86] Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models ATC
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在公共安全高风险场景下——如紧急呼叫分诊与调度决策支持系统中——存在的种族、性别和宗教外观等人口统计学特征偏见问题,此类偏见尚未得到充分评估。其解决方案的关键在于提出一个跨语言审计框架,将警察优先调度系统建模为五级序数分类任务,并采用受控的最小差异对设计(minimal-pair design),以隔离不同人口统计学线索的影响;该框架在19,800次模型输出中验证了偏见在情境模糊时系统性出现、在明确优先级时消失的现象,且发现偏见强度因人口维度(宗教外观 > 性别 > 种族)和语言(英语 vs. 汉语)而异,揭示出语言间的不对称性,从而强调偏见是模型、上下文模糊性和语言交互作用的结果,而非模型固有属性。此框架为部署机构提供了可扩展的审计基础设施,可用于在真实应用前评估候选模型在本地场景中的公平性表现。
链接: https://arxiv.org/abs/2605.01451
作者: William Guey,Wei Zhang,Pierrick Bougault,Yi Wang,Bertan Ucar,Vitor D. de Moura,José O. Gomes
机构: 未知
类目: Computation and Language (cs.CL)
备注: 26 pages, 7 figures. Submitted to Humanities and Social Sciences Communications (Nature) collection on Artificial Intelligence and Emerging Technologies in Public Safety. Code and data: this https URL
Abstract:Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.
[NLP-87] Hallucinations Undermine Trust; Metacognition is a Way Forward ICML2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)在事实准确性方面仍存在的“幻觉”问题,即模型在回答事实性问题时输出错误信息的现象。尽管前沿大语言模型(LLM)在知识覆盖范围上不断扩展,但其对自身知识边界的认知能力仍未显著提升,导致即使在简单问答场景中仍频繁产生不准确回答。论文指出,当前改进主要依赖于增加模型的知识容量,而非增强其识别已知与未知的能力,而后者本质上存在难以克服的局限性,形成消除幻觉与保持实用性的权衡。解决方案的关键在于引入“忠实不确定性”(faithful uncertainty)——将语言层面的不确定性表达与内在的认知不确定性相一致,从而实现元认知(metacognition):让模型能够意识到自身的不确定状态并据此行动。这一机制不仅适用于直接交互中的诚实沟通,也构成代理系统中决策控制层的核心,使模型在保持能力的同时具备可信赖性。
链接: https://arxiv.org/abs/2605.01428
作者: Gal Yona,Mor Geva,Yossi Matias
机构: Tel Aviv University (特拉维夫大学)
类目: Computation and Language (cs.CL)
备注: To appear in ICML 2026 (Position Track)
Abstract:Despite significant strides in factual reliability, errors – often termed hallucinations – remain a major concern for generative AI, especially as LLMs are increasingly expected to be helpful in more complex or nuanced setups. Yet even in the simplest setting – factoid question-answering with clear ground truth-frontier models without external tools continue to hallucinate. We argue that most factuality gains in this domain have come from expanding the model’s knowledge boundary (encoding more facts) rather than improving awareness of that boundary (distinguishing known from unknown). We conjecture that the latter is inherently difficult: models may lack the discriminative power to perfectly separate truths from errors, creating an unavoidable tradeoff between eliminating hallucinations and preserving utility. This tradeoff dissolves under a different framing. If we understand hallucinations as confident errors – incorrect information delivered without appropriate qualification – a third path emerges beyond the answer-or-abstain dichotomy: expressing uncertainty. We propose faithful uncertainty: aligning linguistic uncertainty with intrinsic uncertainty. This is one facet of metacognition – the ability to be aware of one’s own uncertainty and to act on it. For direct interaction, acting on uncertainty means communicating it honestly; for agentic systems, it becomes the control layer governing when to search and what to trust. Metacognition is thus essential for LLMs to be both trustworthy and capable; we conclude by highlighting open problems for progress towards this objective. Comments: To appear in ICML 2026 (Position Track) Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.01428 [cs.CL] (or arXiv:2605.01428v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.01428 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-88] Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks
【速读】: 该论文旨在解决当前医疗领域大语言模型(Large Language Models, LLMs)评估中存在的三大挑战:基准测试饱和、数据获取受限以及任务覆盖不全。现有评估套件或已趋于饱和,或高度依赖受限数据集,或缺乏对关键医疗任务的全面覆盖。其解决方案的关键在于提出一个完全开源的评估套件Medmarks,包含30个基准测试,涵盖问答、信息抽取、医学计算和开放式临床推理四大类任务,并通过可验证指标与LLM-as-a-Judge方法对61个模型(共71种配置)进行系统性评估。结果表明,前沿推理模型在各项任务中表现最优,且医学微调模型显著优于通用模型,同时揭示了模型对答案顺序偏差的敏感性,为后续基于强化学习的后训练提供了可直接使用的子集(Medmarks-T)。
链接: https://arxiv.org/abs/2605.01417
作者: Benjamin Warner,Ratna Sagari Grandhi,Max Kieffer,Aymane Ouraq,Saurav Panigrahi,Geetu Ambwani,Kunal Bagga,Nikhil Khandekar,Arya Hariharan,Nishant Mishra,Manish Ram,Shamus Sim Zi Yang,Ahmed Essouaied,Adepoju Jeremiah Moyondafoluwa,Robert Scholz,Bofeng Huang,Molly Beavers,Srishti Gureja,Anish Mahishi,Sameed Khan,Maxime Griot,Hunar Batra,Jean-Benoit Delbrouck,Siddhant Bharadwaj,Ronald Clark,Ashish Vashist,Anas Zafar,Leema Krishna Murali,Harsh Deshpande,Ameen Patel,William Brown,Johannes Hagemann,Connor Lane,Paul Steven Scotti,Tanishq Mathew Abraham
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: website: this https URL
Abstract:Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at this https URL
[NLP-89] Who Decides What Is Harmful? Content Moderation Policy Through A Multi-Agent Personalised Inference Framework
【速读】: 该论文旨在解决在线平台内容规模与复杂性增长背景下,传统集中式、自上而下的内容审核机制难以适应用户对危害感知的主观差异这一关键问题。其解决方案的核心在于提出一种基于大语言模型(Large Language Model, LLM)的多智能体个性化推理框架,通过构建包含领域专家智能体(Expert Agents)、管理智能体(Manager Agent)和虚拟身份模拟智能体(Ghost Profile Agent)的协同架构,实现基于个体敏感性特征的内容过滤与决策优化,从而在技术层面显著提升审核准确性(最高达32%),并在政策层面为平台治理提供可扩展的个性化数字权利保障路径。
链接: https://arxiv.org/abs/2605.01416
作者: Ewelina Gajewska,Michal Wawer,Katarzyna Budzynska,Jaroslaw A. Chudziak
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: The paper has been accepted to the 34th European Conference on Information Systems (ECIS 2026). The official paper version will appear in the conference proceedings
Abstract:The increasing scale and complexity of online platforms raises critical policy questions around harmful content, digital well-being, and user autonomy. Traditional content moderation systems rely on centralised, top-down rules, often failing to accommodate the subjective nature of harm perception. This paper proposes an LLM-based multi-agent personalised inference framework that filters content based on unique sensitivity profiles of individual users. Our architecture combines domain-specific Expert Agents, a Manager Agent for orchestrating content analysis and agent selection, and a Ghost Profile Agent for simulating user perspectives, to inform moderation decisions. Evaluated against a range of non-personalised baselines, the system demonstrates up to a 32% improvement in accuracy, showing increased alignment with individual user sensitivities. Beyond technical performance, our framework provides policy-relevant insights for platform governance, providing a scalable way to reconcile moderation policies with societal and individual digital rights
[NLP-90] Injecting Distributional Awareness into MLLM s via Reinforcement Learning for Deep Imbalanced Regression ICML2026
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长尾目标分布下的数值回归任务中表现不佳的问题。现有方法如基于token的监督微调(Supervised Fine-Tuning, SFT)和点级回归奖励会偏向高密度区域,导致回归均值偏差(regression-to-the-mean)现象,从而损害尾部区域的预测性能。其关键限制在于缺乏跨样本的相对关系监督(cross-sample relational supervision)。为此,作者提出一种基于组相对策略优化(Group Relative Policy Optimization)的分布感知强化学习框架,通过引入基于一致性相关系数(Concordance Correlation Coefficient, CCC)的批级比较奖励,使模型预测分布与真实分布在相关性、尺度和均值上对齐。该方案无需修改模型架构,即可在统一的长尾回归基准测试中显著优于SFT及现有MLLM回归方法,尤其在中等和少样本场景下提升明显。
链接: https://arxiv.org/abs/2605.01402
作者: Yao Du,Shanshan Li,Xiaomeng Li
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICML 2026
Abstract:Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
[NLP-91] MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents ACL
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期个性化对话中缺乏持久记忆的问题,同时克服现有基于图的记忆系统中存在的信息稀释、溯源追踪缺失以及统一检索忽略查询上下文等缺陷。其解决方案的关键在于提出MemORAI框架,通过三项核心创新实现:一是采用双层压缩的择优记忆过滤机制以保留与用户人格相关的高价值内容;二是构建富含溯源信息的多关系图结构,在回合级别追踪事实来源;三是引入基于动态加权PageRank的查询自适应子图检索方法,实现查询条件驱动的边权重调整,从而提升检索的相关性与个性化程度。
链接: https://arxiv.org/abs/2605.01386
作者: Hung Pham Van,Nguyen Manh Hieu,Khang Pham Tran Tuan,Nam Le Hai,Linh Ngo Van,Nguyen Thi Ngoc Diep,Trung Le
机构: Hanoi University of Science and Technology (河内科技大学); VNU University of Engineering and Technology (越南国家大学工程与技术学院); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注: ACL Findings
Abstract:Large Language Models (LLMs) lack persistent memory for long-term personalized conversations. Existing graph-based memory systems suffer from information dilution, absent provenance tracking, and uniform retrieval that ignores query context. We introduce MemORAI (Memory Organization and Retrieval via Adaptive Graph Intelligence), a framework that integrates three innovations: selective memory filtering with dual-layer compression to retain user-persona-relevant content, a provenance-enriched multi-relational graph tracking factual origins at the turn level, and query-adaptive subgraph retrieval with Dynamic Weighted PageRank that applies query-conditioned edge weighting. Evaluated on LOCOMO and LongMemEval benchmarks, MemORAI achieves state-of-the-art performance in memory retrieval and personalized response generation, demonstrating that selective storage, enriched representation, and adaptive retrieval are essential for coherent, personalized LLM agents.
[NLP-92] A framework for analyzing concept representations in neural models CONLL2026
【速读】: 该论文旨在解决神经模型中人类可解释概念的表征问题,特别是如何系统性地分析和评估概念子空间(concept subspace)的性质。其核心挑战在于现有方法在定义、识别和验证这些子空间时缺乏统一标准,导致结果不一致且难以比较。解决方案的关键在于提出一个统一框架,从两个维度——包含性(containment)与解耦性(disentanglement)——来量化和评估概念子空间的质量:包含性检验某一概念是否仅存在于特定子空间内而不在外部表示中体现,解耦性则衡量该概念与其他概念的分离程度。通过这一框架,作者对五种不同社区提出的估计器进行了系统比较,揭示了不同方法在上述两个维度上的表现差异,并指出当前最优的概念擦除方法LEACE虽在测试指标上表现良好,但在未见数据上仍存在泛化不足的问题,从而为未来研究提供了明确的方向。
链接: https://arxiv.org/abs/2605.01381
作者: Burin Naowarat,Hao Tang,Sharon Goldwater
机构: The Centre for Speech Technology Research; School of Informatics, University of Edinburgh, United Kingdom
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: CoNLL 2026
Abstract:Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to study these subspaces along two axes: \textitcontainment, which tests if a concept is fully represented in a subspace but not outside it, and \textitdisentanglement, which tests for isolation from other concepts. In experiments on both text and speech models, we first highlight that concept subspaces may not be uniquely determined, and discuss the implications for concept subspace analysis. Then, we compare properties of concept subspaces estimated using five estimators, proposed in different communities. We find that (1) the choice of estimator impacts the containment and disentanglement properties; (2) the state-of-the-art concept erasure method, LEACE, performs well on both testing axes, but still struggles to generalize to unseen data; and (3) in HuBERT speech representations, phone information is both contained and disentangled from speaker information, while speaker information is hard to contain in a compact subspace, despite being disentangled from phones.
[NLP-93] MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation ACL2026
【速读】: 该论文旨在解决现有知识蒸馏方法在压缩大语言模型(Large Language Models, LLMs)时,仅对固定层或词粒度输出进行对齐、忽视表示随深度演化过程的问题,导致学生模型难以有效捕捉教师模型内部的结构关系,从而限制了知识迁移效率。其解决方案的关键在于提出多粒度轨迹对齐(Multi-Granular Trajectory Alignment, MTA)框架,通过层自适应策略实现教师与学生模型在层间变换轨迹上的对齐:低层以词级对齐保留词汇信息,高层以短语级跨度(如名词短语和动词短语)对齐捕获组合语义;并通过动态结构对齐损失(Dynamic Structural Alignment loss)匹配每层内语义单元间的相对几何关系,从而更精准地传递Transformer表示从低层到高层的抽象演化路径。
链接: https://arxiv.org/abs/2605.01374
作者: Pham Khanh Chi,Quoc Phong Dao,Thuat Nguyen,Linh Ngo Van,Trung Le,Thanh Hong Nguyen
机构: Hanoi University of Science and Technology (河内科技大学); Monash University (莫纳什大学); University of Oregon (俄勒冈大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026
Abstract:Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result, the student is only weakly guided to capture the teacher’s internal relational structure during distillation, which limits knowledge transfer. To address this limitation, we propose Multi-Granular Trajectory Alignment (MTA), a framework that aligns teacher and student representations along their layer-wise transformation trajectory. MTA adopts a layer-adaptive strategy: lower layers are aligned at the word level to preserve lexical information, while higher layers operate on phrase-level spans (e.g., noun and verb phrases) to capture compositional semantics. We instantiate this idea through a Dynamic Structural Alignment loss that matches the relative geometry among semantic units within each layer. This design is motivated by empirical findings that Transformer representations become increasingly abstract with depth, and is also consistent with linguistic views in which higher-level meaning emerges through the composition of lower-level lexical units. We further incorporate a Hidden Representation Alignment loss to directly align selected teacher-student layers. Experiments show that MTA consistently outperforms state-of-the-art baselines on standard benchmarks, with ablations confirming the contribution of each component.
[NLP-94] Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, DLMs)在解码过程中未能有效利用其全局上下文建模能力的问题,尤其是现有策略因局部偏好而忽略上下文中的信息密度异质性,从而导致生成质量下降。解决方案的关键在于识别并利用高信息密度(High-Information-Density, HD)token:研究发现,显式地以HD token为条件可显著提升输出质量,且HD token具有早期收敛特性;基于此,作者提出无需训练的解码策略FoCore,通过自对比机制将HD token临时设为负样本以引导生成过程;进一步地,引入高效变体FoCore_A,在检测到HD token收敛后,对局部窗口内的稳定候选进行并行解码,大幅加速生成过程。
链接: https://arxiv.org/abs/2605.01373
作者: Jinyuan Feng,Xin Yu,Yiqun Chen,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Zhiqiang Pu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Renmin University of China (中国人民大学); Xiaohongshu Inc. (小红书公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The iterative denoising paradigm of Diffusion Large Language Models (DLMs) endows them with a distinct advantage in global context modeling. However, current decoding strategies fail to leverage this capability, typically exhibiting a local preference that overlooks the heterogeneous information density within the context, ultimately degrading generation quality. To address this limitation, we systematically investigate high-information-density (HD) tokens and present two key findings: (1) explicitly conditioning on HD tokens substantially improves output quality; and (2) HD tokens exhibit an early-decoding tendency, converging earlier than surrounding tokens. Motivated by these findings, we propose Focus on the Core \textbf(FoCore), a training-free decoding strategy that utilizes HD tokens in a self-contrast manner, wherein HD tokens are temporarily remasked as negative samples, to guide generation. We further introduce FoCore_Accelerate \textbf(FoCore_A), an efficient variant that, upon detecting HD token convergence, performs parallel decoding over stable candidates within a local context window, substantially accelerating generation. Extensive experiments on math, code and logical reasoning benchmarks demonstrate that FoCore consistently improves generation quality and efficiency across both LLaDA and Dream backbones. For instance, on HumanEval, FoCore improves pass@1 from 39.02 to 42.68 over standard Classifier-Free Guidance, while FoCore-A reduces the number of decoding steps by 2.07x and per-sample latency from 20.76s to 8.64s (-58.4%).
[NLP-95] Embedding-based In-Context Prompt Training for Enhancing LLM s as Text Encoders
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成嵌入表示时,因使用上下文学习(In-Context Learning, ICL)引入大量离散文本示例而导致的序列长度增加与计算开销上升的问题。其解决方案的关键在于提出EPIC策略,通过将原始文本示例替换为连续嵌入表示(continuous embeddings),在对比学习中引导模型对语义相关文本对进行对齐,同时要求模型将这些嵌入作为上下文提示的一部分进行理解,从而在训练和推理阶段均显著降低计算负担并提升嵌入质量。
链接: https://arxiv.org/abs/2605.01372
作者: Ailiang Lin,Zhuoyun Li,Keyu Mao,Kotaro Funakoshi,Manabu Okumura
机构: Institute of Science Tokyo (东京科学研究所); Tencent (腾讯)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have been widely explored for embedding generation. While recent studies show that in-context learning (ICL) effectively enhances the representational capability of LLMs by prepending a few task-related demonstrations, it causes substantial token overhead due to the increased sequence length. In this work, we propose EPIC, a novel embedding-based in-context prompt training strategy that leverages ICL to generate high-quality embeddings while reducing computational burden during both training and inference. This approach replaces discrete text demonstrations with their corresponding continuous embeddings, which not only encourages the LLM to align semantically-related text pairs during contrastive learning, but also requires the model to interpret demonstration embeddings as part of the in-context prompt. Consequently, EPIC-trained models achieve excellent embedding performance both with or without in-context prompts at inference time. Comprehensive experiments demonstrate that our method establishes new state-of-the-art results on the MTEB benchmark, surpassing frontier models trained solely on publicly available retrieval data. Extensive ablation studies further validate the effectiveness and necessity of our mechanism.
[NLP-96] On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本生成任务中存在严重输出长度波动性(length volatility)的问题,这种波动性不仅导致计算资源浪费,还限制了模型在实际应用中的可靠性。现有研究多关注单次生成质量,忽视了生成结果的稳定性。解决方案的关键在于提出一种轻量级的解码阶段优化策略——通过对 logits 进行 boosting(GLoBo),在不增加额外训练成本的前提下显著提升长文本生成的长度准确性与稳定性:实验表明,该方法可使基础模型的平均输出长度提升148%,同时将长度波动降低69%,且保持高质量生成能力。
链接: https://arxiv.org/abs/2605.01357
作者: Zhitao He,Haolin Yang,Rui Min,Zeyu Qin,Yi R. Fung
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) excel at long-context understanding but exhibit significant limitations in long-form generation. Existing studies primarily focus on single-generation quality, generally overlooking the volatility of the output. This volatility not only leads to significant computational costs but also severely impacts the models’ reliable application. To address this gap, our work unfolds in three stages: benchmarking, probing, and mitigation. We first propose the VOlatility in Long-form Text Benchmark (VOLTBench), a novel heterogeneous-task benchmark designed to systematically quantify the length volatility of long-form generation. Subsequently, by analyzing attention traces, we conduct an in-depth probe to identify several common internal patterns that cause this volatility. Finally, to mitigate long-form output volatility, we propose Stable Generation via Logits Boosting (GLoBo), a lightweight decoding-stage optimization strategy, designed to significantly enhance both the length accuracy and stability of long-form generation without additional training. Extensive experiments on VOLTBench provide the first systematic confirmation of severe long-form output instability in mainstream models and validate that our proposed method successfully improves the mean output length of the base model by 148% and reduces the length volatility by 69%, while maintaining high generation quality.
[NLP-97] LLM Output Detectability and Task Performance Can be Jointly Optimized
【速读】: 该论文旨在解决生成式 AI (Generative AI) 中水印技术(watermarking)导致下游任务性能下降的问题。现有水印方法通过偏置词元分布嵌入可检测信号,虽具备统计可靠性,但常损害模型在问答、摘要和写作等任务上的表现。其解决方案的关键在于提出 PUPPET 框架,该框架利用强化学习对大语言模型(LLM)进行微调,同时优化两个奖励函数:一个检测器输出机器生成概率,另一个评估任务特定指标。实验表明,PUPPET 在保持高可检测性的同时显著提升下游任务性能,且仅需数千样本即可高效训练(1–2 GPU 小时),并展现出跨领域、跨模型家族和尺寸的鲁棒性,包括对抗改写攻击的能力。
链接: https://arxiv.org/abs/2605.01350
作者: Koshiro Saito,Ryuto Koike,Masahiro Kaneko,Naoaki Okazaki
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint. Under review
Abstract:Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design – it embeds detectable signals into LLM outputs by biasing their token distributions. However, it has been reported that watermarked LLMs often perform worse on downstream tasks. We propose PUPPET, a framework that fine-tunes an LLM via reinforcement learning to generate text that is both more detectable and better performing on downstream tasks. We use two reward functions: a detector that outputs a machine-class likelihood and an evaluator that measures a task-specific metric. Experiments on long-form QA, summarization, and essay writing show that LLMs trained with PUPPET achieve high detectability competitive with watermarking methods while outperforming them on downstream tasks. The analysis shows that this optimization can be performed efficiently with only a few thousand samples in 1–2 GPU hours. Moreover, these gains are consistent across out-of-domain tasks, different LLM families, and model sizes, and are even robust to paraphrasing attacks.
[NLP-98] MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
【速读】: 该论文旨在解决传统在线蒸馏(On-policy Distillation, OPD)方法中存在的两个核心问题:一是单教师能力上限导致的学生误差继承问题,二是OPD在代理任务(agentic tasks)中因多步误差累积而引发的训练不稳定问题。解决方案的关键在于提出两种创新机制:其一为多智能体辩论驱动的在线蒸馏(MAD-OPD),将单一教师重构为一个通过辩论生成共识决策的集体智能体,每个教师贡献基于事后辩论置信度加权,从而突破单教师性能瓶颈;其二为在线代理蒸馏(OPAD),引入步级采样策略以缓解长轨迹中多步误差传播对训练稳定性的影响。此外,论文还提出了任务自适应散度选择原则,根据任务特性动态选用Jensen-Shannon散度(JSD)或反向KL散度(reverse KL),显著提升了不同下游任务的蒸馏效果与鲁棒性。
链接: https://arxiv.org/abs/2605.01347
作者: Jianze Wang,Ying Liu,Jinlong Chen,Xuchun Hu,Qilong Zhang,Yu Cao,Jun Wang,Hua Yang,Yong Xie,Qianglong Chen
机构: Huazhong University of Science and Technology (华中科技大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. 9-page main paper + appendix. 8 figures, 7 tables. Code: this https URL
Abstract:On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student’s on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher’s contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B \to 4B setting it lifts the agentic average by +2.4% and the code average by +3.7% over the stronger single-teacher OPD.
[NLP-99] A Multi-View Media Profiling Suite: Resources Evaluation and Analysis
【速读】: 该论文旨在解决新闻媒体政治偏倚与事实性评估中缺乏统一资源、跨方法全面评估以及在标签稀疏和数据集多样性条件下对表征与融合策略系统性分析的问题,同时填补实证研究中关于哪些方法始终有效、哪些失败及其原因的空白。其解决方案的关键在于:构建大规模标注数据集MBFC-2025(约2600家媒体),建立多视角表征体系(涵盖Alexa图、超链接图、大语言模型生成图、文章内容及维基百科描述),并通过系统性实验验证不同嵌入视图与融合策略(包括基于强化学习的融合变体)的有效性,最终在ACL-2020上取得当前最优结果,并为MBFC-2025建立强基准。
链接: https://arxiv.org/abs/2605.01336
作者: Muhammad Arslan Manzoor,Dilshod Azizov,Daniil Orel,Umer Siddique,Zain Muhammad Mujahid,Yufang Hou,Preslav Nakov
机构: MBZUAI, UAE; Interdisciplinary Transformation University, Austria; University of Texas at San Antonio, USA; University of Copenhagen, Denmark
类目: Computation and Language (cs.CL)
备注:
Abstract:News outlets shape public opinion at a scale that makes automated detection of political bias and factuality essential. However, the field still lacks unified resources, comprehensive evaluations across diverse approaches, and systematic analyses of the representations and fusion strategies that matter most, especially under label sparsity and dataset diversity. In addition, there is little empirical work reporting broad, observation-driven findings about what consistently works, what fails, and why. We address these gaps through four main contributions. First, we introduce MBFC-2025, a large-scale label set covering approximately 2,600 outlets from Media Bias/Fact Check (MBFC). Second, we construct multiview representations for ACL-2020 (Panayotov et al., 2022), which includes around 900 outlets, as well as for MBFC-2025. These representations span Alexa graphs, hyperlink graphs, LLM-derived graphs, articles, and Wikipedia descriptions. Third, we provide a systematic evaluation and analysis of embedding views and fusion strategies, including a reinforcement learning-based fusion variant. Fourth, we conduct extensive experiments that achieve state-of-the-art results on ACL-2020 and establish strong benchmarks on MBFC-2025.
[NLP-100] OralMLLM -Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在牙科影像分析中对多层级认知过程(如感知、理解、预测与决策)的捕捉能力尚不明确的问题。其解决方案的关键在于构建了一个涵盖根尖片、全景片和侧位头颅片三种临床重要影像模态的综合性基准测试体系,定义了四个认知类别,并基于公开数据集设计了27个临床相关任务,辅以人工标注和3820次临床医生评估,系统性地评估了六种前沿MLLMs(包括GPT-5.2和GLM-4.6)的表现,从而揭示了模型与临床医生之间的性能差距、优势与局限性,并识别出常见失败模式,为下一代符合临床认知逻辑、安全规范及工作流复杂性的牙科人工智能系统开发提供方向。
链接: https://arxiv.org/abs/2605.01333
作者: Rongyang Wang,Shuang Zhou,Jiashuo Wang,Wenya Xie,Xiaoxia Che
机构: 未知
类目: Computation and Language (cs.CL)
备注: 21 pages, 4 figures, 4 tables
Abstract:Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model strengths and limitations, characterize failure patterns, and provide recommendations for improvement. This data resource will facilitate the development of next-generation artificial intelligence systems aligned with clinical cognition, safety requirements, and workflow complexity in dental practice.
[NLP-101] Creating and Evaluating Figurative Language Dataset for Sindhi
【速读】: 该论文旨在解决乌尔都语和印度斯坦语等低资源语言中隐喻语言识别任务缺乏高质量标注数据集的问题,特别是针对信德语(Sindhi)的修辞语言分类问题。解决方案的关键在于构建了一个名为SiNFluD的新基准数据集,该数据集通过从博客、社交媒体和文学来源收集原始文本并进行人工标注,实现了较高的标注一致性(Kappa系数为0.81),同时在多语言预训练模型(如mBERT、XLM-RoBERTa及XLM-RoBERTa-XL)和SetFit少样本微调方法上进行了系统评估,最终表明XLM-RoBERTa-XL在该任务上表现最优,为后续低资源语言的修辞理解研究提供了可靠的数据基础与模型基线。
链接: https://arxiv.org/abs/2605.01323
作者: Wazir Ali,Adeeb Noor,Saifullah Tumrani
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In this article, we introduce SiNFluD, a novel benchmark dataset for Sindhi figurative language classification. We first collect raw text from various blogs, social media platforms, and literary sources, and subsequently prepare the corpus for annotation. Two native annotators label the data using the Doccano text annotation tool, achieving an inter-annotator agreement of 0.81. We then establish baseline results using 5-fold and 10-fold cross-validation. Finally, we evaluate mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL models, along with SetFit for few-shot fine-tuning of sentence transformers. Among these, the pretrained XLM-RoBERTa-XL achieves the best performance.
[NLP-102] Benchmarking LightGBM and BiLSTM for Sentiment Analysis on Indonesian E-Commerce Reviews
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)中情感分析任务的模型性能优化问题,特别是在印尼电商评论文本分类场景下的模型选择与效果对比。其解决方案的关键在于系统性地比较基于PyCaret AutoML框架的机器学习(Machine Learning, ML)方法(包括LightGBM、逻辑回归和SVM)与深度学习(Deep Learning, DL)方法中的双向长短期记忆网络(Bidirectional Long Short-Term Memory, BiLSTM)在相同数据集上的表现差异。实验结果表明,BiLSTM凭借对序列上下文信息的强大建模能力,在准确率(98.87%)和F1分数(98.87%)上显著优于所有ML模型,证明其在捕捉印尼语评论文本语义依赖关系方面具有更强的能力,从而成为该任务的最优解。
链接: https://arxiv.org/abs/2605.01322
作者: Lidia Natasyah Marpaung,Vania Claresta,Iqfina Haula Halika,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures, 2 tables. The paper compares LightGBM, Logistic Regression, and linear SVM against a BiLSTM model for Indonesian e-commerce sentiment analysis using a 15,000-sample dataset from Hugging Face
Abstract:This study presents a comparative analysis between two primary approaches in Natural Language Processing (NLP): Machine Learning (ML) utilizing the PyCaret AutoML framework, and Deep Learning (DL). The evaluation is conducted on a sentiment analysis task using an Indonesian e-commerce review dataset sourced from Hugging Face. The dataset, consisting of 15,000 samples, is partitioned into training, validation, and testing sets. The ML experiments compare LightGBM, Logistic Regression, and Support Vector Machine (SVM) algorithms, whereas the DL experiment implements a Bidirectional Long Short-Term Memory (BiLSTM) architecture. The experimental results demonstrate that the BiLSTM model outperforms all ML models, achieving an accuracy of 98.87% and an F1-Score of 98.87%. Meanwhile, LightGBM emerges as the best-performing ML model with an accuracy of 98.23% in a highly efficient training time. This research proves that the BiLSTM architecture is highly capable of capturing the sequential context of Indonesian review texts, making it the superior model for this specific classification task.
[NLP-103] Sentiment Analysis of Mobile Legends App Reviews Using Machine Learning and LSTM-Based Deep Learning Models
【速读】: 该论文旨在解决移动游戏应用(Mobile Legends)用户评论中情感分析的准确性问题,尤其是针对非正式、依赖上下文的语言特征。其解决方案的关键在于对比传统机器学习方法与基于长短期记忆网络(LSTM)的深度学习模型的效果,发现LSTM通过捕捉文本序列中的时序依赖关系,在处理复杂语义和上下文信息方面具有显著优势,最终实现92%的准确率和91%的加权F1分数,验证了深度学习在非结构化用户评论情感识别中的有效性。
链接: https://arxiv.org/abs/2605.01317
作者: Vira Putri Maharani,Kharisa Harvanny,Daris Samudra,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (ITERA)
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures, includes comparative evaluation of Machine Learning and LSTM models for sentiment analysis on Indonesian Mobile Legends app reviews, with dataset description, methodology, model architecture, results, discussion, acknowledgments, and references
Abstract:This paper compares Machine Learning and LSTM-based Deep Learning methods for sentiment analysis of Mobile Legends app reviews. Using a dataset of 10,000 reviews labeled as positive, negative, and neutral, the study evaluates traditional models with TF-IDF and PyCaret AutoML and compares them against an LSTM model designed to capture sequential text dependencies. The results show that the LSTM model outperforms the classical Machine Learning baselines, achieving 92% accuracy and a weighted F1-score of 91%. The findings indicate that deep learning is more effective for handling informal and context-dependent user review text.
[NLP-104] Enhancing Game Review Sentiment Classification on Steam Platform with Attention-Based BiLSTM
【速读】: 该论文旨在解决Steam游戏评论中的情感分类问题,以帮助开发者更好地理解玩家反馈。其关键解决方案是采用基于注意力机制的双向长短期记忆网络(BiLSTM+Attention)模型,通过类权重交叉熵损失函数处理类别不平衡问题,并利用注意力可视化增强模型的可解释性,最终在测试集上实现了83%的准确率和85%的加权F1分数,对负面评论的召回率达到90%。
链接: https://arxiv.org/abs/2605.01315
作者: Abit Ahmad Oktarian,Fadhil Fitra Wijaya,Dhafin Razaqa Luthfi,Luluk Muthoharoh,Ardika Satria,Martin Clinton Tosima Manullang
机构: Institut Teknologi Sumatera (印尼科技大学)
类目: Computation and Language (cs.CL)
备注: 7 pages, 4 figures, and 2 tables. The paper is a research manuscript on sentiment analysis of Steam game reviews, comparing TF-IDF-based machine learning methods with a BiLSTM+Attention deep learning model
Abstract:This paper investigates sentiment classification of Steam game reviews using an attention-based Bidirectional Long Short-Term Memory (BiLSTM) model. Using a dataset of 50,000 reviews sampled from a larger Steam review corpus, the authors compare a traditional machine learning baseline based on TF-IDF and PyCaret AutoML with a deep learning approach implemented in PyTorch. The proposed BiLSTM+Attention model is trained with class-weighted cross-entropy to address class imbalance and achieves 83% accuracy and 85% weighted F1-score on the test set, with 90% recall for negative reviews. The paper also presents attention visualizations to show interpretability by highlighting sentiment-bearing words. The study concludes that the BiLSTM+Attention model is effective for analyzing user sentiment in Steam reviews and useful for helping developers understand player feedback.
[NLP-105] Addressing Data Scarcity in Bangla Fake News Detection: An LLM -Based Dataset Augmentation Approach
【速读】: 该论文旨在解决低资源语言(如孟加拉语)中虚假新闻检测模型因训练数据少且分布不均而导致性能受限的问题。其核心解决方案是提出一种系统性的大语言模型(Large Language Model, LLM)增强框架,利用指令微调的Gemma 3 27B IT模型生成合成孟加拉语新闻文本,并结合语义过滤和受控子采样策略以保持标签一致性与多样性。实验表明,仅对少数类样本进行高比例增强并配合随机子采样可显著提升F1分数(从0.85提升至0.88),验证了精心设计的LLM驱动增强方法在低资源场景下对虚假新闻检测的有效性。
链接: https://arxiv.org/abs/2605.01292
作者: Ahmed Alfey Sani,Kazi Akib Zaoad,Shefayat E Shams Adib,Md Abdul Muqtadir,Ajwad Abrar
机构: Islamic University of Technology (伊斯兰科技大学)
类目: Computation and Language (cs.CL)
备注: Accepted in 15th ACM ICSCA, 2026 in Langkawi, Malaysia
Abstract:The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets. This study investigates whether Large Language Model (LLM) based augmentation can effectively address this limitation and improve Bangla fake news classification. Existing datasets remain valuable but highly imbalanced, limiting model performance, and LLM based augmentation for Bangla has been scarcely explored. To fill this gap, we propose a systematic augmentation framework that generates synthetic Bangla news articles using the instruction tuned Gemma 3 27B IT model, supported by semantic filtering and controlled subsampling to preserve label consistency and diversity. We compare zero shot and few shot prompting, evaluate multiple augmentation rates, and examine random versus similarity-based selection strategies. Our experiments show that augmenting only the minority class with a high augmentation rate and random subsampling yields the strongest gains, raising the Fake News F1 score from 0.85 to 0.88. To support reproducibility and further research in this low-resource domain, we publicly release 4,545 synthetically generated Bangla fake news samples along with our full implementation. These findings demonstrate that well-designed LLM-driven augmentation can significantly improve fake news detection in low resource settings and provide a practical foundation for advancing multilingual misinformation research.
[NLP-106] GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models
【速读】: 该论文旨在解决指令微调语言模型(instruction-tuned language models)在特定任务适应过程中缺乏有效指导的问题,即现有方法通常将指令微调模型视为被动目标,仅在最终合并阶段才参与,导致训练过程缺乏针对性优化。解决方案的关键在于提出GIFT(Guided Fine-Tuning and Transfer)框架,该框架通过引入来自指令微调模型的置信度信号(confidence signals),对预训练基础模型上的低秩适配器(low-rank adapter)进行引导式微调,从而在保留通用指令遵循能力的同时,实现任务特化的高效迁移。
链接: https://arxiv.org/abs/2605.01256
作者: Zhiwen Ruan,Yichao Du,Jianjie Zheng,Longyue Wang,Yun Chen,Peng Li,Jinsong Su,Yang Liu,Guanhua Chen
机构: Southern University of Science and Technology(南方科技大学); Alibaba Group(阿里巴巴集团); Shanghai University of Finance and Economics(上海财经大学); Tsinghua University(清华大学); Xiamen University(厦门大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:A promising paradigm for adapting instruction-tuned language models is to learn task-specific updates on a pretrained base model and subsequently merge them into the instruction-tuned model. However, existing approaches typically treat the instruction-tuned model as a passive target that is only involved at the final merging stage, without guiding the training process. We propose GIFT (Guided Fine-Tuning and Transfer), a simple and efficient framework that incorporates guidance from the instruction model into task adaptation. GIFT fine-tunes a low-rank adapter on the pretrained base model using confidence signals derived from the instruction-tuned model. The learned adapter is then merged into the instruction-tuned model, yielding task-specialized models that preserve general instruction-following behavior. We evaluate GIFT on mathematical and knowledge-intensive benchmarks across multiple model families and scales. Results show that GIFT consistently outperforms direct fine-tuning and representative transfer-based baselines, while maintaining robust generalization and favorable test-time scaling behavior.
[NLP-107] Attention Sinks in Massively Multilingual Neural Machine Translation:Discovery Analysis and Mitigation
【速读】: 该论文旨在解决多语言神经机器翻译(NMT)模型中交叉注意力(cross-attention)分析中存在的系统性偏差问题,即非内容标记(如句尾符、语言标签和标点符号)占据了83%至91%的总注意力质量,导致原始注意力指标严重低估内容层面的相似性(原始为36.7%,过滤后达70.7%)。这种偏差源于词汇设计而非位置偏置,使未校正的注意力分析结果不可靠。解决方案的关键在于提出一种“仅保留内容标记”的过滤方法:移除非内容令牌并重新归一化注意力分布,从而恢复被掩盖的语言学信号,如教师强制与生成模式间的注意力差异、语言家族聚类特征及索马里语中SOV语序与单调对齐的隐藏关联。
链接: https://arxiv.org/abs/2605.01229
作者: Hillary Mutisya,John Mugane
机构: Harvard University (哈佛大学); Meta (Meta)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Cross-attention patterns in neural machine translation (NMT) are widely used to study how multilingual models align linguistic structure. We report a systematic artifact in cross-attention analysis of NLLB-200 (600M): non-content tokens - primarily end-of-sequence tokens, language tags, and punctuation - capture 83 percent to 91 percent of total cross-attention mass. We term these “attention sinks,” extending findings from LLMs [Xiao et al., 2023] to NMT cross-attention and identifying a causal mechanism rooted in vocabulary design rather than position bias. This artifact causes raw metrics to underestimate content-level similarity by nearly half (36.7 percent raw vs. 70.7 percent filtered), rendering uncorrected analyses unreliable. To address this, we validate a content-only filtering methodology that removes non-content tokens and renormalizes the distribution. Applying this to 1,000 parallel sentences across African languages (Swahili, Kikuyu, Somali, Luo) and non-African benchmarks (German, Turkish, Chinese, Hindi), we confirm the artifact is universal and recover masked linguistic signals: a 16.9 percentage-point gap between teacher-forcing and generation modes, clear language-family clustering in attention entropy, and a hidden Somali paradox linking SOV word order to monotonic alignment. We release our filtering toolkit and corrected datasets to support reproducible interpretability research on multilingual NMT.
[NLP-108] Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLM s
【速读】: 该论文试图解决当前多语言自然语言处理(Natural Language Processing, NLP)中普遍存在的“偶然多语性”(incidental multilingualism)问题,即大型语言模型(Large Language Models, LLMs)看似具备多语言能力,实则源于训练数据的非均衡分布,而非系统性设计目标。这种范式导致模型在不同语言间表现不均、行为脆弱且缺乏透明度,尤其在涉及跨语言推理、规划与行动的实际部署场景中后果严重。论文的关键解决方案是提出“以设计为导向的多语言性”(multilingualism by design),将公平的多语言性能、文化语境嵌入以及跨语言行为理解作为模型开发全流程的核心目标,推动从数据驱动的被动多语性向目标导向的主动多语性转变。
链接: https://arxiv.org/abs/2605.01224
作者: Anjishnu Mukherjee,Chutong Meng,Antonios Anastasopoulos
机构: 未知
类目: Computation and Language (cs.CL)
备注: under review
Abstract:This paper argues that contemporary multilingual NLP has converged on a fragile and misleading paradigm of incidental multilingualism. Today’s LLMs appear multilingual largely because they are trained on massive, uneven web corpora, not because multilingual or multicultural competence has been treated as a core design objective. We contend that this paradigm systematically produces unequal, brittle, and opaque behavior across languages, with severe consequences in real-world and agentic deployments where models must reason, plan, and act across multiple linguistic contexts. We report a focused empirical study of two practical questions: which languages models self-report as supported and which languages they actually respond in across multilingual prompts. We additionally demonstrate how even a simple language-change attack can surface these failures and expose hidden assumptions about language in LLM-based systems. To address this, we call for a shift toward multilingualism by design: a research agenda that treats equitable multilingual performance, cultural grounding, and cross-lingual behavioral understanding as first-class goals in all aspects of the model pipeline.
[NLP-109] SRA: Span Representation Alignment for Large Language Model Distillation ACL2026
【速读】: 该论文旨在解决跨分词器知识蒸馏(Cross-Tokenizer Knowledge Distillation, CTKD)中因不同模型使用异构分词器而导致的token级对齐不稳定问题。现有方法主要依赖token-level对齐策略,但这类策略在面对分词差异时易失效。论文提出SRA(Span Representation Alignment)框架,其核心创新在于将对齐单位从脆弱的token升级为鲁棒的、与分词器无关的span(片段),并通过物理系统中的多粒子动力学视角建模:每个span被表示为其质心(Center of Mass, CoM),即注意力加权平均,从而捕获语义丰富信息;同时引入几何正则项保持表示空间结构完整性,并设计对齐span logits蒸馏机制提升跨架构知识迁移效果。此方案显著提升了CTKD在复杂场景下的稳定性和性能表现。
链接: https://arxiv.org/abs/2605.01205
作者: Quoc Phong Dao,Hoang Son Nguyen,Pham Khanh Chi,Tung Nguyen,Linh Ngo Van,Nguyen Thi Ngoc Diep,Trung Le
机构: Hanoi University of Science and Technology (河内科技大学); VNU University of Engineering and Technology (越南国家大学工程与技术学院); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026
Abstract:Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce \textbfSRA (\textbfSpan \textbfRepresentation \textbfAlignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM) - an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.
[NLP-110] GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
【速读】: 该论文旨在解决当前过程奖励模型(Process Reward Models, PRMs)在非数学推理场景下误差检测能力不足的问题,尤其在科学和逻辑等多样化推理领域缺乏系统性评估基准。其解决方案的关键在于提出GR-Ben——一个专门设计用于评估PRM在科学与逻辑两大主类及九个子领域中过程级错误识别能力的基准测试集,并通过在22种不同模型(包括PRMs和大语言模型LLMs)上的广泛实验,揭示了现有PRMs在非数学推理任务中表现显著弱于数学推理任务,且普遍难以识别知识型错误,而LLMs则更难发现计算错误。这一基准的建立有望推动PRMs在通用推理场景下的优化,从而提升大语言模型的整体推理能力。
链接: https://arxiv.org/abs/2605.01203
作者: Zhouhao Sun,Xuan Zhang,Xiao Ding,Bibo Cai,Li Du,Kai Xiong,Xinran Dai,Fei Zhang,weidi tang,Zhiyuan Kan,Yang Zhao,Bing Qin,Ting Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to possess capabilities for detecting process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error detection ability of PRMs across diverse reasoning scenarios. To mitigate this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRM’s performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is found to be markedly weaker by comparison.(2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs exhibit poorer performance in detecting computational this http URL hope GR-Ben can foster future researches on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.
[NLP-111] Compute Optimal Tokenization
【速读】: 该论文旨在解决现有语言模型训练中关于数据单位与模型规模之间 scaling law 关系的模糊认知问题,特别是 token 作为信息粒度对计算最优配置的影响尚未被充分探讨。其解决方案的关键在于构建并训练了988个潜在分词模型(latent tokenized models, BLT),这些模型可灵活设定压缩率(即每个 token 的平均字节数),从而系统性地研究压缩率在远超常用 BPE 分词器所得 4.57 字节/token 条件下的作用。实验发现,在计算最优配置下,模型参数量应与以字节为单位的数据量成比例增长,而非传统认为的以 token 数为单位;同时,最优压缩率随计算资源增加而降低,且该结论适用于潜在和子词分词方式以及非英语语言,为提升语言模型的计算效率提供了关键指导。
链接: https://arxiv.org/abs/2605.01188
作者: Tomasz Limisiewicz,Artidoro Pagnoni,Srini Iyer,Mike Lewis,Sachin Mehta,Alisa Liu,Margaret Li,Gargi Ghosh,Luke Zettlemoyer
机构: Meta(Meta)
类目: Computation and Language (cs.CL)
备注:
Abstract:Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
[NLP-112] Quantifying and Predicting Disagreement in Graded Human Ratings LREC
【速读】: 该论文旨在解决人类标注者在对不当语言(如仇恨言论、攻击性语言和有毒语言)进行分级标注时存在的标注差异问题,尤其是探究这种差异是否可由文本特征预测。其核心解决方案在于提出“对立指数(Opposition Index)”,用于量化标注者在特定条目上的观点对立程度,并验证该指标在预测高分歧样本中的有效性。研究发现,标注方差与文本特征之间存在中等程度的正相关,且直接预测方差值与通过预测标注分布推导方差两种方法性能相当;同时,高对立指数的样本更难被模型准确预测,常被低估。
链接: https://arxiv.org/abs/2605.01168
作者: Leixin Zhang,Çağrı Çöltekin
机构: University of Tübingen (图宾根大学)
类目: Computation and Language (cs.CL)
备注: Accepted by the 5th Workshop on Perspectivist Approaches to NLP at LREC
Abstract:It is increasingly recognized that human annotators do not always agree, and such disagreement is inherent in many annotation tasks. However, not all instances in a given task elicit the same degree of opinion divergence. In this paper, we investigate annotation variation patterns in graded human ratings for inappropriate languages, including offensive language, hate speech, and toxic language perception. We examine whether the degree of annotation disagreement can be predicted from textual features. We further propose the Opposition Index, a metric that quantifies perspective opposition among annotators on a given item, and investigate the predictability of instances with potentially opposing human opinions. Our results show a moderate positive correlation between estimated and observed annotation variance. We find that two approaches achieve comparable performance in variance prediction: directly predicting the variance value and estimating it from predicted annotation distributions. Our results on opposition perspective prediction show that items with high opposition index values are more difficult to predict and are often underestimated by models.
[NLP-113] Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理具有周期性结构的概念(如月份、星期等)时,其内部计算机制是否直接利用了这些概念的几何特性(例如模12运算)的问题。研究发现,尽管模型对这类概念的表示呈现圆周结构(circular structure),但其推理过程并未直接执行与周期相关的模运算,而是通过一个通用加法机制进行计算:首先基于十进制数进行加法运算(如“六 + 八月 = 十四”),再将结果映射回周期性概念空间(如14 → 二月)。解决方案的关键在于识别出一种任务无关的傅里叶特征(task-agnostic Fourier features)机制,这些特征的周期(如2、5、10)遵循标准十进制加法规则,而非周期性概念本身的周期(如12)。进一步地,研究还发现仅约0.2%的MLP神经元(位于第18层)构成稀疏子集,可被划分为互不重叠的簇,每个簇负责计算某一特定周期的傅里叶特征,从而揭示了因果抽象(causal abstraction)与特征几何(feature geometry)之间的协同作用,为理解LLMs的内部工作机制提供了新的视角。
链接: https://arxiv.org/abs/2605.01148
作者: Sheridan Feucht,Tal Haklay,Usha Bhalla,Daniel Wurgaft,Can Rager,Raphaël Sarfati,Jack Merullo,Thomas McGrath,Owen Lewis,Ekdeep Singh Lubana,Thomas Fel,Atticus Geiger
机构: Northeastern University (东北大学); Technion IIT (以色列理工学院); Harvard University (哈佛大学); Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Does structure in representations imply structure in computation? We study how Llama-3.1-8B reasons over cyclic concepts (e.g., “what month is six months after August?”). Even though Llama-3.1-8B’s representations for these concepts are circularly structured, we find that instead of directly computing modular addition in the period of the cyclic concept (e.g., 12 for months), the model re-uses a generic addition mechanism across tasks that operates independently of concept-specific geometry. First, it computes the sum of its two inputs using base-10 addition (six + August=14). Then, it maps this sum back to cyclic concept space (14-February). We show that Llama-3.1-8B uses task-agnostic Fourier features to compute these sums–in fact, these features have periods that respect standard base-10 addition, e.g., 2, 5, and 10, rather than the cyclic concept period (e.g., 12 for months). Furthermore, we identify a sparse set of 28 MLP neurons re-used across all tasks (approximately 0.2% of the MLP at layer 18) that can be partitioned into disjoint clusters, each computing the sum for a Fourier feature with a different period. Our work highlights how an interplay between causal abstraction and feature geometry can deepen our mechanistic understanding of LMs.
[NLP-114] When Less is Enough: Efficient Inference via Collaborative Reasoning
【速读】: 该论文旨在解决大模型端到端推理带来的高计算成本问题,即依赖单一大型模型完成全部推理与预测任务时,往往导致显著的推理开销。其解决方案的关键在于提出一种双阶段协同推理框架DUET(Dual-model Efficient Two-stage inference),通过将推理过程拆分为两个阶段:由能力较强的模型生成推理信号,再由轻量级模型基于该信号输出最终答案。这种分工机制使复杂推理任务交由大模型处理,而低推理强度的组件则由轻量模型承担,从而在不牺牲任务性能的前提下大幅降低整体推理成本。为实现高效信息传递,作者进一步设计了一种长度惩罚联合训练目标,促使强模型仅传输轻量模型完成任务所必需的信息,使得DUET在AIME和GPQA等复杂推理基准上相比纯大模型端到端推理最多节省60%的大模型输出token。
链接: https://arxiv.org/abs/2605.01111
作者: Yilei Chen,Sharut Gupta,Yannis Paschalidis,Ayush Sekhari,Aldo Pacchiano
机构: Boston University (波士顿大学); MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); Biohub (生物枢纽)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to perform end-to-end reasoning and prediction often incurs substantial inference cost. In contrast, DUET decomposes inference into two stages: the capable model produces a reasoning signal, and the lightweight model interprets this signal to generate the final answer, allowing reasoning-intensive computation to be handled by the capable model while non-reasoning-intensive components are delegated to the lightweight model without sacrificing task performance. To achieve this objective, we propose a length-penalized joint training objective that encourages the capable model to transmit only the information that is sufficient for the lightweight model to solve the task. As a result, DUET maintains strong reasoning performance with substantially lower inference cost than end-to-end inference using a large model alone, saving up to 60% of the large model’s output tokens on challenging reasoning benchmarks, including AIME and GPQA.
[NLP-115] Component-Aware Self-Speculative Decoding in Hybrid Language Models
【速读】: 该论文旨在解决自回归推理中效率低下的问题,传统方法依赖外部模型进行候选token的草稿生成与验证,而自适应推测解码(self-speculative decoding)试图利用模型内部结构实现零成本草稿生成。其关键创新在于提出组件感知的自适应推测解码(component-aware self-speculative decoding),首次利用混合语言模型(hybrid language models)内部架构异质性,将状态空间模块(SSM)或线性注意力子图作为无额外计算开销的内部草稿生成器。实验表明,混合架构的组件排列方式——即并行还是串行组合——显著影响推测接受率(acceptance rate),其中并行结构(如Falcon-H1)表现优异(α=0.68),而串行结构(如Qwen3.5)则极低(α=0.038),揭示了模型组成模式而非单一组件存在才是决定组件级自推测可行性的核心因素。
链接: https://arxiv.org/abs/2605.01106
作者: Hector Borobia,Elies Seguí-Mas,Guillermina Tormo-Carbó
机构: Universitat Politècnica de València (瓦伦西亚理工大学); Institute of Materials Science of Valencia (瓦伦西亚材料科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, 1 figure, 9 tables. Code: this https URL
Abstract:Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038 – an 18x gap attributable to how each architecture integrates its components. The property is scale-invariant: Falcon-H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15x ratio (Falcon) maps to alpha = 0.37 at k=4, while 81.96x (Qwen) maps to alpha = 0.019. For sequential hybrids, generic LayerSkip achieves 12x higher acceptance rates than the component-aware strategy. The composition pattern of hybrid models – not merely the presence of alternative components – determines whether component-level self-speculation is viable.
[NLP-116] Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy
【速读】: 该论文旨在解决传统口吃(stuttering)评估与治疗规划流程中效率低、依赖人工且个性化不足的问题。其解决方案的关键在于构建一个基于智能体(agent-based)的虚拟言语治疗师(Virtual Speech Therapist, VST)平台,该平台通过深度学习驱动的口吃分类模型实现语音样本的自动化特征提取与类型识别,并引入多智能体大语言模型(multi-agent large language model, LLM)推理机制,使多个专业化LLM代理能够自主生成、批判性评估并迭代优化个体化治疗方案。其中,专用评判代理确保所有生成方案符合临床安全性、方法学严谨性和循证医学标准,最终输出供临床医生审核的定制化治疗草案,形成“临床医生在环”(clinician-in-the-loop)的闭环工作流,显著提升治疗规划的效率与质量。
链接: https://arxiv.org/abs/2605.01101
作者: Shakeel Sheikh,Patrick Marmaroli,MD Sahidullah,Slim Ouni,Fabrice Hirsch,Goncalo Leal,Bjorn W Schuller
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Under Review
Abstract:This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification, and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. The VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system’s potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at: this https URL , facilitating real-time stuttering assessment and personalized therapy planning.
[NLP-117] Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student Dialogues
【速读】: 该论文旨在解决对话式知识追踪(Conversational Knowledge Tracing, CKT)中缺乏对学生能力与题目难度显式建模的问题,以及现有方法依赖黑箱式大语言模型(LLM)潜在表示导致预测不可解释的局限性。其解决方案的关键在于构建一个可解释的、难度感知的对话式知识追踪框架,该框架通过融合原始文本问题和教师后续提问内容来估计学生知识状态,并引入项目反应理论(Item Response Theory, IRT)将LLM输出映射为学生能力与题目难度参数,从而实现基于认知学习理论的可解释性能预测。
链接: https://arxiv.org/abs/2605.01097
作者: Shuyan Huang,Alexander Scarlatos,Jaewook Lee,Andrew Lan
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures
Abstract:Recent advances in large language models (LLMs) have led to the development of AI-powered tutoring systems that provide interactive support via dialogue. To enable these tutoring systems to provide personalized support, it is essential to assess student performance at each turn, motivating knowledge tracing (KT) in dialogue settings. However, existing dialogue-based KT approaches often ignore question difficulty modeling and rely on opaque latent representations from LLMs, hindering accurate and interpretable prediction. In this work, we propose an interpretable difficulty-aware conversational KT framework built upon LLMs, which explicitly models students’ abilities and the difficulty of tutor-posed tasks at each turn. The framework incorporates the original textual question and the next tutor-posed task to estimate the student’s knowledge state and the difficulty of the upcoming turn. Furthermore, it integrates Item Response Theory to map LLM’s outputs into student ability and question difficulty parameters, enabling interpretable prediction of student performance grounded in cognitive theories of learning. We evaluate the framework on two tutor-student dialogue datasets. Both quantitative and qualitative results show that our framework outperforms existing KT baselines, meanwhile generating interpretable outputs consistent with cognitive theory.
[NLP-118] aching LLM s Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在巴西临床指南特定知识上的表现不佳问题,尤其是在巴西葡萄牙语语境下缺乏专门评估基准的现状。针对这一挑战,作者通过将Qwen2.5-14B-Instruct模型适配至巴西临床领域,利用178份官方指南(约540万token)生成约7000万token的合成数据(包括重述、维基风格文章和问答对),并采用持续预训练与组相对策略优化(GRPO)进行微调。其关键创新在于:一是使用多种生成器提升合成数据多样性,二是引入强化学习机制优化模型对临床事实的准确召回能力。实验表明,该方法在自建的HealthBench-BR和PCDT-QA两个基准上分别达到83.9%和85.4%的准确率,显著优于多个主流大模型,且仅用14B参数即实现超越更大规模模型的效果。
链接: https://arxiv.org/abs/2605.01077
作者: Hugo Abonizio,Filipe Rocha Lopes,Roberto Lotufo,Rodrigo Nogueira
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Brazil’s Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline-specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5-14B-Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats – rephrases, wiki-style articles, and question-answer pairs – using four generator LLMs. We then apply continual pre-training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench-BR, with 1,780 balanced true/false clinical assertions, and PCDT-QA, with 890 open-ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview’s web-grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at this https URL
[NLP-119] Controlled Paraphrase Geometry in Sentence Embedding Space: Local Manifold Modeling and Latent Probing
【速读】: 该论文旨在解决句子嵌入空间中语义相近句子所构成的嵌入云(embedding clouds)的局部几何结构如何组织的问题,特别是控制性类比语义变化(controlled paraphrase-like semantic variation)在嵌入空间中的分布特性及其是否可通过低阶拟合模型显式建模。其解决方案的关键在于提出一种基于仿射、二次和三次拟合模型的局部几何建模框架,并引入一种基于曲面的潜在探测方法(surface-based latent probing),该方法在降维后的局部主成分分析(PCA)空间中构造合成潜在点,以实现对局部流形结构的显式建模与几何感知的潜在空间探测。实验表明,非线性局部模型比仿射模型更准确地描述嵌入云,且表面生成方法在拟合几何保真度(包括曲面一致性、基于Hessian的形状一致性及系数一致性)方面表现优异;但几何有效性并不自动转化为下游分类性能提升,强调了区分几何合理性与判别效用的重要性。
链接: https://arxiv.org/abs/2605.01073
作者: Leonid Bedratyuk
机构: Khmelnytsky National University (赫梅利尼茨基国立大学)
类目: Computation and Language (cs.CL)
备注: 45 pages
Abstract:The paper studies the local geometry of embedding clouds induced by \emphcontrolled local classes of semantically close sentences. The central question is how controlled paraphrase-like semantic variation is organized in sentence embedding space and whether this local structure can be explicitly modeled by low-degree fitted carriers. We introduce a local geometric modeling scheme based on affine, quadratic, and cubic fitted models. We also use a surface-based latent probing procedure that constructs synthetic latent points in a reduced local PCA space with respect to the fitted carrier. The procedure is intended as an offline method for representation-space analysis, local manifold modeling, and geometry-aware latent probing. Generated latent points are evaluated using criteria that measure consistency with the fitted surface, preservation of neighborhood structure, agreement with the empirical distribution, stability of Hessian-based second-order shape descriptors, and stability of fitted-model coefficients. Experiments on controlled sets of semantically close sentences show that nonlinear local models describe embedding clouds more accurately than affine models. Surface-based generation provides strong fitted-geometry fidelity, including surface consistency, Hessian-based shape consistency, and coefficient consistency. Downstream experiments show that geometric validity of synthetic latent points does not automatically translate into improved classification performance. The results support explicit local geometric modeling of sentence embedding space and highlight the need to distinguish geometric validity from discriminative utility. As a resource contribution, we introduce \textbfCoPaGE-300K, a controlled template-based dataset of semantically close sentence variants with slot-level annotations and precomputed sentence embeddings. Comments: 45 pages Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.01073 [cs.CL] (or arXiv:2605.01073v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.01073 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Leonid Bedratyuk [view email] [v1] Fri, 1 May 2026 20:12:06 UTC (967 KB)
[NLP-120] A Systematic Exploration of Text Decomposition and Budget Distribution in Differentially Private Text Obfuscation
【速读】: 该论文旨在解决差分隐私(Differential Privacy, DP)条件下文本混淆(text obfuscation)中的隐私-效用权衡问题,即如何在保障隐私的前提下最大程度保留文本的语义完整性。其关键解决方案在于系统性地评估多种文本分解(text decomposition)与隐私预算分配(budget distribution)策略的组合效果,发现即使在相同总体隐私预算 ε 下,不同的分块方法与 ε 分配机制会显著影响最终输出文本的质量与隐私保护强度,从而为优化差分隐私文本混淆流程提供了实证依据和设计指导。
链接: https://arxiv.org/abs/2605.01065
作者: Stephen Meisenbacher,Angelo Kleinert,Florian Matthes
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注: 22 pages, 5 figures, 12 tables. Accepted to PrivateNLP 2026
Abstract:The goal of differentially private text obfuscation is to obfuscate, or “perturb”, input texts with Differential Privacy (DP) guarantees, such that the private output texts are quantifiably indistinguishable from the originals. While perturbation at the word level is intuitive, meaningful text privatization happens on complete documents. Recent research has laid the groundwork for reasoning about privacy budget distribution, namely, how an overall \varepsilon budget can be sensibly distributed among the component pieces of a text. We perform a systematic evaluation of multiple text decomposition and budget distribution techniques in the context of DP text obfuscation, testing how different methods for chunking texts can be combined with techniques for allocating \varepsilon to these chunks. Our experiments reveal that such design choices are very important, as even with comparable privacy budgets, significantly different results can occur based on which methods are chosen. In this, we provide credible evidence of the feasibility of maximizing empirical trade-offs by optimizing DP obfuscation procedures.
[NLP-121] LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference ACL2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)中Transformer模型推理效率优化的两个主流方法——层对齐蒸馏(layer-aligned distillation)与基于收敛性的早期退出(convergence-based early exit)之间的系统性不兼容问题。研究表明,在标准部署条件下,传统蒸馏目标会抑制中间层表示的收敛特性,从而使得早期退出机制在蒸馏模型上失效。解决方案的关键在于提出一种无需架构修改的辅助训练目标LEAP(Layer-wise Exit-Aware Pretraining),其通过引入一个单一约束,确保学生模型的中间层近似于最终层表示,从而在保持蒸馏性能的同时恢复早期退出的有效性。实验表明,LEAP-MiniLM在批处理大小为1、NVIDIA L4硬件环境下实现了1.61倍的实际运行时加速(θ=0.95),且91.9%的样本能在第7层前退出,理论层数减少达1.80倍,显著优于标准蒸馏模型零有效加速的结果。
链接: https://arxiv.org/abs/2605.01058
作者: Shashank Kapadia,Deep Naryan Mishra,Sujal Reddy Alugubelli,Haoan Wang,Saipraveen Vabbilisetty,Rishi Bhatia,Anupriya Sharma
机构: Walmart Inc.(沃尔玛公司); Sam’s Club (山姆会员商店)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ACL 2026 (Industry Track). 14 pages, 5 figures
Abstract:Layer-aligned distillation and convergence-based early exit represent two predominant computational efficiency paradigms for transformer inference; yet we establish that they exhibit systematic incompatibility under standard deployment conditions for convergence-based early exit. Distillation objectives that align intermediate student layers to teacher representations suppress the representational convergence that early-exit mechanisms exploit, rendering such mechanisms ineffective on distilled models. We introduce LEAP (Layer-wise Exit-Aware Pretraining), an auxiliary training objective that reconciles this incompatibility. LEAP requires no architectural modifications; it augments standard distillation with a single constraint ensuring intermediate layers approximate final-layer representations. LEAP-MiniLM achieves 1.61 \times measured wall-clock speedup (batch=1, NVIDIA L4) at \theta =0.95, with 91.9% of samples exiting by layer 7 and 1.80 \times theoretical layer reduction, where standard distilled models achieve zero effective speedup. We validate across sentence similarity (STS-B: 0.760 \pm 0.006) and retrieval benchmarks (BEIR), providing operational guidance including latency measurements, decision thresholds, and deployment criteria.
[NLP-122] Compared to What? Baselines and Metrics for Counterfactual Prompting
【速读】: 该论文旨在解决当前基于反事实提示(counterfactual prompting)的模型评估方法中存在的因果推断偏差问题,即在测量特定变量(如患者性别)对大语言模型(LLM)输出影响时,未能排除由文本表面形式变化引发的普遍模型敏感性干扰,从而导致错误归因。其解决方案的关键在于提出一个对比框架:通过统计检验比较目标干预(target intervention)与“语义保持”式改写(paraphrasing)所诱发的输出差异,以识别出真正由目标变量驱动的效应,而非混杂的表层变化所致。该框架有效区分了特异性敏感性和一般性敏感性,在MedPerturb数据集上的复现分析显示,多数原报告的敏感性效应在控制基线后不再显著,而职业传记分类任务中仍检测到显著的性别偏向,验证了该方法对微小但方向明确效应的识别能力。
链接: https://arxiv.org/abs/2605.01048
作者: Zihao Yang,Mosh Levy,Yoav Goldberg,Byron C. Wallace
机构: Northeastern University (东北大学); Bar-Ilan University (巴伊兰大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 24 pages, 10 figures. Under review
Abstract:Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate things like LLM bias and CoT faithfulness. But in this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline ``meaning-preserving’’ modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit a analysis done on the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics – aggregate, per-sample distributional, and regression – and find that per-sample metrics are dramatically more powerful than aggregate metrics and regression powerfully and uniquely characterizes effect direction and magnitude.
[NLP-123] LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成任务中频繁出现的幻觉问题,特别是模型会虚构不存在的软件包并推荐其导入和安装命令,从而引发“包混淆攻击”(slopsquatting),即攻击者注册这些虚假包以植入恶意载荷。传统方法要么导致模型性能严重下降,要么依赖预设的遗忘集(forget-set),难以应对无限扩展的幻觉空间。解决方案的关键在于提出一种后部署框架——自适应遗忘(Adaptive Unlearning, AU),其核心是引入一种混合的词元级目标函数,同时强化有效输出并抑制幻觉内容,并结合一个无需人工监督的自适应发现循环,持续识别新的诱发幻觉的上下文。该方法实现了对未见过提示和幻觉类型的泛化能力,在显著降低81%的包幻觉率的同时,保持了标准编程基准上的性能,且分布变化集中于包相关生成,不影响整体编码行为,具备跨领域泛化能力。
链接: https://arxiv.org/abs/2605.01047
作者: Joseph Spracklen,Pedram Aghazadeh,Farinaz Koushanfar,Murtuza Jadliwala
机构: University of Texas San Antonio (德克萨斯大学圣安东尼奥分校); University of California San Diego (加州大学圣地亚哥分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models frequently hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. This creates a critical supply-chain vulnerability: an attacker can proactively register such packages on public registries with malicious payloads that are subsequently installed and executed by developers or autonomous agents, a class of package confusion attack known as slopsquatting. Once a model is deployed, mitigating this failure mode is difficult: full retraining is costly, and existing approaches either cause severe degradation of model utility or rely on a pre-specified forget-set, an assumption that does not apply to the unbounded space of hallucinations. To address this problem, we present Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU enables generalization to unseen prompts and hallucinations. We demonstrate that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. Our analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU’s effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.
[NLP-124] A Theoretical Game of Attacks via Compositional Skills
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中因对抗性提示(adversarial prompts)导致的安全风险问题。现有对齐策略虽能限制有害行为,但仍易被精心设计的攻击绕过。论文提出一个理论框架,将攻击者与防御者之间的博弈形式化,并据此设计出理论上的最优响应攻击策略,该策略与多种现有对抗提示方法密切相关;进一步分析表明,攻击方在该博弈中具有内在优势,并由此推导出一个理论上最优的防御策略。其核心贡献在于通过博弈论建模揭示了攻击-防御关系的本质结构,并提供了可证明最优的攻防策略,从而为提升LLM安全性提供了坚实的理论基础与实践指导。
链接: https://arxiv.org/abs/2605.01034
作者: Xinbo Wu,Huan Zhang,Abhishek Umrawal,Lav R. Varshney
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2505.20841
Abstract:As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy and show that it is closely related to many existing adversarial prompting methods. We further analyze the resulting game, characterize its equilibria, and reveal inherent advantages for the attacker. Drawing on our theoretical analysis, we also derive a provably optimal defense strategy. Empirically, we evaluate a practical instantiation of the theoretically optimal attack and observe stronger performance relative to existing adversarial prompting approaches in diverse settings encompassing different LLMs and benchmarks.
[NLP-125] Psychologically Potent Computationally Invisible: LLM s Generate Social-Comparison Triggers They Fail to Detect
【速读】: 该论文旨在解决文本中社会比较(Social Comparison)信号的检测问题,特别是针对小红书(Xiaohongshu)平台上的纯文本帖子是否引发读者从第一人称视角感知到向上(UPWARD)、向下(DOWNWARD)或中性/无明确社会比较的问题。这一信号具有行为相关性但无法简化为情感极性,因此对理解用户心理和社会互动机制至关重要。解决方案的关键在于构建一个基于读者视角的基准数据集——XHS-SCoRE,并通过对比提示型大语言模型(Prompted LLMs)与监督式中文编码器基线模型的表现,揭示生成流畅性与可靠检测能力之间的系统性不匹配现象;同时提出诊断框架,用于识别在何种情况下社会意义显著的关系线索仅部分可见于基于提示的推理过程。
链接: https://arxiv.org/abs/2605.01017
作者: Hua Zhao,Jiapei Gu,Michelle Mingyue Gu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, preprint
Abstract:We introduce Xiaohongshu Social Comparison Reader Elicitation (XHS-SCoRE), a reader-grounded benchmark for detecting if a text-only Xiaohongshu (RedNote) post elicits UPWARD, DOWNWARD, or NEUTRAL/no clear social comparison from a first-person reader perspective. The task targets a socially meaningful relational signal that is behaviorally real yet not reducible to sentiment. Across prompted LLM classifiers and supervised Chinese encoder baselines, we find a consistent mismatch between generation fluency and reliable detection ability: the signal is textually learnable in-domain, but not robustly accessible to prompt-based classification. Prompted LLM classifiers exhibit stable, interpretable failure modes, especially neutralization of comparison-triggering posts and model-specific directional skew. A controlled pilot further shows that LLM-generated Xiaohongshu-style posts can shift perceived standing and comparison-related affect even when prompt-based detection of the same construct remains fragile. XHS-SCoRE contributes both a benchmark for reader-grounded comparison detection and a diagnostic framework for studying when socially meaningful relational cues remain only partially visible to prompt-based inference.
[NLP-126] CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLM s for Medicine
【速读】: 该论文旨在解决当前医疗大语言模型(Medical Large Language Model, LLM)评估中普遍存在的问题:现有基准测试多采用简化、考试式的题型,难以反映真实临床场景中医学问题的模糊性和不确定性。为更准确地衡量LLM在复杂决策空间中的可靠性,作者提出CLinical Evaluation of Ambiguity and Reliability (CLEAR)框架,其关键在于系统性扰动三个维度:(1)合理答案选项的数量;(2)是否存在“正确答案”或“拒答”选项;(3)答案选项的语义表述方式(如从“以上都不是”到“I don’t know”)。通过该框架对17个LLM在三个基准上的测试,揭示了现有评估方法的三大局限:模型在面对更多合理选项时准确性下降,且对错误答案的拒答能力减弱;当拒答选项由断言式转为不确定表达时,错误选择显著增加;此外,模型规模扩大反而加剧了“谦逊缺口”(humility deficit),即识别正确答案与拒答错误答案之间的性能差距。这一发现表明,单纯扩大模型规模无法解决LLM在医疗场景下的可靠性问题。
链接: https://arxiv.org/abs/2605.01011
作者: Kevin H. Guo,Chao Yan,Avinash Baidya,Katherine Brown,Xiang Goa,Juming Xiong,Zhijun Yin,Bradley A. Malin
机构: Vanderbilt University Medical Center (范德比尔特大学医学中心); Vanderbilt University (范德比尔特大学); Intuit AI Research (Intuit AI 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs’ reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model’s ability to identify the correct answer and abstain against incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection like “None of the Above” to uncertainty admission like “I don’t know” (IDK). Notably, just including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.
[NLP-127] Can AI Debias the News? LLM Interventions Improve Cross-Partisan Receptivity but LLM s Overestimate Their Own Effectiveness
【速读】: 该论文旨在解决政治极化背景下,党派性新闻媒体对跨党派信任的侵蚀问题,探索生成式 AI(Generative AI)是否能通过内容去偏(debiasing)提升保守派读者对自由派新闻标题的信任相关判断。其解决方案的关键在于区分不同层级的去偏策略:研究发现,仅对情绪化词汇进行细微替换(lexical debiasing)无效,而基于意识形态框架的实质性重述(substantive reframing)显著提升了保守派读者对自由派新闻的可信度、完整性和参与意愿,且未引发自由派群体的反弹效应。这表明,针对意识形态语义结构而非表面语言的干预更为有效,但当前大语言模型在量化精度和心理机制拟合度上仍不足,需人类监督以确保干预效果的可靠性。
链接: https://arxiv.org/abs/2605.01006
作者: Faisal Feroz,Jonas R. Kunst
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Partisan news media erode cross-partisan trust, but large language models (LLMs) offer a potential means of debiasing such content at scale. Across two pre-registered experiments, we tested whether LLM-generated debiasing of liberal news headlines could improve conservative readers’ trust-relevant judgments. Study 1 found that subtle lexical debiasing (replacing emotive words with more moderate synonyms) had no effect on any outcome. Study 2 found that a more substantive reframing intervention significantly increased conservatives’ perceived trustworthiness, completeness, and willingness to engage with liberal news headlines, without producing a backfire effect among a sample of liberals. In Study 1, the intervention produced robust effects among LLM-simulated silicon participants, whereas it had no impact on human readers. In Study 2, the intervention’s effects among silicon participants aligned directionally with human responses but were significantly larger in magnitude for some outcomes. Moderation analyses revealed that the model’s implicit theory of who responds to debiasing diverged from the psychological profile that actually predicted human responsiveness. These findings demonstrate that LLM-based debiasing can improve cross-partisan receptivity when targeting ideological framing rather than surface-level language, but that current models lack both the quantitative accuracy and qualitative psychological fidelity to evaluate their own interventions without human oversight.
[NLP-128] Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
【速读】: 该论文旨在解决如何在不依赖模型内部结构或预先假设行为特征的情况下,识别经过微调(finetuning)的大语言模型中潜在的有害或非预期行为的问题。其核心挑战在于,这些微调行为可能隐藏在模型输出中,难以通过常规方法发现。解决方案的关键在于利用微调模型对特定任务过度泛化(overgeneralization)的特性:通过生成大量由随机前缀引导的补全文本,并基于参考模型与微调模型之间的困惑度(perplexity)差异进行排序,从而筛选出最能揭示微调目标的高置信度补全结果。该方法无需访问原始预训练检查点,仅需微调模型的下一个词概率分布,即可有效识别多种类型模型生物(model organisms)中的微调意图,包括植入后门、内化虚假事实、对抗训练下的隐蔽不良行为等,且对API受限模型同样适用。
链接: https://arxiv.org/abs/2605.00994
作者: Mohammed Abu Baker,Luca Baroni,Dan Wilhelm
机构: Meridian Impact CIC (梅里迪安影响CIC)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond the intended context. First, we generate diverse completions from the finetuned model using short random prefills drawn from general corpora. Second, we rank completions by decreasing perplexity gap between reference and finetuned models. The top-ranked completions often reveal the finetuning objectives, without requiring model internals or prior assumptions about the behavior. We evaluate this on a diverse set of model organisms (N=76, 0.5 to 70B parameters), including backdoored models, models finetuned to internalize false facts via synthetic document finetuning, adversarially trained models with hidden concerning behaviors, and models exhibiting emergent misalignment. For the vast majority of model organisms tested, the method surfaces completions revealing finetuning objectives within the top-ranked results, with models trained via synthetic document finetuning or to produce exact phrases being particularly susceptible. We further show that the technique can be effective even without access to the exact pre-finetuning checkpoint: trusted reference models from different families can serve as effective substitutes. As the method requires only next-token probabilities from the finetuned model, it is compatible with API-gated models that expose token logprobs.
[NLP-129] Democratizing the medieval English legal tradition ICDAR
【速读】: 该论文旨在解决中世纪拉丁文手稿中法律文本难以阅读和数字化的问题,这些手稿是英美法系最早的法律记录,但因高度缩写且仅少数学者能解读,严重限制了学术研究的可及性。解决方案的关键在于构建一个包含193个刑事与民事案例共4029行文本的标注数据集,并开发一个开源端到端的转录管道,其中核心步骤包括:使用标准神经网络架构(R-Blla用于行分割、CNN+LSTM结合CTC解码用于手写识别)实现初步高精度转录(达79%词准确率),通过引入n-gram语言模型优化CTC解码进一步提升至82%,再借助生成式AI(Gemini Pro 3)进行后处理纠错,最终将词准确率提高到88%;同时对比TrOCR(基于Transformer的OCR模型)发现其虽词准确率相当,但字符准确率较低,因其过度猜测导致人类难以判断正确读音。该方法显著提升了历史法律文献的自动化转录能力,为法律史和中世纪研究提供了开放平台支持。
链接: https://arxiv.org/abs/2605.00977
作者: Michael Zhang,Elise Wang,Charlotte Whatley,Seth Strickland,Dylan Bannon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to International Conference on Document Analysis and Recognition (ICDAR) 2026
Abstract:The record of the beginning of the most widespread legal system in the world is contained in millions of pages of handwritten text. Most of the records of the first centuries of the Anglo-American legal system are hand-written in a highly abbreviated form of medieval Latin which only a few dozen scholars in the world are trained to read. In this interdisciplinary project, we construct a dataset of 4029 lines of text across 193 medieval criminal and civil cases. We then use the dataset to train an open-source end-to-end pipeline for transcribing these manuscripts. We first train standard neural network architectures for line segmentation and handwriting recognition (R-Blla and CNN+LSTM with CTC decoding, respectively) and show that they can already achieve 79% word accuracy, despite the relatively small training set and the challenge of expanding abbreviations. We then demonstrate that simple post-processing significantly boosts accuracy: adding an n-gram language model to the CTC decoder improves word accuracy to 82%, while asking Gemini Pro 3 to correct mistakes boosts accuracy to 88%. Finally, we compare the CNN+LSTM architecture with TrOCR, a transformer-based OCR architecture, demonstrating that TrOCR shows comparable word accuracy but worse character accuracy due to its over-willingness to guess, making it harder for humans to infer the correct reading. We incorporated our pipeline into a web portal (this http URL), opening up the English legal tradition to legal scholars, medievalists, and students.
[NLP-130] SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)模型在安全对齐机制下仍易受 jailbreaking 攻击的问题,尤其是现有自动化攻击方法缺乏对成功与失败经验的系统性利用,以及缺少在不同约束条件下可复用攻击规则的合理组合与选择机制,导致攻击策略难以积累可迁移知识并适应目标模型和不断演进的安全机制。解决方案的关键在于提出一种无需参数更新的自进化规则驱动型无训练劫持框架(Self-Evolving Rule-Driven Training-Free Jailbreak, SRTJ),其核心是将基于经验的攻击生成与基于答案集编程(Answer Set Programming, ASP)的规则选择及约束感知的规则组合相结合,并通过迭代验证器反馈联合优化有效攻击策略并分析失败模式,从而实现规则记忆的分层演化——明确区分长期、中期与短期规则,以捕捉稳定可迁移的攻击策略与瞬态适应行为,在探索与利用之间取得平衡,显著提升攻击性能的稳定性、鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2605.00974
作者: Jindong Li,Ying Liu,Yali Fu,Jinjing Zhu,Leyao Wang,Menglin Yang,Rex Ying
机构: HKUST (GZ); Jilin University; Yale University
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:LLMs are increasingly equipped with safety alignment mechanisms, yet recent studies demonstrate that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. While a growing body of work has explored automated jailbreak strategies, existing methods face several fundamental challenges, including the lack of systematic utilization of both successful and failed attack experiences, as well as the absence of principled mechanisms for composing and selecting reusable attack rules under diverse constraints. As a result, existing methods struggle to accumulate transferable knowledge over time and to reliably adapt attack strategies across different targets and evolving safety mechanisms. To address these issues, we propose a Self-Evolving Rule-Driven Training-Free Jailbreak (SRTJ) framework that systematically discovers, composes, and refines attack strategies through interaction and feedback, without updating model parameters. Specifically, SRTJ couples experience-driven attack generation with answer set programming (ASP)-based rule selection and constraint-aware composition, where iterative verifier feedback is leveraged to jointly refine successful strategies and analyze failure patterns. The resulting rule memory evolves in a hierarchical multi-level manner, explicitly organizing distilled attack knowledge into long-term, middle-term, and short-term rules, thereby capturing both stable transferable strategies and transient adaptive behaviors to effectively balance exploration and exploitation across attack attempts. Extensive experiments on mainstream jailbreak benchmark (HarmBench) demonstrate that SRTJ achieves strong and stable attack performance across different target LLMs, while exhibiting improved robustness and generalization compared to existing jailbreak methods. The code is available at this https URL.
[NLP-131] MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio ICML2026
【速读】: 该论文旨在解决当前医疗音频问答(Medical Audio Question-Answering)领域中因数据稀缺与标注成本高而导致的模型评估不足问题,尤其是现有基准测试难以覆盖复杂临床场景的局限性。其解决方案的关键在于构建MedMosaic数据集,该数据集包含多样化的真实与合成医疗音频类型(如生理音、带伪影的语音及长短不一的临床对话),并提供46,701个涵盖多选题、多轮交互和开放式问答的高质量标注样本,从而系统性地评估多跳推理和答案生成能力,为医疗领域多模态推理模型的发展提供了更具现实约束的基准。
链接: https://arxiv.org/abs/2605.00969
作者: Harshit Rajgarhia,Shuubham Ojha,Asif Shaik,Akhil Pothanapalli,Rachuri Lokesh,Abhishek Mukherji,Prasanna Desikan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICML 2026. 12 pages main text, 35 pages appendix, 5 figures, 7 tables
Abstract:We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices to mimic speech with artifacts as well as real short and long length clinical conversations to model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question-answers, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even state-of-the-art model like Gemini-2.5-pro can only achieve 68.1% accuracy approximately. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models.
[NLP-132] Energy-Based Constraint Networks: Learning Structural Coherence Across Modalities
【速读】: 该论文旨在解决跨模态结构一致性建模问题,即如何在不依赖特定领域标注数据的前提下,显式地学习并量化输入数据内部的结构性合理性(structural coherence),从而检测诸如文本中的语法错误或图像中的伪造痕迹等违反结构规律的异常。其核心解决方案是提出能量约束网络(energy-based constraint networks),该架构通过冻结预训练编码器(如BERT或DINOv2)提取特征后,利用具有双头注意力机制的状态空间模型(state-space model)输出一个标量能量值(衡量整体结构一致性)以及每个位置的能量得分(用于定位具体违规点)。关键创新在于:1)首次将模态内结构一致性建模为显式的能量景观(energy landscape),并实现逐位置分解;2)采用可组合的分支结构(composable branches),各分支独立训练以识别不同类型的违规模式,在推理时无需相互干扰;3)通过仅调整污染策略(corruption respecification)即可实现跨模态迁移(如从文本到视觉),且对编码器和任务域均保持高度解耦性。实验表明,该方法在文本和视觉任务中均取得显著性能,且参数效率高、泛化能力强。
链接: https://arxiv.org/abs/2605.00960
作者: Chirag Shinde
机构: Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 16 pages, 3 figures, 11 tables. Code: this https URL Weights: this https URL
Abstract:We introduce energy-based constraint networks – a modality-agnostic architecture that learns structural coherence from contrastive pairs. The system processes frozen encoder embeddings through a state-space model with dual-head attention, producing a scalar energy measuring structural consistency alongside per-position energy scores that localize violations. Multiple independently trained branches detect different violation types and compose at inference without interference. We demonstrate the framework in two domains. In text, the system achieves 93.4% accuracy on trained corruption types and 87.2% on 9 unseen types, using frozen BERT and 7.4M trainable parameters. In vision, the same architecture achieves competitive deepfake detection: 0.959 AUC on FaceForensics++ Deepfakes and 0.870 on Celeb-DF without any Celeb-DF training data, using frozen DINOv2 and 3.6M parameters per branch. The framework supports flexible training: branches learn from designer-specified corruptions, real-world paired data, or both. Composable branches require representation compatibility – a finding validated through extensive experimentation where five incompatible approaches failed before the compatible one succeeded. The architecture is encoder-agnostic and domain-agnostic: changing the domain requires only new corruption strategies; changing the encoder requires only a new input projection layer. To our knowledge, this is the first architecture to learn within-modality structural coherence as an explicit energy landscape with per-position decomposition, and to demonstrate that the same architecture transfers across modalities via corruption respecification alone. Comments: 16 pages, 3 figures, 11 tables. Code: this https URL Weights: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL) Cite as: arXiv:2605.00960 [cs.CV] (or arXiv:2605.00960v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.00960 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-133] DIAGRAMS: A Review Framework for Reasoning -Level Attribution in Diagram QA
【速读】: 该论文旨在解决Diagram QA(图表问答)任务中结构化证据标注效率低下的问题,即如何高效地为每个问题-答案对关联到推理过程中所需的所有视觉区域(而非仅包含最终答案的区域),从而实现推理级归因(reasoning-level attribution)。现有标注工具因与特定数据集格式强耦合而难以复用,导致人工标注成本高。解决方案的关键在于提出DIAGRAMS框架——一个轻量级、基于模式驱动的审查流程,通过内部元模式(meta-schema)和数据集适配器将界面逻辑与数据集特定的JSON结构解耦;该框架能基于问题条件进行证据选择,并在缺乏QA对或候选区域时自动生成建议,支持人工验证与修正,从而显著减少手动区域创建工作量,同时保持与人工最终归因的高度一致性(平均精确率达85.39%,召回率达75.30%)。
链接: https://arxiv.org/abs/2605.00905
作者: Anirudh Iyengar Kaniyar Narayana Iyengar,Tampu Ravi Kumar,Manan Suri,Raviteja Bommireddy,Dinesh Manocha,Puneet Mathur,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); Adobe Research (Adobe 研究院); IIITDM (印度信息技术研究所德干分校); University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 Pages, 4 figures
Abstract:Diagram question answering (Diagram QA) requires reasoning-level attribution that links each question-answer pair to all visual regions needed to derive the answer, rather than only the region containing the final response. Creating such structured evidence across diagrams, charts, maps, circuits, and infographics is time-consuming, and existing annotation tools tightly couple their interfaces to dataset-specific formats. We present DIAGRAMS, a lightweight, schema-driven review framework that decouples interface logic from dataset-specific JSON structures through an internal meta-schema and dataset adapters. Given an image and QA pair with optional candidate regions, the system performs QA-conditioned evidence selection and proposes the regions required for reasoning. When QA pairs or candidate regions are missing, it generates them and supports human verification and refinement. Across six Diagram QA datasets, model-suggested evidence achieves 85.39% precision and 75.30% recall against reviewer-final selections (micro-averaged). These results indicate that the review-first framework reduces manual region creation while maintaining high agreement with final reasoning-level attributions. We release a public demo and installable package to support dataset auditing, grounded supervision creation, and grounded evaluation.
[NLP-134] OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
【速读】: 该论文旨在解决海洋科学领域中人工智能应用受限的问题,其核心瓶颈在于海洋数据的碎片化、多模态特性、高噪声及弱标注等特征导致缺乏统一的数据 schema 和语义对齐,从而制约了多模态大语言模型(Multimodal Large Language Models, MLLMs)在该领域的有效部署。解决方案的关键在于构建 OceanPile——一个面向海洋基础模型的大规模多模态语料库,包含三个核心组件:OceanCorpus(整合声呐数据、水下图像、海洋科学可视化内容与权威文本的统一数据集)、OceanInstruction(基于层级化海洋概念知识图谱生成的高质量指令数据集)以及 OceanBenchmark(人工精心构建的评估基准),并通过多阶段质量控制流程保障跨模态一致性与科学准确性,实验证明该数据集可显著提升模型性能,推动海洋人工智能的发展。
链接: https://arxiv.org/abs/2605.00877
作者: Yida Xue,Ningyu Zhang,Tingwei Wu,Zhe Ma,Daxiong Ji,Zhao Wang,Guozhou Zheng,Huajun Chen
机构: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. ByteDance (字节跳动); 4. National University of Singapore (新加坡国立大学); 5. Peking University (北京大学); 6. Shanghai AI Laboratory (上海人工智能实验室)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Work in progress
Abstract:The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.
[NLP-135] H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行需要层次推理的任务时,其潜在表示中如何几何地编码层次结构这一问题。现有研究虽表明LLMs具备处理多种层次推理任务的能力,但对其内部表征中层次结构的具体存在形式与功能重要性缺乏系统分析。解决方案的关键在于提出了一种名为H-probes的线性探测工具集,能够从模型的隐层表示中提取深度和成对距离等层次结构信息;实验表明,这些探测器可有效识别出低维、因果关键且跨域泛化的子空间,从而揭示模型不仅在语法和概念层面,更在推理过程本身这一抽象层级上编码了层次结构。
链接: https://arxiv.org/abs/2605.00847
作者: Cutter Dawes,Aryan Sharma,Angelos Ioannis Lagos,Shivam Raval
机构: Supervised Program for Alignment Research; Yale University; Harvard University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Representing and navigating hierarchy is a fundamental primitive of reasoning. Large language models have demonstrated proficiency in a wide variety of tasks requiring hierarchical reasoning, but there exists limited analysis on how the models geometrically represent the necessary latent constructions for such thinking. To this end, we develop \textitH-probes, a collection of linear probes that extract hierarchical structure, specifically depth and pairwise distance, from latent representations. In synthetic tree traversal tasks, the H-probes robustly find the subspaces containing hierarchical structure necessary to complete the tasks; furthermore, in comprehensive ablation experiments, we show that these hierarchy-containing subspaces are low-dimensional, causally important for high task performance, and generalize within- and out-of-domain. Furthermore, we find analogous, though weaker, hierarchical structure in real-world hierarchical contexts such as mathematical reasoning traces. These results demonstrate that models represent hierarchy not only at the level of syntax and concepts, but at deeper levels of abstraction – including the reasoning process itself.
[NLP-136] Graph Query Generation with Constraint-guided Large Language Agents ICDE
【速读】: 该论文旨在解决知识图谱问答(Knowledge Graph Question Answering, KGQA)中针对非RDF/SPARQL查询语言(如Cypher)支持不足的问题,尤其是在工业场景下对统一多语言KGQA的需求日益增长的背景下。现有方法大多聚焦于RDF/SPARQL,难以适配属性图(property graph)和实际企业级部署环境。解决方案的关键在于提出UniQGen框架,其核心是基于约束驱动的LLM代理机制:通过改进Chase-Backchase算法,在动态推理过程中结合大语言模型(Large Language Models, LLMs)对查询约束进行交互式优化与质量评估,从而将自然语言问题映射为跨查询语言(如Cypher)的可执行、意图对齐的图查询语句。该方法无需针对特定schema进行微调,具备良好的扩展性,适用于无模式图结构及多样化的查询语义场景。
链接: https://arxiv.org/abs/2605.00845
作者: Mengying Wang,Nicolaas Jedema,Rahul Pandey,RaviKiran Krishnan,Jens Lehmann,Yinghui Wu
机构: Amazon AGI(亚马逊AGI); Stanford University (斯坦福大学)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 42nd IEEE International Conference on Data Engineering (ICDE)
Abstract:Knowledge Graph Question Answering (KGQA) has advanced through structured query generation, yet most efforts target RDF/SPARQL, leaving Cypher and property graphs underexplored, despite increasing demand for unified KGQA in industry settings. We propose UniQGen, a novel constraint-based framework that employs LLM agents to dynamically extract and refine representative graph query clauses into executable, intent-aligned graph queries across query languages. The foundation of our method is a variant of Chase Backchase, a family of algorithms for query optimization and reformulation. We extend Chase Backchase with a dynamic reasoning process over query constraints that also interact with LLMs for query quality estimation. With a Cypher-supported Freebase graph deployed on Amazon Neptune, we extensively evaluate our approach on popular KGQA benchmarks (GraphQ, GrailQA, and WebQSP). We demonstrate that UniQGen outperforms state-of-the-art graph query generation techniques in both accuracy and efficiency, with F1 gains of 31.6% on GraphQ and 4.9% on GrailQA. Unlike prior methods, our framework does not require fine-tuning for schema matching, making it more extensible to schema-less graphs and semantics in query workloads, and is more suitable for enterprise-grade KGQA. We release Cypher outputs and a Neptune-ready Freebase snapshot to support reproducible, cross-language KGQA research.
[NLP-137] Prefix Parsing is Just Parsing ACL2026
【速读】: 该论文旨在解决前缀解析(prefix parsing)问题,即判断一个输入前缀是否可以扩展为由给定文法生成的完整字符串,并在加权场景下计算前缀概率,这对上下文无关语言建模、心理语言学分析及大语言模型的句法约束生成具有重要意义。解决方案的关键在于提出前缀文法变换(prefix grammar transformation),通过构造一个新文法,使其恰好生成原文法所有字符串的前缀,从而将前缀解析问题转化为标准解析问题,无需修改现有解析算法即可直接利用优化实现;同时引入基于算法微分(algorithmic differentiation)的策略高效计算下一词元权重向量(next-token weight vector),实现对所有单标记扩展的前缀权重快速预测,最终构建了一个简洁、通用且高效的前缀解析框架。
链接: https://arxiv.org/abs/2604.21191
作者: Clemente Pasti,Andreas Opedal,Timothy J. O’Donnell,Ryan Cotterell,Tim Vieira
机构: 未知
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注: To appear at ACL 2026
Abstract:Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling, psycholinguistic analysis, and syntactically constrained generation from large language models. We introduce the prefix grammar transformation, an efficient reduction of prefix parsing to ordinary parsing. Given a grammar, our method constructs another grammar that generates exactly the prefixes of its original strings. Prefix parsing is then solved by applying any ordinary parsing algorithm on the transformed grammar without modification. The reduction is both elegant and practical: the transformed grammar is only a small factor larger than the input, and any optimized implementation can be used directly, eliminating the need for bespoke prefix-parsing algorithms. We also present a strategy-based on algorithmic differentiation-for computing the next-token weight vector, i.e., the prefix weights of all one-token extensions, enabling efficient prediction of the next token. Together, these contributions yield a simple, general, and efficient framework for prefix parsing.
[NLP-138] How Well Can We Decode Vowels from Auditory EEG – A Rigorous Cross-Subject Benchmark with Honest Assessment
【速读】: 该论文旨在解决基于脑电图(EEG)的音位解码在跨被试场景下的可重复性与泛化能力问题,尤其针对听觉诱发的元音(a, e, i, o, u)分类任务。其关键解决方案在于构建了一个严格控制泄露(leakage control)的跨被试基准测试框架(leave-one-subject-out评估),并采用16名受试者、61通道、256 Hz采样的OpenNeuro ds006104数据集,系统比较了14种来自经典机器学习、深度学习和黎曼几何方法的解码管道。实验表明,在仅使用训练集标准化且无数据泄露的前提下,XGBoost全特征模型达到24.5%准确率(随机水平为20%),而基于微分熵特征的LightGBM在特定特征分析中提升至25.5%,验证了早期听觉诱发电位(ERP)是元音信息的主要载体,同时揭示了经典方法在低信噪比条件下仍具备与深度模型相当的竞争力。
链接: https://arxiv.org/abs/2605.00865
作者: Xiaoyang Li
机构: Northeastern University (东北大学)
类目: ignal Processing (eess.SP); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Neurons and Cognition (q-bio.NC)
备注: 31 pages, 11 figures; includes supplementary material (14 pages, additional figures and analyses)
Abstract:EEG based phoneme decoding is promising for brain computer interfaces, but many prior studies rely on within subject evaluation, small cohorts, or weak leakage control. We present a reproducible cross subject benchmark for five class vowel decoding (a, e, i, o, u) from auditory EEG using OpenNeuro ds006104 (16 subjects, 61 channels, 256 Hz). Under strict leave one subject out evaluation with training only normalization and explicit anti leakage checks, we compare 14 pipelines from classical machine learning, deep learning, and Riemannian methods. The best full feature model (XGBoost) reaches 24.5 percent accuracy (chance 20 percent), while differential entropy features with LightGBM reach 25.5 percent in feature specific analysis. After multiple comparison correction, strong pairwise model advantages are limited. Classical methods are competitive with deep models in this low signal regime. Additional analyses (ablation, pairwise vowels, within subject CV, ERP, temporal generalization, and electrode importance) indicate that vowel information is real but weak and mainly carried by early transient auditory responses. We release code and evaluation scripts for full reproducibility.
信息检索
[IR-0] AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
【速读】:该论文旨在解决个性化图像修复(Personalized Image Completion)中身份一致性难以保持的问题,即在修复个人照片中被遮挡区域时,如何确保恢复内容与原图人物身份和外观一致。现有方法要么依赖通用修复模型导致身份失真,要么假设参考图像已明确提供,这在实际场景中往往不成立。解决方案的关键在于提出一个无需训练的框架AlbumFill,其核心是利用视觉-语言模型(Vision-Language Model)从个人相册中自动检索身份一致的参考图像:通过推理缺失语义信息来指导组合式图像检索,并将检索到的参考图像用于基于参考的修复模型。该方案显著提升了个性化修复的质量,验证了身份一致参考检索的重要性。
链接: https://arxiv.org/abs/2605.02892
作者: Yu-Ju Tsai,Brian Price,Qing Liu,Luis Figueroa,Daniil Pakhomov,Zhihong Ding,Scott Cohen,Ming-Hsuan Yang
机构: University of California, Merced (加州大学默塞德分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Project Page: this https URL
Abstract:Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections. We present AlbumFill, a training-free framework that retrieves identity-consistent references from personal albums for personalized completion. Given an occluded image and a personal album, a vision-language model infers missing semantic cues to guide composed image retrieval, and the retrieved references are used by reference-based completion models. To facilitate this task, we introduce a dataset containing 54K human-centric samples with associated album images. Experiments across multiple baselines demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval. Project Page: this https URL
[IR-1] Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study ALT
【速读】:该论文旨在解决在生物医学这一高风险领域中,哪种检索策略能最有效地提升生成式 AI(Generative AI)问答系统性能的问题。其核心问题是:在固定生成模型、向量存储和嵌入编码的前提下,不同检索策略对最终答案质量的影响是否具有可量化差异,以及是否存在最优方案。解决方案的关键在于构建一个受控的检索增强生成(Retrieval-Augmented Generation, RAG)实验框架,系统比较五种主流检索策略——密集向量搜索、混合 BM25 + 密集检索、交叉编码重排序(Cross-Encoder Reranking)、多查询扩展(Multi-Query Expansion)和最大边际相关性(Maximal Marginal Relevance, MMR)——并采用四个 DeepEval 评估指标(上下文精度、上下文召回率、忠实度和答案相关性)进行多维对比。结果显示,交叉编码重排序在复合得分(0.827)和上下文精度(0.852)上表现最佳,验证了查询-文档交互机制在提升检索质量上的有效性,而其他策略则暴露出如噪声引入或权衡失衡等问题,从而为生物医学场景下的 RAG 系统设计提供了实证依据。
链接: https://arxiv.org/abs/2605.02520
作者: Devi Prasad Bal,Subhashree Puhan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 15 pages, 4 figures, 2 tables. Code and data: this https URL Also archived at Zenodo: this https URL
Abstract:Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it deserves. This paper presents a systematic empirical comparison of five retrieval strategies – Dense Vector Search, Hybrid BM25 + Dense retrieval, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR) – within a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI’s text-embedding-3-small embeddings, ensuring that observed differences are attributable to retrieval alone. Evaluation is conducted on 250 question-answer pairs drawn from a preprocessed subset of the BioASQ benchmark (rag-mini-bioasq) using four DeepEval metrics: contextual precision, contextual recall, faithfulness, and answer relevancy, each reported with 95% confidence intervals. A no-context ablation is included as a lower bound. Cross-Encoder Reranking achieves the best composite score (0.827) and highest contextual precision (0.852), confirming that query-document interaction yields measurable retrieval gains. Multi-Query Expansion, despite its recall-oriented design, produces the weakest contextual precision (0.671), suggesting naive query diversification introduces retrieval noise. MMR sacrifices answer relevancy for diversity, while the Dense baseline (composite 0.822) falls within 0.005 points of the top strategy. All RAG conditions dramatically outperform the no-context ablation on answer relevancy (0.658-0.701 vs. 0.287), confirming the practical value of retrieval. The full pipeline, hyperparameters, and evaluation code are publicly available.
[IR-2] GRAIL: A Deep-Granularity Hybrid Resonance Framework for Real-Time Agent Discovery via SLM-Enhanced Indexing
【速读】:该论文旨在解决大规模多智能体协作中高效且准确的智能体发现(Agent Discovery)问题,当前方法面临两大瓶颈:一是依赖重型大语言模型(Large Language Model, LLM)进行意图解析导致延迟过高(常超过30秒),二是采用单一向量检索策略在速度与语义精度之间难以平衡。其解决方案的关键在于提出GRAIL框架,通过三项核心创新实现亚400毫秒级发现延迟而不牺牲准确性:(1) 使用微调的小语言模型(Small Language Model, SLM)替代通用LLM进行能力标签预测,实现毫秒级响应;(2) 引入伪文档扩展(Pseudo-Document Expansion)技术,通过合成查询增强智能体描述的语义密度,提升密集检索鲁棒性;(3) 设计最大相似度共振机制(MaxSim Resonance),基于用户查询与离散智能体使用案例之间的最大相似性计算,有效缓解语义稀释问题。
链接: https://arxiv.org/abs/2605.02489
作者: Jinliang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 8 pages, 5 figures
Abstract:As the ecosystem of Large Language Model (LLM)-based agents expands rapidly, efficient and accurate Agent Discovery becomes a critical bottleneck for large-scale multi-agent collaboration. Existing approaches typically face a dichotomy: either relying on heavy-weight LLMs for intent parsing, leading to prohibitive latency (often exceeding 30 seconds), or using monolithic vector retrieval that sacrifices semantic precision for speed. To bridge this gap, we propose \textbfGRAIL (Granular Resonance-based Agent/AI Link), a novel framework achieving sub-400ms discovery latency without compromising accuracy. GRAIL introduces three key innovations: (1) \textbfSLM-Enhanced Prediction, replacing the generalized LLM parser with a specialized, fine-tuned Small Language Model (SLM) for millisecond-level capability tag prediction; (2) \textbfPseudo-Document Expansion, augmenting agent descriptions with synthetic queries to enhance semantic density for robust dense retrieval; and (3) \textbfMaxSim Resonance, a fine-grained matching mechanism computing maximum similarity between user queries and discrete agent usage examples, effectively mitigating semantic dilution. Validated on \textbfAgentTaxo-9K, our new large-scale dataset of 9,240 agents, GRAIL reduces end-to-end discovery latency by over \textbf79 \times compared to LLM-parsing baselines, while significantly outperforming traditional vector search in Recall@10. This framework offers a scalable, industrial-grade solution for the real-time ``Internet of Agents."
[IR-3] Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval SIGIR2026
【速读】:该论文旨在解决专利新颖性评估(novelty assessment)任务中存在的粗粒度与潜在虚假相关性问题。传统方法将新颖性判断视为一个 claim-level 的二分类任务,难以捕捉具体技术特征与现有技术(prior art)之间的细粒度匹配关系,且易受训练数据中虚假相关性的干扰。其解决方案的关键在于提出 FiNE-Patents 数据集,该数据集包含 3,658 条首次专利权利要求,并标注了来自欧洲检索意见(ESOP)文档的细粒度特征级现有技术引用;同时,引入一种基于大语言模型(LLM)的工作流,将权利要求分解为技术特征,逐个分析其在现有技术中的披露情况,并最终进行 claim-level 的新颖性推理。此方法实现了从粗粒度二分类到细粒度检索与抽象推理相结合的新范式,显著提升了对关键创新特征的识别能力,并展现出对虚假相关性的鲁棒性。
链接: https://arxiv.org/abs/2605.02392
作者: Valentin Knappich,Anna Hätty,Simon Razniewski,Annemarie Friedrich
机构: Bosch Center for AI(博世人工智能中心); University of Augsburg(奥格斯堡大学); ScaDS.AI TU Dresden(德累斯顿工业大学ScaDS.AI); Stuttgart(斯图加特); Dresden(德累斯顿)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to SIGIR 2026 this https URL
Abstract:Novelty assessment is a critical yet complex task in the examination process for patent acceptance, requiring examiners to determine whether an invention is disclosed in a prior art document. The process involves intricate matching between specific features of a patent claim and passages in the prior art. While prior work has approached novelty prediction primarily as a binary classification task at the claim level, we argue that this formulation is susceptible to spurious correlations and lacks the granularity required for practical application. In this work, we introduce FiNE-Patents (Fine-grained Novelty Examination of Patents), a novel dataset comprising 3,658 first patent claims annotated with fine-grained, feature-level prior art references extracted from European Search Opinion (ESOP) documents. We propose shifting the evaluation paradigm from simple binary classification to a joint retrieval and abstract reasoning task at the feature level, requiring models to identify specific passages from a prior art document that disclose individual claim features, and to identify which features of a claim make it novel. We implement and evaluate LLM-based workflows that decompose claims into features, analyze each feature against prior art, and finally derive a claim-level novelty prediction. Our experiments demonstrate that these workflows outperform embedding-based baselines on passage retrieval and novel feature identification. Furthermore, we show that unlike trained classifiers, LLMs are robust against spurious correlations present in the claim-level novelty classification task. We release the dataset and code to foster further research into transparent and granular patent analysis.
[IR-4] Fair Agents : Balancing Multistakeholder Alignment in Multi-Agent Personalization Systems
【速读】:该论文旨在解决多利益相关者(multistakeholder)环境下个性化系统中因目标冲突而导致的公平性问题,尤其是在使用大语言模型(LLM)代理进行自主优化时,如何确保各利益相关者的诉求被合理对齐并转化为可量化的代理目标,同时通过公平的聚合策略形成集体决策。其解决方案的关键在于提出一个概念框架,该框架包含三个核心组成部分:(i) 利益相关者目标与LLM代理的对齐方法,(ii) 基于社会选择理论的代理输出聚合策略以实现公平的集体决策,以及 (iii) 以利益相关者为中心的评估机制,用于衡量个体和整体代理行为的公平性表现。
链接: https://arxiv.org/abs/2605.02379
作者: Andrea Forster,Peter Müllner,Denis Helic,Elisabeth Lex,Dominik Kowald
机构: Fair-AI, Know Center Research GmbH (Fair-AI, 知识中心研究有限公司); Graz University of Technology (格拉茨工业大学); University of Graz (格拉茨大学)
类目: Information Retrieval (cs.IR)
备注: Accepted for publication in the Joint Proceedings of the UMAP 2026 Workshops (LLM4Good)
Abstract:LLM agents are increasingly used for personalization due to their ability to communicate directly with users in natural language, integrate external knowledge bases, and negotiate with other (possibly human) agents. Especially in multistakeholder AI systems with multiple distinct objectives, LLM agents are used to independently optimize for each stakeholder’s goals. Here, stakeholder alignment is essential to identify and map these goals to provide LLM agents with quantifiable objectives. Plus, the way in which the outputs of the LLM agents are aggregated is fundamental to ensuring fair outcomes for all agents and, therefore, stakeholders. In this work, we identify open research challenges and propose a conceptual framework for designing fair multi-agent multistakeholder personalization systems that balance competing stakeholder objectives. Our framework integrates (i) methods to align stakeholder objectives and LLM agents, (ii) aggregation strategies, e.g., based on social choice theory, to form fair collective decisions, and (iii) stakeholder-centric evaluation procedures for both individual and collective agent behavior. We showcase our framework through a tourism use case and discuss possible applications in other domains, such as education and healthcare. Finally, we discuss domain-specific fairness tensions and review datasets for evaluating multistakeholder fairness and multi-agent personalization systems.
[IR-5] Bridging Behavior and Semantics for Time-aware Cross-Domain Sequential Recommendation
【速读】:该论文旨在解决跨域序列推荐(Cross-domain Sequential Recommendation, CDSR)中两个关键问题:一是忽略不同领域间相同时间间隔内的交互频率差异和兴趣衰减率差异;二是将语义偏好视为时间不变量,导致跨域迁移时无法准确捕捉时间敏感的语义特征。解决方案的关键在于提出一个名为BST-CDSR的框架,其核心创新包括:(1) 设计行为偏好演化模块,通过神经常微分方程(Neural Ordinary Differential Equation, ODE)建模连续时间偏好,并以事件驱动方式更新,实现长期兴趣与短期意图的解耦;(2) 引入时间感知的反事实增强语义生成器,利用大语言模型(Large Language Models, LLMs)提取鲁棒的时间语义信息,并通过反事实扰动提升语义偏好的时间敏感性;(3) 提出时间偏好引导的域迁移模块,自适应调节迁移权重以缓解负迁移问题。
链接: https://arxiv.org/abs/2605.02369
作者: Zhida Qin,Zemu Liu,Haoyan Fu,Chong Zhang,Tianyu Huang,Yidong Li,Gangyi Ding
机构: Beijing Institute of Technology (北京理工大学); Xi’an Jiaotong University (西安交通大学); Beijing Jiaotong University (北京交通大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Cross-domain sequential recommendation (CDSR) alleviates interaction sparsity by jointly modeling user behaviors across multiple domains. While current studies have made some progresses, they still neglect two issues that severely impact recommendation performance: (i) ignoring domain-specific interaction frequencies and interest decay rates at identical time intervals; (ii) treating semantic preferences as time-invariant during cross-domain transfer. To address these, we propose a novel framework that bridges Behavior and Semantics for Time-aware Cross-Domain Sequential Recommendation (BST-CDSR). Specifically, we design a behavioral preference evolution module that decouples long-term interests and short-term intentions, and models continuous-time preference via a neural ordinary differential equation (ODE) with event-driven updates. Additionally, to capture time-aware semantic preferences, we introduce a temporal counterfactual-enhanced semantic generator that discretizes temporal interval tokens and leverages large language models (LLMs) to extract robust temporal semantics, where counterfactual perturbations enhance the time sensitivity of semantic preferences. Furthermore, we propose a time-preference guided domain transfer module to adaptively control transfer weights and mitigate negative transfer. Extensive experiments on real-world datasets demonstrate that BST-CDSR consistently outperforms baselines.
[IR-6] Enhancing Judgment Document Generation via Agent ic Legal Information Collection and Rubric-Guided Optimization
【速读】:该论文旨在解决生成式 AI (Generative AI) 在司法场景中自动撰写裁判文书时面临的两大核心挑战:一是法律信息检索不充分导致的证据召回不足,二是由于缺乏严谨逻辑推理而产生的法规引用幻觉和推理错误。解决方案的关键在于提出 Judge-R1 统一框架,通过两个创新模块协同优化:其一为“代理式法律信息收集”(Agentic Legal Information Collection),利用动态规划代理从多源数据库中精准检索相关法条与判例;其二为“基于评分标准的优化机制”(Rubric-Guided Optimization),采用 Group Relative Policy Optimization (GRPO) 结合全面的法律奖励函数,强化模型对司法规范和逻辑结构的遵循,从而显著提升裁判文书生成的法律准确性和逻辑合理性。
链接: https://arxiv.org/abs/2605.02011
作者: Weihang Su,Xuanyi Chen,Yueyue Wu,Qingyao Ai,Yiqun Liu
机构: Tsinghua University (清华大学); Quan Cheng Laboratory (全成实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Automating the drafting of judgment documents is pivotal to judicial efficiency, yet it remains challenging due to the dual requirements of comprehensive retrieval of legal information and rigorous logical reasoning. Existing approaches, typically relying on standard Retrieval-Augmented Generation and Supervised Fine-Tuning, often suffer from insufficient evidence recall, hallucinated statutory references, and logically flawed legal reasoning. To bridge this gap, we propose Judge-R1, a unified framework designed to enhance LLM-based judgment document generation by jointly improving legal information collection and judgment document generation. First, we introduce Agentic Legal Information Collection, which employs a dynamic planning agent to retrieve precise statutes and precedents from multiple sources. Second, we implement Rubric-Guided Optimization, a reinforcement learning phase utilizing Group Relative Policy Optimization (GRPO) with a comprehensive legal reward function to enforce adherence to judicial standards and reasoning logic. Extensive experiments on the JuDGE benchmark demonstrate that Judge-R1 significantly outperforms state-of-the-art baselines in both legal accuracy and generation quality.
[IR-7] CyberAId: AI-Driven Cybersecurity for Financial Service Providers
【速读】:该论文旨在解决欧洲金融机构在网络安全运营中面临的“推理能力瓶颈”问题,即传统安全信息与事件管理系统(SIEM)仅覆盖部分MITRE ATT&CK技术、安全运营中心(SOC)团队难以应对警报洪峰、且多数已生成但未被调查的警报最终导致重大 breaches。现有大语言模型(LLM)虽在单一安全任务上表现卓越,却缺乏跨功能编排、多租户状态持久化、合规映射和审计生存能力。解决方案的关键在于构建一个混合多智能体系统(hybrid multi-agent system),其中专业化 LLM 子智能体基于经典 SIEM/XDR 数据进行推理,通过隐私保护联邦机制共享跨机构智能体状态,并集成量子认证、数字孪生等互补能力模块;平台 CyberAId 以主代理协调层为核心,在受限人机协同自主性下运行,围绕四个可证伪的设计原则组织架构,实现对金融场景的可验证防御增强与持续进化。
链接: https://arxiv.org/abs/2605.01892
作者: George Fatouros,Georgios Makridis,John Soldatos,Dimosthenis Kyriazis,Pedro Malo,George Kousiouris,Giannis Ledakis,Louiza Kachrimani,Panagiotis Rizomiliotis,Bruno Almeida,Despina Tomkou,Kostas Metaxas,Konstantinos Ilias,Christos Gkizelis,Ernstjan de Gooyert,Amin Babazadeh,Kostis Mavrogiorgos,Pepi Paraskevoulakou,Christos Xenakis,Giannis Chouchoulis,Konstantina Tripodi
机构: University of Athens (雅典大学); University of Piraeus (比雷埃夫斯大学); National and Kapodistrian University of Athens (国家和卡波迪斯特里安大学雅典); University of Patras (帕特雷大学); University of Lisbon (里斯本大学); University of Thessaly (塞萨洛尼基大学); University of Macedonia (马其顿大学); University of Nicosia (尼科西亚大学); Aristotle University of Thessaloniki (亚里士多德大学塞萨洛尼基); University of Cyprus (塞浦路斯大学)
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: 8 pages, 3 figures
Abstract:European financial institutions face mounting regulatory pressure while their security operations centres remain constrained not by data or staffing but by reasoning capacity: enterprise SIEMs cover only a fraction of MITRE ATTCK techniques, two thirds of SOC teams cannot keep pace with alert volumes, and the majority of breaches are preceded by alerts that are generated but never investigated. Frontier large language models now achieve state-of-the-art results on isolated cybersecurity tasks (one-day vulnerability exploitation, code-level patching, intrusion detection) yet no narrow win constitutes a platform that can compose across functions, persist multi-tenant state, map findings to regulatory regimes and survive an audit. This position paper argues that the right unit of construction is a hybrid multi-agent system in which specialised LLM subagents reason over classical SIEM/XDR telemetry rather than replacing it, share accumulated agent state across institutions through privacy-preserving federation, and can connect to complementary capability packs such as quantum-based authentication, digital twins for adversarial validation, and eBPF-based kernel telemetry. We present CyberAId, a model-agnostic, on-premise-deployable platform in which a Main Agent coordination layer, a Reporting capability, and specialist subagents operate within a shared runtime under bounded human-in-the-loop autonomy, organised around four falsifiable design principles, and aligned with relevant regulations. CyberAId will be validated at four representative financial use cases (client impersonation, anti-money-laundering for payment service providers, retail-banking incident response, and high-frequency-trading resilience) and propose skill-based agent adaptation as the most promising research direction for turning each deployment into a contribution to a continuously refined collective defence.
[IR-8] FEDIN: Frequency-Enhanced Deep Interest Network for Click-Through Rate Prediction SIGIR2026
【速读】:该论文旨在解决顺序推荐模型难以捕捉用户兴趣中潜在周期性模式的问题,这主要是由于时间域行为数据中固有的噪声干扰所致。其解决方案的关键在于提出了一种新颖的实证观察:当目标物品为正样本(即用户真实感兴趣的)时,用户注意力得分在频域呈现出低熵、高度集中的谱分布;而负样本则表现为高熵噪声。基于此,作者设计了频域增强的深度兴趣网络(Frequency-Enhanced Deep Interest Network, FEDIN),引入一个频域分支,并采用目标感知的谱滤波机制来分离出这些周期性兴趣信号,从而显著提升模型对噪声的鲁棒性与推荐性能。
链接: https://arxiv.org/abs/2605.01726
作者: Zenan Dai,Jinpeng Wang,Junwei Pan,Dapeng Liu,Lei Xiao,Shu-Tao Xia
机构: Tsinghua University (清华大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳分校); Tencent (腾讯)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by SIGIR 2026. 6 pages, 4 figures, 2 tables
Abstract:Sequential recommendation models often struggle to capture latent periodic patterns in user interests, primarily due to the noise inherent in time-domain behavioral data. While frequency-domain analysis offers a global perspective to address this, existing approaches typically treat user sequences in isolation, overlooking the crucial context of the target item. In this work, we present a novel empirical observation: user attention scores exhibit distinct spectral entropy distributions when conditioned on positive versus negative target items. Specifically, true user interests manifest as highly concentrated spectral patterns with lower entropy in the frequency domain, whereas irrelevant behaviors appear as high-entropy noise. Leveraging this insight, we propose the Frequency-Enhanced Deep Interest Network (FEDIN). FEDIN introduces a frequency-domain branch that utilizes a target-aware spectrum filtering mechanism to isolate these periodic interest signals. Extensive experiments on three public datasets demonstrate that FEDIN consistently outperforms state-of-the-art sequential recommendation baselines, demonstrating superior robustness against noise. We have released our code at: this https URL.
[IR-9] A Hybrid Retrieval and Reranking Framework for Evidence-Grounded Retrieval-Augmented Generation
【速读】:该论文旨在解决生成式 AI(Generative AI)在生物医学和医疗相关问答任务中因缺乏可靠证据支撑而导致的回答可信度不足的问题。其核心挑战在于如何确保生成的答案基于准确、相关的外部文档证据,并能有效验证每个事实性陈述是否被源文档支持。解决方案的关键在于构建一个融合混合检索(hybrid retrieval)、Cohere 重排序(reranking)、保守提示(conservative prompting)与逐条事实核查(claim-level evaluation)的端到端框架:首先利用 Amazon Bedrock Knowledge Bases 进行文档处理与嵌入索引,通过混合检索获取候选证据块;随后采用 Cohere 模型对候选块进行重排序以提升相关性;在答案生成阶段仅使用排名靠前的证据块来控制输出内容;最后引入独立判别模型对每条生成的事实性陈述进行逐项验证,从而实现高达 100.0% 的事实支撑准确率,显著提升了 RAG 系统的可靠性与可解释性。
链接: https://arxiv.org/abs/2605.01664
作者: Fariba Afrin Irany,Sampson Akwafuo
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-augmented generation (RAG) improves large language model reliability by grounding generated responses in external evidence. However, RAG performance depends on the relevance of retrieved passages, the quality of evidence ranking, and the ability to verify whether generated claims are supported by source documents. This study presents a hybrid retrieval and reranking framework for citation-aware RAG in biomedical and healthcare-related document question answering. The framework uses Amazon Bedrock Knowledge Bases for document ingestion, parsing, chunking, embedding generation, and evidence retrieval. Source PDF documents are stored in Amazon S3, embedded using Amazon Titan Text Embeddings V2, and indexed with Amazon OpenSearch Serverless. Hybrid retrieval first retrieves candidate evidence chunks, and Cohere reranking then prioritizes the most relevant passages before answer generation. The answer-generation stage uses top-ranked evidence chunks to produce controlled, evidence-grounded responses, while a separate judge model evaluates each generated factual claim against the retrieved evidence. The framework was evaluated using 25 biomedical NLP and healthcare transformer queries as a pilot-scale proof-of-concept study. Across the evaluation set, the system retrieved and reranked 500 evidence chunks and generated answers from top-ranked evidence. Claim-level grounding evaluation extracted 200 factual claims, all of which were judged to be supported by retrieved evidence, resulting in 100.0% grounding accuracy. The results suggest that hybrid retrieval, reranking, conservative prompting, and claim-level evaluation can support reliable evidence-grounded RAG responses when sufficient source evidence is available.
[IR-10] Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
【速读】:该论文旨在解决神经排序模型(Neural Ranking Models, NRMs)在现代信息检索系统中对对抗性攻击的高度脆弱性问题。现有攻击方法多依赖启发式策略或替代模型,导致效果有限且迁移能力弱。其解决方案的关键在于提出一种名为CRAFT的监督式黑盒对抗排序攻击框架,该框架利用大语言模型(Large Language Models, LLMs)实现三阶段流程:首先通过检索增强生成与自-refinement生成对抗数据集;其次在人工筛选的对抗样本上进行监督微调;最后采用偏好引导优化以对齐生成内容与排名提升目标。此方法显著提升了攻击的有效性和跨不同排序架构(如交叉编码器、嵌入式模型及LLM-based排序器)的迁移能力,揭示了生成式AI在排名操纵中的潜在风险,并为构建更鲁棒的检索系统提供了基础。
链接: https://arxiv.org/abs/2605.01591
作者: Amin Bigdeli,Amir Khosrojerdi,Radin Hamidi Rad,Morteza Zihayat,Charles L. A. Clarke,Ebrahim Bagheri
机构: University of Waterloo(滑铁卢大学); University of Toronto(多伦多大学); Mila – Quebec AI Institute(魁北克AI研究所); Toronto Metropolitan University(多伦多都会大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Neural Ranking Models (NRMs) are central to modern information retrieval but remain highly vulnerable to adversarial manipulation. Existing attacks often rely on heuristics or surrogate models, limiting effectiveness and transferability. We propose CRAFT, a supervised framework for black-box adversarial rank attacks powered by large language models (LLMs). CRAFT operates in three stages: adversarial dataset generation via retrieval-augmented generation and self-refinement, supervised fine-tuning on curated adversarial examples, and preference-guided optimization to align generations with rank-promotion objectives. Extensive experiments on the MS MARCO passage dataset, TREC Deep Learning 2019, and TREC Deep Learning 2020 benchmarks show that CRAFT significantly outperforms state-of-the-art baselines, achieving higher promotion rates and rank boosts while preserving fluency and semantic fidelity. Moreover, CRAFT transfers effectively across diverse ranking architectures, including cross-encoder, embedding-based, and LLM-based rankers, underscoring vulnerabilities in real-world retrieval systems. This work provides a principled framework for studying adversarial threats in NRMs, underscores the risks of generative AI in rank manipulation, and provides a foundation for developing more robust retrieval systems. To support reproducibility, we publicly release our source code, trained models, and prompt templates.
[IR-11] KG-First LLM -Fallback: A Hybrid Microservice for Grounded Skill Search and Explanation
【速读】:该论文旨在解决权威能力框架(如ESCO、ROME和O*NET)因技术复杂性和结构异质性而导致教育工作者难以实际应用的问题。其核心解决方案是提出SkillGraph-Service,一个基于知识图谱(Knowledge Graph, KG)的互操作微服务架构,通过统一多源技能数据并保留溯源信息,实现教育与劳动力市场需求的有效对齐。该方案的关键在于采用“KG-first, LLM-fallback”设计:以轻量级混合检索引擎(融合SQLite FTS5与HNSW向量搜索)应对查询词汇不匹配问题,同时严格限定大语言模型(Large Language Models, LLMs)仅用于受限排序和面向受众的解释生成,从而在保证高效性(<200 ms延迟)和有效性(nDCG@50.94)的同时,兼顾可解释性与可审计性。
链接: https://arxiv.org/abs/2605.01582
作者: Ngoc Luyen Le,Marie-Hélène Abel,Bertrand Laforge
机构: UTC (Université de Technologie de Compiègne); IN2P3 (Institut National de Physique Nucléaire et de Physique des Particules)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Authoritative competency frameworks such as ESCO, ROME, and O*NET are essential for aligning education with labor market needs, yet their technical complexity and structural heterogeneity hinder practical adoption by educators. This paper introduces SkillGraph-Service, an interoperable microservice designed to bridge this gap by unifying these resources into a provenance-preserving Knowledge Graph (KG). Adopting a KG-first, LLM-fallback architecture, the system combines symbolic rigor with sub-symbolic flexibility. It implements a lightweight hybrid retrieval engine (fusing SQLite FTS5 and HNSW vector search) to handle the vocabulary mismatch in educator queries, and utilizes Large Language Models (LLMs) strictly for constrained ranking and audience-aware explanation. Empirical evaluation on a multilingual dataset reveals that the proposed hybrid strategy achieves superior retrieval effectiveness (nDCG@50.94) with sub-200 ms latency, rendering computationally expensive cross-encoder re-ranking may be unnecessary for this domain. Furthermore, an analysis of generated explanations highlights a trade-off between fluency and faithfulness: while JSON-constrained LLMs ensure high citation precision, deterministic templates remain the most reliable method for maximizing evidence coverage. The resulting architecture offers a practical, scalable, and auditable solution for integrating complex skill data into digital learning ecosystems.
[IR-12] Post-hoc Provider Fairness Adaptation via Hierarchical Exposure Alignment
【速读】:该论文旨在解决推荐系统中提供者曝光公平性(provider exposure fairness)的难题,即如何在不重新训练主模型的前提下实现灵活、可适应不同公平性需求的曝光控制,同时避免因全局公平目标忽视群体间与组内差异而导致的虚假公平现象。其核心解决方案是提出后处理公平性适配(Post-hoc Fairness Adaptation, PFA)框架,通过引入一个轻量级的“公平性适配器”(fairness adapter),基于用户-物品嵌入学习个性化的加性评分调整项,并将其注入原始排序分数以引导提供者的曝光分布趋向公平;进一步地,为应对结构性不公平问题,设计了分层曝光公平对齐(Hierarchical Exposure Fairness Alignment, HEFA)机制,显式平衡组间与组内提供者的曝光差异,并联合优化可微NDCG损失以维持推荐质量,从而实现高效且鲁棒的公平性调控。
链接: https://arxiv.org/abs/2605.01524
作者: Jingzhi Li,Zhiyong Cheng,Richang Hong,Meng Wang
机构: Hefei University of Technology (合肥工业大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Provider exposure fairness is crucial for sustaining a healthy content ecosystem and preventing monopolization in recommender systems. Yet, most existing methods either incorporate fairness constraints during model training, requiring expensive retraining when fairness objectives change, or rely on post-hoc reranking with fixed criteria, which lacks adaptability to diverse fairness requirements. To overcome these limitations, we propose Post-hoc Fairness Adaptation (PFA), a lightweight framework that equips a frozen recommender with a fairness adapter, enabling flexible fairness control without retraining the backbone model. Specifically, the fairness adapter learns personalized additive score adjustments from user-item embeddings, which are injected into the original ranking scores to steer provider exposure toward fairness. To train the adapter, we minimize the KL divergence between the actual and the target fair exposure distributions. However, this global objective implicitly treats all providers equally, ignoring structural disparities such as imbalanced provider group sizes and heterogeneous exposure within groups. Consequently, fairness may appear satisfied at an aggregate level while severe inter-group and intra-group exposure imbalances persist, undermining practical fairness. To address this, we design Hierarchical Exposure Fairness Alignment (HEFA), which explicitly balances inter- and intra-group provider exposure disparities, enabling flexible adaptation to diverse fairness requirements. To mitigate potential accuracy degradation, PFA jointly optimizes HEFA with a differentiable NDCG loss, enabling end-to-end fairness optimization while preserving ranking quality. Extensive experiments on three public datasets demonstrate that PFA achieves substantial fairness gains with negligible accuracy loss, consistently outperforming strong baselines.
[IR-13] Interactive Multi-Turn Retrieval for Health Videos
【速读】:该论文旨在解决健康视频检索系统在临床训练、患者康复和健康教育场景中因用户查询往往初始模糊、需通过多轮交互逐步明确约束条件(如姿势、手部位置、禁忌症、设备或患者状态)而带来的检索效果不佳问题。现有单轮检索系统难以满足此类复杂信息需求,导致交互脆弱且实用性受限。解决方案的关键在于提出一种对话感知的两阶段检索框架(Dialogue-Aware Two-Stage Retrieval, DATR),其核心包括:第一阶段采用基于CLIP风格的双编码器与稀疏帧采样实现高效粗筛;第二阶段通过多轮查询融合与轻量级交叉编码器对候选视频进行重排序,从而精准捕捉细粒度的程序性语义。该方法在自建的多轮健康视频检索语料库MHVRC上验证有效,显著优于主流文本-视频检索基线,并通过用户研究证明多轮查询能更准确地表达临床意图。
链接: https://arxiv.org/abs/2605.01409
作者: Chengzheng Wu,Ke Qiu,Baoming Zhang,Ruiyu Mao,Xulong Tang,Kaixing Yang
机构: Wuxi Dipont School of Arts and Science (无锡狄邦学校艺术与科学学院); Malou Tech (马洛科技); Chongqing University (重庆大学); Case Western Reserve University (凯斯西储大学); Renmin University of China (中国人民大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:The growing availability of health-related instructional videos creates new opportunities for clinical training, patient rehabilitation, and health education, yet existing retrieval systems remain largely single-turn: a user submits one query and receives one ranked list. This interaction is brittle in health scenarios, where information needs are often vague at first and become clinically meaningful only after follow-up constraints such as posture, hand placement, contraindications, equipment, or patient condition are specified. We introduce interactive multi-turn semantic retrieval for health videos and construct MHVRC, a Multi-Turn Health Video Retrieval Corpus, by combining video-grounded descriptions from VideoChat-Flash with query refinements generated by DeepSeek. We further propose DATR, a Dialogue-Aware Two-Stage Retrieval framework. DATR first performs efficient coarse retrieval with a CLIP-style dual encoder and sparse frame sampling, then re-ranks the top candidates through multi-turn query fusion and a lightweight cross-encoder scoring module. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, while user studies indicate that refined multi-turn queries better capture fine-grained procedural semantics than single-turn annotations. The work establishes a benchmark and a scalable technical recipe for interactive health video retrieval.
[IR-14] he Pre-Training Study of Expanded-SPLADE Models on Web Document Titles
【速读】:该论文旨在解决掩码语言建模(Masked Language Modeling, MLM)预训练模型在微调为神经双编码器(Neural Bi-Encoder)信息检索模型时存在的适应性不足与迁移学习效果不佳的问题。其核心解决方案在于系统性地评估不同预训练语料库和预训练策略对ESPLADE类稀疏表示模型(Expanded-SPLADE, ESPLADE)在检索微调中的影响,发现:使用通用语料库进行预训练并采用较高学习率(导致较低MLM精度)可显著提升微调后模型的检索有效性;同时,在严格稀疏剪枝条件下,较高的检索成本反而有助于维持更好的检索性能,揭示出检索效率与效果之间的权衡关系。这一发现为优化MLM预训练与特定检索架构(如SPLADE)间的对齐提供了实证依据。
链接: https://arxiv.org/abs/2605.01407
作者: Hiun Kim,Tae Kwan Lee,Taeryun Won
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning. However, studies show that MLM pre-trained models have limited readiness and transfer learning issues for fine-tuning them into Neural Bi-Encoder models. This paper studies the effect of different pre-training datasets and pre-training options on the MLM pre-trained models for retrieval fine-tuning. The study focuses on the SPLADE-style model, which uses the MLM layer also at fine-tuning time. More specifically, we experimented with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, and in-house web document titles are used as datasets. Pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors are conducted. Our observations are three-fold: First, fine-tuned models of higher retrieval effectiveness at both unpruned and most strict pruned settings are mostly pre-trained on a general corpus, and pre-trained with a higher learning rate, showing lower MLM accuracies. Second, in the most strict pruned setting, those models show higher-level retrieval cost and a higher variance in the length of the individual postings list. Third, the repetition of the general pre-training dataset does not have much effect on retrieval effectiveness. The experimentation empirically identifies the potential limitations for aligning MLM pre-training to ESPLADE fine-tuning. Also, the experimentation provides an empirical observation that, at most strict pruned settings, the retrieval effectiveness is better maintained by the higher-level retrieval cost, showing the trade-off relationship between the two in our setting. Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL) Cite as: arXiv:2605.01407 [cs.IR] (or arXiv:2605.01407v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2605.01407 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-15] Investigating the Effects of Different Levels of User Control in an Interactive Educational Recommender System
【速读】:该论文旨在解决教育推荐系统(Educational Recommender Systems, ERSs)中用户控制水平对学习者感知体验影响不明确的问题,尤其是不同层级控制(输入、处理、输出)如何作用于透明度、信任、满意度及感知质量等关键目标。其解决方案的关键在于设计并评估了一个交互式ERS——CourseMapper,允许学习者在输入层(用户画像构建与调整)、处理层(推荐算法逻辑理解)和输出层(推荐结果反馈)进行多维度干预,并通过一项包含184名参与者的被试间实验发现:仅允许用户构建和优化个人画像即可显著提升对系统的感知控制感,进而正向影响透明度、信任、满意度和感知质量;而额外的控制选项主要起到强化作用,其中输入控制对感知控制的影响最为显著。
链接: https://arxiv.org/abs/2605.01400
作者: Qurat Ul Ain,Mohamed Amine Chatti,William Kana Tsoplefack,Rawaa Alatrash,Shoeb Joarder
机构: University of Duisburg-Essen (杜伊斯堡-埃森大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: Submitted to TORS. arXiv admin note: text overlap with arXiv:2501.12894
Abstract:Educational recommender systems (ERSs) are becoming increasingly important in enhancing educational outcomes and personalizing learning experiences by providing recommendations of personalized resources and activities to learners, tailored to their individual learning needs. While user control is widely assumed to improve user experience, the effects of different levels of control in ERSs remain underexplored. To address this gap, we designed and evaluated an interactive ERS within the MOOC platform CourseMapper, where learners could interact with the input (i.e., user profile), process (i.e., recommendation algorithm), and output (i.e., recommendations) of the system. We conducted a between-subjects user study (N=184) to examine how varying levels of user control in an ERS influenced users’ perceptions of the recommendation goals of perceived control, transparency, trust, satisfaction, and perceived quality. Our results show that enabling users to build and refine their profile is sufficient to promote positive perceptions of the ERS, while additional control options mainly reinforce these impressions. Moreover, perceived control is the only goal significantly affected by providing different levels of user control in the ERS, with input control exerting the strongest influence. Furthermore, different levels of control affect transparency, trust, satisfaction, and perceived quality in distinct yet interconnected ways. Overall, the findings provide empirical evidence that user control positively shapes transparency, trust, satisfaction, and perceived quality, though to varying extents.
[IR-16] Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning ACL2026
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)范式中,将原始检索到的文本直接注入大语言模型(Large Language Model, LLM)上下文时,导致检索信息与模型推理能力整合不佳的问题。其解决方案的关键在于引入语义标注(Verbal Annotations),即通过分析性叙事明确阐述查询与检索内容之间的逻辑关联,从而提升LLM生成准确且情境化响应的能力。基于此发现,作者提出了一种新型代理式RAG框架——Verbal-R3,其核心由生成器(Generator)和语义重排序器(Verbal Reranker)组成:前者执行迭代检索与推理,后者提供相关性评分及语义标注以引导生成过程;同时,通过相关性驱动的推理时扩展(relevance-guided test-time scaling),实现计算资源的高效分配,显著提升了复杂问答任务上的性能表现。
链接: https://arxiv.org/abs/2605.01399
作者: Sangkwon Park,Donghun Kang,Jisoo Mok,Sungroh Yoon
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: ACL 2026 Main Conference
Abstract:The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)'s context often results in suboptimal integration of retrieved information. This paper proposes to bridge retrieval results and the LLM’s reasoning ability through Verbal Annotations, analytic narratives that explicitly articulate the logical connection between a search query and retrieved contexts. Our empirical investigation reveals the potential of Verbal Annotations to substantially enhance the LLM’s ability to generate accurate, contextually-grounded responses. Motivated by this finding, we introduce Verbal-R3, a novel agentic RAG framework that consists of a Generator and a Verbal Reranker. The Generator performs iterative retrieval and reasoning, while the Verbal Reranker returns relevance scores and Verbal Annotations to guide the reasoning and answering process of the Generator. The inference process of Verbal-R3 is further refined through relevance-guided test-time scaling, which efficiently allocates test-time compute for effective trajectory expansion. Verbal-R3 achieves state-of-the-art performance on complex Question Answering benchmarks, validating the effectiveness of the proposed framework.
[IR-17] Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation
【速读】:该论文旨在解决标准检索增强生成(Retrieval-Augmented Generation, RAG)系统在现实决策场景中因用户查询存在认知偏差(如错误前提或确认偏误)而导致的“相关性-鲁棒性鸿沟”(Relevance-Robustness Gap)问题。在此类场景下,单纯追求语义相关性反而会检索到迎合用户偏见的虚假证据,加剧模型幻觉。解决方案的关键在于提出 CoRM-RAG 框架,其核心是基于因果干预的反事实风险最小化(Counterfactual Risk Minimization),通过引入认知扰动协议(Cognitive Perturbation Protocol)模拟训练阶段的用户偏见,并将其蒸馏为轻量级证据评判模块(Evidence Critic),该模块能够识别在对抗性查询扰动下仍具足够证据强度的文档,从而提升决策安全性与鲁棒性。
链接: https://arxiv.org/abs/2605.01302
作者: Peiyang Liu,Qiang Yan,Ziqiang Cui,Di Liang,Xi Wang,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学软件工程国家工程研究中心); PX Securities (平安证券); City University of Hong Kong (香港城市大学); Tencent Technology (腾讯科技); Peking University (北京大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Standard Retrieval-Augmented Generation (RAG) systems predominantly rely on semantic relevance as a proxy for utility. However, this assumption collapses in realistic decision-making scenarios where user queries are laden with cognitive biases, such as false premises or confirmation bias. In such cases, maximizing relevance paradoxically promotes the retrieval of sycophantic evidence that reinforces hallucinations, a critical failure we term the ``Relevance-Robustness Gap’'. To bridge this gap, we propose CoRM-RAG (Counterfactual Risk Minimization for RAG), a framework that aligns retrieval with decision safety rather than mere similarity. Grounded in causal intervention, we introduce a Cognitive Perturbation Protocol to simulate user biases during training, which is then distilled into a lightweight Evidence Critic. This scoring module learns to identify documents that possess sufficient evidential strength to steer the model toward correctness despite adversarial query perturbations. Extensive experiments on decision-making benchmarks demonstrate that CoRM-RAG significantly outperforms strong dense retrievers and LLM-based rerankers in adversarial settings, while enabling effective risk-aware abstention through reliable robustness scoring. Our code is available at this https URL.
[IR-18] Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
【速读】:该论文旨在解决迭代检索增强生成(Iterative Retrieval-Augmented Generation, iRAG)系统在处理复杂多跳问题时存在的两大瓶颈:一是细粒度证据定位困难(Coarse-grained attribution),即用户需手动从长文档中寻找基于模糊文本引用的证据;二是视觉语义损失(Visual semantic loss),即传统文本解析方式会丢弃包含图表、布局等空间逻辑的视觉文档信息。解决方案的关键在于提出一种称为“证据链”(Chain of Evidence, CoE)的可检索器无关(retriever-agnostic)视觉归因框架,该框架利用视觉语言模型(Vision-Language Models, VLMs)直接对检索到的文档截图进行推理,输出精确的边界框(bounding boxes),从而实现像素级可解释的推理链可视化,并在结构化网页(Wiki-CoE)和含复杂图表的幻灯片(SlideVQA)两个基准上验证了其有效性,显著优于仅依赖文本的基线方法。
链接: https://arxiv.org/abs/2605.01284
作者: Peiyang Liu,Ziqiang Cui,Xi Wang,Di Liang,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学软件工程国家工程研究中心); Peking University (北京大学); City University of Hong Kong (香港城市大学); Tencent Technology (腾讯科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) \textitCoarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) \textitVisual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present \textbfChain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: \textbfWiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and \textbfSlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at this https URL.
[IR-19] Multimodal Data Curation Through Ranked Retrieval ICLR
【速读】:该论文旨在解决多模态嵌入空间中因模态特异性过强和训练样本噪声导致的跨模态检索性能下降问题。具体而言,现有方法在训练过程中容易使嵌入空间偏向输入模态而非语义内容,同时混合异构的人工标注数据集时,标签噪声与模态偏差相互强化,进一步削弱了跨模态对齐效果。解决方案的关键在于两个核心组件:一是对称核子采样(Symmetric Nucleus Subsampling, SNS),通过修剪原始输入与标注中不一致的部分,提升训练对的语义一致性;二是专家嵌入引擎(Expert Embedding Engine, EEE),利用学习到的投影网络融合多个互补嵌入专家,并引入偏置感知目标函数以减少嵌入空间中的模态驱动分离。实验表明,该方法平均可将模态差距缩小超过90%,且作为数据编排策略显著优于分层采样和传统基线。
链接: https://arxiv.org/abs/2605.01163
作者: Pratyush Muthukumar,Harshil Kotamreddy,Sarah Amiraslani,Tomo Kanazawa,Ramani Akkati,Shaan Jain,Andrew Mathau
机构: NVIDIA (英伟达)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: ICLR DATA-FM 2026
Abstract:Shared embedding spaces are widely used for multimodal search and data curation. In practice, two problems often limit how well this works. First, embeddings can reflect modality more than meaning, so examples cluster by input type even when the underlying content matches. Second, the paired supervision used to train these spaces is often noisy. When we blend many heterogeneous, human-labeled datasets, these issues reinforce each other and degrade cross-modal retrieval. We present a framework that improves alignment by acting on both the training pairs and the embedding model. Symmetric Nucleus Subsampling (SNS) refines training pairs by trimming raw inputs and annotations to the portions that best support each other. Expert Embedding Engine (EEE) combines complementary embedding experts using a learned projection network, together with a bias-aware objective that reduces modality-driven separation in the embedding space. We demonstrate that this approach collapses the modality gap by over 90% on average vs base embedding experts and is a strong data curator, with datablends from our method outperforming stratified sampling and traditional curation baselines in downstream model performance.
[IR-20] Seeking Information with RAG -Assistants: Does Model Size Matter in Human-AI Collaborations?
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)研究中普遍存在的问题:即多数工作聚焦于提升基准测试性能,而忽视了在真实世界人机协作流程中的实际表现评估。为应对这一局限,作者设计了一项基于检索增强生成(Retrieval-Augmented Generation, RAG)的聊天机器人助手实验,在模拟职场场景的多轮信息查询任务中,考察人类与不同规模模型(3B、8B、70B参数)协同工作的效果。解决方案的关键在于引入真实用户参与的多轮交互环境,系统性比较RAG辅助模式与纯LLM或LLM+RAG基线的表现,并同时衡量任务性能、可用性(usability)和满意度(satisfaction),从而揭示模型规模对人机协作动态的影响及其与用户体验之间的复杂权衡关系。
链接: https://arxiv.org/abs/2605.00964
作者: Lennard C. Froma,Tom Kouwenhoven,Maaike H.T. de Boer,Catholijn M. Jonker,Max J. van Duijn
机构: Leiden University (莱顿大学); TNO (荷兰应用科学研究组织); TU Delft (代尔夫特理工大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.
[IR-21] “I Dont Know” – Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation
【速读】:该论文旨在解决用户对大型语言模型(Large Language Models, LLMs)生成内容的信任问题,尤其是因模型可能产生幻觉(hallucination)且表现出过度自信而导致的判断困难。其核心挑战在于如何在LLM输出中体现适当的自我反思性确定性(self-reflected certainty),以建立合理的用户信任。解决方案的关键在于提出CERTA(Certainty Enhanced RAG for Trustworthy Answers),这是一种增强型检索增强生成(Retrieval Augmented Generation, RAG)系统,通过显式建模问题、上下文与答案之间的相关性来量化并反映模型回答时的不确定性;同时构建了包含90个非客观问题的“确定性基准”(Certainty Benchmark),涵盖事实性、偏好、奉承和道德四个类别及三类上下文类型,实验表明CERTA能有效识别不确定回答、减少过度认同现象,并在道德判断任务中表现出谨慎行为。
链接: https://arxiv.org/abs/2605.00957
作者: Daan Di Scala,Maaike de Boer,Pınar Yolum
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: To be published in VALE 2025 Proceedings
Abstract:Achieving the right amount of trust in AI systems is important, but challenging. The problem is exacerbated with the rise of Large Language Models (LLMs) as they provide human-level communication capabilities, but potentially hallucinate in the content that they generate. Moreover, they express over-confidence in their answers, making it difficult for users to judge their truthfulness. An important human value that users seek is benevolence, which can be met by LLM’s self-reflection leading to reliable and honest answers. Accordingly, this paper proposes conveying appropriate levels of self-reflected certainty to build appropriate trust. Our contributions are twofold: 1) We develop CERTA (Certainty Enhanced RAG for Trustworthy Answers), a specialized Retrieval Augmented Generation (RAG) system that incorporates the relevance between question, context, and answer to reflect its uncertainty in answering questions; 2) We create the Certainty Benchmark with 90 question-context pairs of non-objective questions, divided over four categories (factuality, preference, sycophancy, morality) and three types of contexts (relevant, incomplete, irrelevant). We run experiments with a baseline RAG system and three CERTA settings using two LLMs. Our evaluations indicate that CERTA helps identify answers that are uncertain, decreases the cases of over-agreeing, and provides cautious behavior when prompted for moral judgments.
[IR-22] SCARV: Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets
【速读】:该论文旨在解决在自然语言处理(Natural Language Processing, NLP)中,基于样本级别的排序(sample-level ranking)在存在冗余结构(如重复样本、近似重复和改写句)时稳定性不足的问题。现有方法通常对每个训练样本独立评分并排序,但在实际数据集中,这种点对点(pointwise)评分策略会导致相似样本因随机种子不同而产生不稳定的相对顺序,从而影响后续的数据分析、过滤、调试与选择等任务的可复现性。解决方案的关键在于提出SCARV框架,其核心是将两个模块化组件结合:一是鲁棒的多随机种子聚合(robust multi-seed aggregation),用于提升整体排序的全局和局部稳定性;二是结构感知的聚类聚合/分配步骤(structure-aware aggregation/allocation),针对冗余簇进行优化分配,尤其在聚合预算有限或冗余簇具有信息量时显著增益。实验证明,SCARV并非通用的数据选择器,而是作为代理评分(proxy-induced rankings)在冗余NLP数据集上的稳定化聚合层,有效提升了基于排名决策(如子集选取和可疑样本检索)的可靠性。
链接: https://arxiv.org/abs/2605.00944
作者: Xu Zheng,Feiyu Wu,Linhong Wu,Zhuocheng Wang,Hui Li
机构: School of Cyber Engineering, Xidian University (西安电子科技大学网络与信息安全学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Sample-level rankings are increasingly used in data-centric NLP for analysis, filtering, debugging, and curation, yet existing pipelines typically score training examples pointwise and rank them as if they were independent. This assumption is fragile in the presence of exact duplicates, near-duplicates, paraphrases, and other redundant structure common in NLP corpora, where stochastic training can make highly similar examples receive unstable relative orderings across random seeds. We study stable sample-level ranking under redundancy and propose \textscSCARV, a modular aggregation framework that operates on top of an existing scoring proxy. \textscSCARV combines robust multi-seed aggregation with a structure-aware aggregation/allocation step over redundancy clusters. Across synthetic redundancy, naturally mined QQP redundancy, multiple proxy families, several NLP tasks, and end-to-end DistilBERT fine-tuning, \textscSCARV substantially improves over bare proxy rankings in global and local stability and yields more reproducible ranking-based decisions such as subset selection and suspicious-example retrieval. Our decomposition and compute-aware frontier sharpen the mechanism: robust multi-seed aggregation is the dominant generic stabilizer, while the structure-aware component adds value mainly under low aggregation budgets or when redundancy clusters are informative, naturally occurring, or sufficiently covered. These results position \textscSCARV not as a universal data selector or a universally dominant replacement for seed-only aggregation, but as a stability-oriented aggregation layer for proxy-induced rankings in redundant NLP datasets.
[IR-23] Validation of Whole-Slide Foundation Models for Image Retrieval in TCGA Data
【速读】:该论文旨在解决生成式 AI (Generative AI) 在全切片图像(Whole-Slide Image, WSI)检索任务中相对于传统基于patch的特征提取与监督聚合方法的优势是否显著这一关键问题。研究通过在TCGA数据集上对10种不同架构进行患者级留一法评估,发现尽管滑动窗口基础模型TITAN整体表现最优,但其优势有限;相比之下,基于注意力机制的多实例学习(ABMIL)和patch-level检索方法在Top-1与Top-3准确率上表现相当,且无任何单一模型在所有器官和诊断类别中始终占优。关键发现在于:性能差异主要由器官和诊断类型决定,而非模型架构本身,且patch级特征表示是驱动性能的核心因素,滑动级别聚合带来的增益有限,暗示在多数场景下聚合步骤可能冗余。因此,论文主张采用器官特异性基准测试、诊断感知或集成策略、更强的特征表示以及多模态融合框架,而非追求统一最优架构。
链接: https://arxiv.org/abs/2605.00902
作者: Tianhao Lei,Parsa Esmaeilkhani,Saghir Alfasly,Wataru Uegami,Judy C. Boughey,Matthew P. Goetz,Krishna R. Kalari,H.R. Tizhoosh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
Abstract:Foundation models are reshaping computational histopathology, yet their value for whole-slide image retrieval relative to strong patch-based and supervised aggregation baselines remains unclear. We benchmarked ten pipelines on 9,387 diagnostic slides spanning 17 organs and 60 diagnoses from The Cancer Genome Atlas (TCGA) using patient-level leave-one-patient-out evaluation. Methods included four pre-trained slide foundation models, a supervised attention-based multiple instance learning (ABMIL) aggregator on patch embeddings, and patch-level retrieval across five sampling densities. Performance varied more across organs and diagnoses than across architectures. Although the slide foundation model TITAN achieved the strongest overall results, its advantage was modest; ABMIL and patch-based methods reached comparable Top-1 and Top-3 accuracy, with no model consistently dominant. Morphologically distinctive entities approached ceiling performance, while rare, heterogeneous, and closely related subtypes remained challenging. Misclassifications aligned with organs exhibiting known inter-observer variability, suggesting an intrinsic ceiling for morphology-only retrieval. Performance was driven primarily by patch-level feature representations, with limited benefit from slide-level aggregation, indicating aggregation may be unnecessary in many settings. These findings argue against a universally optimal architecture and instead support organ-resolved benchmarking, diagnosis-aware or ensemble strategies, stronger feature representations, and multimodal retrieval frameworks. Notably, even the best model achieved only \approx 68% \pm 21% retrieval accuracy on TCGA, and some subtypes showed 0% accuracy across all methods, highlighting fundamental limitations of morphology-based representations and the need for substantial progress before reliable clinical deployment. Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR) Cite as: arXiv:2605.00902 [cs.CV] (or arXiv:2605.00902v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.00902 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-24] Retrieval-Guided Generation for Safer Histopathology Image Captioning
【速读】:该论文旨在解决生成式视觉语言模型(Generative Vision-Language Models)在病理学图像描述中存在幻觉、诊断结论过度具体及事实不一致等问题。解决方案的关键在于采用检索引导生成(Retrieval-Guided Generation, RGG)策略,即通过检索与当前图像视觉相似的已标注病例文本,并基于专家撰写的内容进行摘要式生成,而非从零开始生成描述。这种方法显著提升了语义一致性(在ARCH数据集上余弦相似度达≈0.60,优于MedGemma的≈0.47),并增强了病理术语的准确性与诊断的可靠性,同时具备更强的可审计性和透明性。
链接: https://arxiv.org/abs/2605.00893
作者: Md. Enamul Hoq,Wataru Uegami,Saghir Alfasly,Ghazal Alabtah,Sahar Rahimi Malakshan,Armita Kazemi,Alex T. Schmitgen,Fred Prior,H.R. Tizhoosh
机构: Kimia Lab, Department of Artificial Intelligence Informatics, Mayo Clinic, Rochester, MN, USA; Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA; Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA; Department of Computer Science and Engineering, Princeton University, Princeton, NJ, USA; Department of Computer Sciences, University of Wisconsin–Madison, Madison, WI, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Generative vision-language models can produce fluent medical image captions but remain prone to hallucination, over-specific diagnostic claims, and factual inconsistency-serious issues in pathology. We investigate retrieval-guided generation (RGG) as a safer alternative, where captions are formed by summarizing expert text from visually similar cases rather than generated de novo. On the ARCH histopathology dataset, RGG improves semantic alignment with ground truth, achieving cosine similarity of \approx 0.60 versus \approx 0.47 from MedGemma, with non-overlapping confidence intervals indicating a robust gain. A pathologist-led qualitative review shows better preservation of morphology-relevant terminology and fewer unsupported diagnoses, while revealing failure modes such as concept mixing and inherited over-specific labeling. Overall, retrieval-guided captioning offers a more transparent and reliable approach with clearer opportunities for auditing than fully generative methods.
[IR-25] Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis
【速读】:该论文旨在解决文本到视频检索(Text-to-Video Retrieval)任务中模型性能差异的根源问题,包括模型行为机制、数据集影响以及查询难度因素。其解决方案的关键在于构建一个统一的预处理与评估框架,对14种前沿检索方法在3个主流数据集上进行系统性评测,并深入分析字幕特征(如长度、清晰度、语义类别及动作 vs 场景平衡)与模型表现之间的关联。研究发现,短而清晰的单一动作或颜色描述能显著提升召回率,而复杂事件或多步骤活动仍是现有模型的难点;同时,基于注意力机制的架构更擅长处理时序依赖性强的查询,而双编码器和多模态融合模型则在简单或单类别字幕上表现更优。这一分析揭示了查询内容与模型架构之间的协同关系,为未来设计更高效的文本到视频检索系统提供了关键指导。
链接: https://arxiv.org/abs/2605.00826
作者: Maria-Eirini Pegia,Dimitrios Stefanopoulos,Björn Þór Jónsson,Anastasia Moumtzidou,Ilias Gialampoukidis,Stefanos Vrochidis,Ioannis Kompatsiaris
机构: Institute of Informatics and Telecommunications (信息技术与电信研究所); Reykjavik University (雷克雅未克大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Survey, 50 pages, 15 figures, 13 tables, 154 citations
Abstract:Text-to-video retrieval enables users to find relevant video content using natural language queries, a task that has grown increasingly important with the rapid expansion of online video. Over the past six years, research has produced numerous methods, such as dual encoders, attention-driven models, and multimodal fusion approaches; however, fundamental questions remain about model behavior, dataset influence, and query difficulty. In this work, we evaluate 14 state-of-the-art retrieval methods across 3 widely used datasets under a unified preprocessing and evaluation framework. We analyze caption characteristics, including length, clarity, semantic category, and Action vs. Scene balance, and link these to model performance. Our results show that short, clear, and simple captions, such as those describing single actions or color attributes, achieve higher recall, while complex events, multi-step activities, or fine-grained scene descriptions remain challenging for all existing models. Attention-driven architectures better handle temporally dependent or multi-step queries, whereas dual-encoder and multimodal fusion models perform well primarily on simpler or single-category captions. Cross-dataset generalization improves with larger, more diverse caption sets, but generative captions do not consistently enhance retrieval accuracy. Overall, our findings highlight key dataset factors, benchmark challenges, and the interplay between query content and model architecture, providing guidance for developing more effective text-to-video retrieval systems.
[IR-26] Multi-Axis Speech Similarity via Factor-Partitioned Embeddings
【速读】:该论文旨在解决传统单向量语音嵌入(speech embedding)无法分离语音中多个并行属性(如语言内容、说话人身份、方言、性别等)的问题,这些属性在单一向量空间中被混杂,限制了对特定属性的精确控制与检索。其解决方案的关键在于提出一种因子分割嵌入框架(factor-partitioned embedding framework),将每个语音片段映射为一个单一向量,但该向量的不同子空间对应于不同的属性轴(axis of variation)。通过共享的声学编码器配合各属性独立的线性投影头(projection heads),利用知识蒸馏或对比学习目标进行训练,最终实现基于属性条件的相似性计算:通过带符号的加权余弦分数组合各属性维度的相似度,支持联合考虑“说什么”和“如何说”,或显式抑制某一属性以突出另一属性,从而提升跨语料库检索中的语义匹配能力,例如抑制同说话人偏倚以识别不同录音条件下语义一致的语音片段。
链接: https://arxiv.org/abs/2605.02804
作者: Jim O’Regan,Jens Edlund
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR)
备注: 7 pages, accepted at Odyssey 2026
Abstract:Speech encodes multiple simultaneous attributes–linguistic content, speaker identity, dialect, gender–that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how --or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions. Comments: 7 pages, accepted at Odyssey 2026 Subjects: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR) Cite as: arXiv:2605.02804 [eess.AS] (or arXiv:2605.02804v1 [eess.AS] for this version) https://doi.org/10.48550/arXiv.2605.02804 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Jim O’Regan [view email] [v1] Mon, 4 May 2026 16:43:46 UTC (43 KB) Full-text links: Access Paper: View a PDF of the paper titled Multi-Axis Speech Similarity via Factor-Partitioned Embeddings, by Jim O’Regan and Jens EdlundView PDFHTML (experimental)TeX Source view license Current browse context: eess.AS prev | next new | recent | 2026-05 Change to browse by: cs cs.IR eess References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[IR-27] From Experimental Limits to Physical Insight: A Retrieval-Augmented Multi-Agent Framework for Interpreting Searches Beyond the Standard Model
【速读】:该论文旨在解决高能物理(High Energy Physics, HEP)领域中,关于标准模型之外的新物理搜索文献日益增长且信息异构的问题,即文本分析、数值数据集和图形排除限等多源信息分散在不同平台,导致物理学家需耗费大量时间进行手动整合与交叉比对。解决方案的关键在于提出HEP-CoPilot——一个基于检索增强的多智能体AI框架,其核心是通过统一获取出版物中的文本信息、HEPData中的结构化实验数据以及重构的物理图表,在多模态检索与推理架构下实现证据驱动的推理和对撞机结果的结构化解读。该框架结合检索增强语言模型与协同智能体工作流,显著提升了跨论文实验约束的一致性比较能力,从而加速新物理搜索的解释流程,并可作为科学协作者支持复杂文献的导航与异构证据的系统性组织。
链接: https://arxiv.org/abs/2605.02491
作者: Altan Cakir,Ayca Yerlikaya
机构: Istanbul Technical University (伊斯坦布尔技术大学)
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 18 pages, 13 figures
Abstract:Modern searches for physics beyond the Standard Model produce rapidly expanding literature containing heterogeneous information, including textual analyses, numerical datasets, and graphical exclusion limits. Integrating these distributed sources remains a time-consuming and manual process for physicists. We present HEP-CoPilot, a retrieval-augmented multi-agent AI framework for the exploration and interpretation of high-energy physics literature. The system unifies textual information from publications, structured experimental data from HEPData, and reconstructed physics plots within a multimodal retrieval and reasoning architecture. By combining retrieval-augmented language models with coordinated agent workflows, it enables evidence-grounded reasoning over experimental analyses and structured interpretation of collider results. We evaluate the framework on recent CMS searches for physics beyond the Standard Model. Case studies show that HEP-CoPilot can retrieve relevant measurements, reconstruct exclusion limits directly from HEPData records, and perform cross-paper comparisons of experimental constraints. This enables consistent, physics-aware comparison across analyses without manual data integration. These results demonstrate that retrieval-augmented AI systems can function as scientific co-pilots for particle physics, facilitating navigation of complex literature, structuring heterogeneous evidence, and accelerating the interpretation pipeline for new physics searches. Comments: 18 pages, 13 figures Subjects: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) Cite as: arXiv:2605.02491 [hep-ex] (or arXiv:2605.02491v1 [hep-ex] for this version) https://doi.org/10.48550/arXiv.2605.02491 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Altan Cakir [view email] [v1] Mon, 4 May 2026 11:42:14 UTC (9,184 KB)
[IR-28] oward a Scientific Discovery Engine for Weather and Climate Data: A Visual Analytics Workbench for Embedding-Based Exploration
【速读】:该论文旨在解决地球系统科学中高维、大规模数据集(如物理驱动的地球系统模型与AI驱动的天气和气候模型)在嵌入表示(embedding-based representations)下难以保证语义可解释性的问题,即嵌入空间中的最近邻可能反映的是预处理、地理特征或模型偏差,而非真实的气象结构。为应对这一挑战,论文提出一个开源的可视化分析工作台(visual analytics workbench),其核心在于将嵌入实验与原始数据、元数据、空间上下文及模型配置进行关联,使潜在空间的结果可追溯至物理机制;同时支持用户在不同模型的潜在空间中探索、执行全局或局部查询,并通过熟悉的气象视图检查相似事件,从而实现从已知数据中识别现象特征并迁移至更大、标注较少的数据集进行探测的发现式工作流。
链接: https://arxiv.org/abs/2605.00972
作者: Nihanth W. Cherukuru,Matt Rehme,Kirsten J. Mayer,David John Gagne,John Schreck,John Clyne,Charlie Becker
机构: NSF National Center for Atmospheric Research (美国国家大气研究中心)
类目: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 5 pages, 3 figures, Preprint
Abstract:Earth system science is producing increasingly large, high-dimensional datasets from physics based Earth system models to AI-based weather and climate models. Embedding-based representations can make these data searchable through similarity search and analog retrieval, but nearest neighbors in latent space are not automatically scientifically meaningful: it may reflect real weather structure, or preprocessing, geography, or model bias. Researchers therefore need ways to inspect how embeddings organize meteorological data, compare representation models, develop retrieval strategies, and verify results against physical evidence. We present an open-source visual analytics workbench for each of these steps. The system links embedding experiments to source data, metadata, spatial context, and model configurations, so latent-space results can be traced back to the physics. Users can explore latent spaces for different models, issue global or localized queries, and inspect analogs through familiar meteorological views. This enables a discovery workflow in which scientists characterize a phenomenon of interest in a well-understood dataset, identifying its signature in latent space, and then use that signature to probe larger, less-labeled archives or ensembles for similar events. We demonstrate the workbench through tropical-cyclone retrieval using ERA5-derived embeddings and IBTrACS metadata, and evaluate its out-of-core retrieval backend to show that large embedding collections can be searched beyond in-memory limits on commodity workstation hardware. Comments: 5 pages, 3 figures, Preprint Subjects: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR) Cite as: arXiv:2605.00972 [physics.data-an] (or arXiv:2605.00972v1 [physics.data-an] for this version) https://doi.org/10.48550/arXiv.2605.00972 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
人机交互
[HC-0] “I Dont Have Faith in the Developers to Use My Feedback”: Understanding Player Values and Expectancy for Reporting Systems in Video Games
【速读】:该论文旨在解决多人在线游戏中玩家报告系统(reporting system)有效性与用户动机之间的不匹配问题,即玩家在表达对游戏内不良行为的不满时,其期望的结果(如短期惩罚或长期社区改善)常因系统透明度不足、开发者信誉缺失或与社区价值观不一致而难以实现。解决方案的关键在于通过期望价值理论(expectancy-value theory)框架,深入理解玩家对报告行为的价值认知和预期结果,从而揭示当前报告机制设计中存在的缺陷,并提出应以提升开发者声誉、增强报告过程透明度及强化社区一致性为核心改进方向,以优化数字平台的治理效能与用户参与意愿。
链接: https://arxiv.org/abs/2605.02842
作者: Michael Yin,Chenxinran(Elise)Shen,Robert Xiao
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC)
备注: 29 pages, 2 figures, accepted at CSCW 2026
Abstract:Reporting systems in multiplayer video games allow players to express their dissatisfaction with others and combat in-game toxicity. In this work, we examined the act of reporting through the lens of expectancy-value theory. Using a distributed survey (n = 98) and follow-up interviews (n = 19), we explored the value players place on reporting, their desired outcomes, and their expectations that these outcomes will be achieved. Our findings revealed that reporting is motivated by both altruistic and retributive factors, with players seeking short-term revenge while also looking to foster an improved long-term community. Yet, players felt that reporting may not always meet these goals, with belief in the system being mediated by factors such as developer reputation, reporting transparency, and alignment with the community. By understanding the value and expectancy of reporting systems, we discuss their implications on broader digital moderation and consider current and potential future designs of reporting systems.
[HC-1] RACE: Temporal Reasoning over Context and Evidence for Activity Recognition in Smart Homes
【速读】:该论文旨在解决智能家庭环境中人类活动识别(Human Activity Recognition, HAR)的挑战,特别是由于日常活动在局部传感器模式上相似,且低侵入式传感导致观测稀疏和模糊,使得基于短时窗或事件窗口的方法难以捕捉可靠的长期时空上下文信息。其解决方案的关键在于提出一种名为TRACE(Temporal Reasoning over Context and Evidence)的上下文感知活动识别框架,该框架通过融合多源传感器证据与用户特定的上下文先验(contextual priors),将活动识别从局部分类问题转化为基于上下文推理的问题,从而有效缓解歧义、减少碎片化预测,并推断出更具语义特异性的活动类别。
链接: https://arxiv.org/abs/2605.02841
作者: Yingtian Shi,Abivishaq Balasubramanian,Jessica Herring,Jiachen Li,Juan Macias Romero,Rosemarie Santa Gonzalez,Varun Mishra,Agata Rozga,Xiang Zhi Tan,Thomas Plötz
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Human activity recognition (HAR) in smart homes remains challenging because many daily activities exhibit similar local sensor patterns, while minimally intrusive sensing provides sparse and ambiguous observations. As a result, methods based on short temporal or event windows often fail to capture the broader temporal and behavioral context needed for reliable activity understanding. We present TRACE (Temporal Reasoning over Context and Evidence), a contextual activity recognition framework for smart homes that integrates multi-source sensor evidence with user-specific contextual priors to improve activity interpretation. Rather than treating recognition as a local classification problem, TRACE leverages contextual reasoning to resolve ambiguities, reduce fragmented predictions, and infer more semantically specific activities. We evaluate TRACE on public benchmarks and in a deployment study conducted in our smart-home environment. Results show that TRACE improves recognition accuracy for semantically complex activities, produces more temporally coherent predictions that better align with user-specific routines, and maintains robust performance under cross-domain transfer and missing-modality conditions. These findings demonstrate the value of contextual reasoning for advancing smart-home HAR.
[HC-2] HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems
【速读】:该论文旨在解决组织设计中如何动态分配人类与人工智能(AI)任务这一核心挑战,尤其针对现有方法将人机协作视为二元选择的局限性。研究指出,实际场景中任务分配应根据上下文、疲劳程度和风险水平灵活调整,而当前缺乏有效的治理机制来平衡效率、监督与人类能力。解决方案的关键在于提出Human-AI Adaptive Symbiosis (HAAS)框架,其创新性地融合了两个耦合组件:一是规则驱动的专家系统,在学习前强制执行治理约束以确保安全合规;二是基于上下文-老虎机(contextual-bandit)的学习器,通过反馈结果从可行协作模式中自适应选择最优策略。HAAS还引入五维可审计的认知适配度指标与五级自主谱系,构建了一个跨软件工程与制造领域的可复现基准,从而实现对不同治理策略的评估与比较。
链接: https://arxiv.org/abs/2605.02832
作者: Vicente Pelechanoa,Antoni Mestre,Manoli Albert,Miriam Gil
机构: University of Valencia (瓦伦西亚大学); Polytechnic University of Valencia (瓦伦西亚理工大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:
Abstract:Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks or take complementary roles depending on context, fatigue, and the stakes involved. Governing that distribution – balancing efficiency, oversight, and human capability – remains an open problem. This paper presents Human-AI Adaptive Symbiosis (HAAS), an implemented framework for adaptive task allocation in software engineering and manufacturing. HAAS combines two coupled components: a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback. Task-agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum – from human-only to fully autonomous – embedded in a reproducible benchmark spanning both domains. Three empirical findings emerge. First, governance is not a binary switch but a tunable design variable: tighter constraints predictably convert autonomous AI assignments into supervised collaborations, with domain-specific costs and benefits. Second, in manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously – a workload-buffering effect that contradicts the usual framing of governance as pure overhead. Third, no single governance setting dominates across all contexts; moderate governance becomes increasingly competitive as the learner accumulates experience within the governed action space. Together, these findings position HAAS as a pre-deployment workbench for comparing and inspecting human–AI allocation policies before organisational commitment.
[HC-3] Exploring Instant Photography using Generative AI: A Design Probe with the UnReality Camera
【速读】:该论文试图解决的问题是:生成式 AI (Generative AI) 如何重塑即时摄影(instant photography)的体验性意义,特别是其具身性(embodied)和物质性(tangible)过程。解决方案的关键在于设计并部署“UnReality Camera”——一种由生成式 AI 介导的即时相机,用户通过语音输入作为生成式输入,相机将环境图像与AI生成内容融合后打印输出。实验表明,尽管用户重视对艺术创作的控制权,但随机不确定性带来的不可预测性反而激发了期待性悬疑(anticipatory suspense),而相机的物理形态则在人工生成内容背景下依然唤起用户的归属感与连接感,从而揭示了生成式 AI 嵌入后,人们对摄影体验的感知重构及其相互对立的 affordance(可供性)如何重新定义彼此的体验意义。
链接: https://arxiv.org/abs/2605.02805
作者: Michael Yin,Angela Chiang,Robert Xiao
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC)
备注: 16 pages, 3 figures, Accepted at ACM DIS 2026
Abstract:Generative AI has increasingly been used for artistic creation, but little work has explored how it shapes the experiential meaning of practice. We consider how generative AI might transform the embodied and tangible process of instant photography through the UnReality Camera, an AI-mediated instant camera. The UnReality Camera prints a photo of the environment augmented by a user’s spoken words as generative input. In a design probe, we explored how generative AI shapes people’s perceptions of both photographic output and the broader photographic process. Although users valued artistic control, they also appreciated the creativity afforded by stochastic unpredictability. The waiting period for an unpredictable output elicited anticipatory suspense, and the camera’s physical form evoked ownership and connection despite artificial generation. We discuss how people make sense of instant photography’s experiential qualities when generative AI is embedded, and how their opposing affordances reshape interpretations of each other’s experiential meaning.
[HC-4] U-Define: Designing User Workflows for Hard and Soft Constraints in LLM -Based Planning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在用户任务规划中因黑箱特性导致的可靠性与可控性不足问题,尤其关注用户如何有效应用约束条件来引导生成计划,同时应对现实世界中的不确定性。现有方法要么采用过于刚性的硬约束(hard constraints),难以表达复杂意图,要么引入数值灵活性权重(flexibility weights),易造成用户困惑。解决方案的关键在于提出U-Define系统,通过将约束抽象为高阶类型——必须严格遵守的“硬规则”和允许适度调整的“软偏好”,并分别采用形式化模型检测(formal model checking)和LLM作为评判者(LLM-as-judge)的互补验证机制,从而提升用户对意图表达的准确性、控制力及整体满意度,同时保持良好的可用性。
链接: https://arxiv.org/abs/2605.02765
作者: Christine P Lee,Xinyu Jessica Wang,Aws Albarghouthi,David Porfirio,Bilge Mutlu
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); George Mason University (乔治梅森大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:LLMs are increasingly used for end-user task planning, yet their black-box nature limits users’ ability to ensure reliability and control. While recent systems incorporate verification techniques, it remains unclear how users can effectively apply such rigid constraints to represent intent or adapt to real-world variability. For example, prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users. We investigate how interaction workflows can better support users in applying constraints to guide LLM-generated plans, examining whether abstracting strictness into high-level types (i.e., hard and soft) paired with distinct verification mechanisms helps users more reliably express and align intent. We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility. U-Define verifies these types through complementary methods: formal model checking for hard constraints and LLM-as-judge evaluation for soft ones. Through a technical evaluation and user studies with general and expert participants, we find that user-defined constraint types improve perceived usefulness, performance, and satisfaction while maintaining usability. These findings provide insights for designing flexible yet reliable constraint-based workflows.
[HC-5] riple Spectral Fusion for Sensor-based Human Activity Recognition
【速读】:该论文旨在解决基于惯性测量单元(Inertial Measurement Units, IMUs)的传感器人体活动识别(Human Activity Recognition, HAR)中,由于异构传感器数据融合复杂性和长期上下文关联建模困难,导致时序信息融合效果不佳的问题。其解决方案的关键在于提出了一种新颖的三重频域融合框架:首先通过自适应互补滤波技术对噪声进行抑制,并将每个IMU的传感器划分为姿态和运动模态节点;其次,在图傅里叶域内实施自适应滤波以实现同质与异质节点信息的有效融合;最后引入自适应小波频率选择方法以抑制上下文冗余并压缩特征长度,从而增强基于时间戳的图聚合能力及长期上下文相关性。该框架在傅里叶域、图傅里叶域和小波域均采用自适应滤波机制,实现了多传感器高效融合与上下文关联建模。
链接: https://arxiv.org/abs/2605.02743
作者: Ye Zhang,Longguang Wang,Qing Gao,Chaocan Xiang,Mohammed Bennamoun,Yulan Guo
机构: Sun Yat-sen University (中山大学); Aviation University of Air Force (空军航空大学); Chongqing University (重庆大学); University of Western Australia (西澳大利亚大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR. First, we develop an adaptive complementary filtering technique for noise suppression and organize each IMU’s sensors into posture and motion modality nodes. Given that IMU nodes form a dynamic heterogeneous graph, we then apply adaptive filtering within the graph Fourier domain to merge both homogeneous and heterogeneous node information. Furthermore, an adaptive wavelet frequency selection approach is implemented to suppress context redundancy and shorten the length of features. This approach enhances both timestamp-based graph aggregation and the correlation of long-term contexts. Our framework uses adaptive filtering in the Fourier, graph Fourier, and wavelet domains, enabling effective multi-sensor fusion and context correlation. Extensive experiments on ten benchmark datasets demonstrate the superior performance of our framework. Project page: this https URL.
[HC-6] Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents
【速读】:该论文旨在解决当前通用计算机使用代理(computer-use agents)在面对未见过且不断演化的用户界面(User Interface, UI)时泛化能力不足的问题。其核心挑战在于,尽管代理的智能水平不断提升,但若界面设计未考虑代理的认知先验知识,则易导致交互失败。解决方案的关键在于从人机交互领域的经典可用性启发式原则(Nielsen’s 10 usability heuristics)出发,重新审视哪些原则可自然迁移至代理场景,识别出因隐含设计假设引发的代理特异性失效,并引入安全的增强型设计改进(safe additive augmentations),从而提升代理对UI变化的鲁棒性,同时不损害人类用户的操作体验。通过构建受控环境UI-Verse进行实验验证,结果表明增强后的启发式设计能显著提高任务完成率,并在效率上带来适度改善,且不影响人类可用性,证明了界面设计作为提升代理可靠性与泛化能力的可行互补路径。
链接: https://arxiv.org/abs/2605.02729
作者: Jiateng Liu,Rushi Wang,Bingxuan Li,Kunlun Zhu,Yifan Shen,Qingyun Wang,Ahmed Abbasi,Denghui Zhang,Heng Ji
机构: University of Illinois Urbana-Champaign; The College of William Mary; University of Notre Dame; Stevens Institute of Technology
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, 8 figures
Abstract:Recent advances have enabled general computer-use agents that interpret screens and execute grounded actions from human instructions, yet they still struggle to generalize to unseen and evolving interfaces. While improving agent capability remains important, agent compatible interface design offers a complementary path by aligning interaction semantics with agent prior knowledge. In this paper, we revisit Nielsen 10 usability heuristics through the lens of computer-use agents, identifying which principles naturally transfer, where implicit design assumptions create agent specific failures, and how safe additive augmentations can improve robustness without harming human usability. To evaluate these ideas, we introduce UI-Verse, a suite of controlled environments built around functionally similar interfaces with different applied heuristics. Experiments show that our augmented heuristics consistently improve task completion and modestly improve efficiency, with combined heuristics yielding further gains. Human studies further show that these designs preserve the original interaction workflow without observable usability regressions. Overall, our findings highlight interface design as a practical complementary avenue for improving the reliability and generalization of computer use agents.
[HC-7] ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor for Pair Programming
【速读】:该论文旨在解决当前自适应学习系统普遍以个体为中心、反应式干预难以有效支持协作学习的问题,特别是针对配对编程(pair programming)中协同注意力(Joint Visual Attention, JVA)、认知努力(Joint Mental Effort, JME)等动态协作状态缺乏前瞻性调节的挑战。其解决方案的关键在于提出 ProPACT——一个基于多模态双人学习者模型的主动式 AI 驱动协作导师,通过融合 JVA、JME 及个体认知负荷构建动态预测模型,并采用 XGBoost 算法提前 30 秒预测次优协作状态,进而触发分层自适应策略,在保持协作流畅性的同时提供最小侵入性支架支持,实现对协作过程的实时调控与优化。
链接: https://arxiv.org/abs/2605.02703
作者: Anahita Golrang,Kshitij Sharma,olga viberg
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Effective pair programming depends on coordination of attention, cognitive effort, and joint regulation over time, yet most adaptive learning systems remain individual-centric and reactive. This paper introduces ProPACT, a proactive AI-driven adaptive collaborative tutor that treats collaboration itself as the object of instruction. ProPACT constructs a multimodal dyadic learner model based on Joint Visual Attention (JVA), Joint Mental Effort (JME), and individual mental effort, and employs an XGBoost-based forecasting model to predict emerging suboptimal collaboration states up to 30 seconds in advance. These predictions drive a hierarchical adaptive policy that delivers minimally intrusive scaffolds while fading support during productive collaboration. A within-subject study with 26 pair-programming dyads shows that proactive feedback significantly improves debugging success, task efficiency, feedback uptake, and post-intervention gains in JVA and JME, demonstrating the potential of forecast-driven dyadic adaptivity for real-time collaborative learning regulation.
[HC-8] he 2026 ACII Dyadic Conversations (DaiKon) Workshop Challenge
【速读】:该论文旨在解决当前对话情感建模研究中普遍存在的“以说话者为中心”(speaker-centric)的局限性,即现有基准测试未能充分刻画双人对话中交互双方之间动态耦合、时序协调及关系演变等复杂社会互动过程。其关键解决方案是构建一个名为ACII-DaiKon的多子任务挑战赛基准,基于Hume-DaiKon数据集(包含945组跨五种语言的自然情境双人对话,共计743.4小时音视频数据),设计三个协同子任务:(1)方向性人际影响预测,(2)发言权切换预测(下一说话者与下次发言时间),(3)全程互动中的亲和力轨迹预测。该方案通过固定训练/验证/测试划分、标准化评估指标(CCC、Pearson相关系数、Macro-F1、MAE)以及提供基线系统,支持多模态建模、时序推理和跨场景泛化能力的系统性评估,从而推动对双向依赖关系与长期人际动态建模的研究进展。
链接: https://arxiv.org/abs/2605.02672
作者: Panagiotis Tzirakis,Alice Baird,Jeffrey Brooks,Emilia Parada-Cabaleiro,Lukas Stappen,Sharath Rao,Theo Lebryk,Jakub Piotr Clapa,Jens Madsen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction across full interactions. The challenge is built on the Hume-DaiKon dataset, comprising 945 dyadic conversations (743.4 hours of audiovisual data) collected under naturalistic conditions across five languages. The benchmark supports multimodal modeling, temporal reasoning, and cross-context generalization through fixed train/validation/test splits, standardized metrics, and released baseline systems. Evaluation uses Concordance Correlation Coefficient (CCC), Pearson correlation, Macro-F1, and Mean Absolute Error (MAE) depending on the sub-challenge. Baseline experiments establish initial reference performance, with best test results of 0.40 CCC and 0.50 Pearson for influence prediction, 0.66 Macro-F1 and 1.50~s MAE for turn-taking, and 0.68 CCC and 0.70 Pearson for rapport trajectory modeling. These results indicate that while current methods capture coarse dyadic patterns, robust modeling of directional dependence and long-horizon interpersonal dynamics remains challenging. The workshop provides a shared platform for rigorous comparison and cross-disciplinary discussion on data validity, evaluation protocols, and culturally aware modeling for dyadic interaction. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC) Cite as: arXiv:2605.02672 [cs.AI] (or arXiv:2605.02672v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.02672 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-9] Dramaturgies of Deception: AI Humanizers and the Performance of Legitimacy in Higher Education Assessment
【速读】:该论文试图解决生成式 AI (Generative AI) 在高等教育评估中引发的“表演性循环”问题,即学生利用AI人类化服务(AI humanizers)规避检测、伪造独立创作行为,进而促使机构加强监控,形成技术对抗的恶性循环。其关键解决方案在于:不应依赖技术手段(如AI检测工具或更复杂的文本伪装技术)来应对这一现象,而应从结构层面改革评估方式,打破当前以形式主义和监控为导向的评估机制,从根本上重构教育评价体系以适应技术变革。
链接: https://arxiv.org/abs/2605.02649
作者: Jasper Roe(1),Mike Perkins(2),Peter Bannister(3),Leon Furze(4),James Wood(1) ((1) Durham University, United Kingdom, (2) British University Vietnam, Vietnam, (3) International University of La Rioja, Spain, (4) Deakin University, Australia)
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 7 tables
Abstract:Artificial intelligence (AI) has disrupted assessment in higher education and accelerated a cycle of compounding performances. Institutional policies demand the demonstration of independent authorship, while commercial AI-enabled services allow students to simulate independent thought and writing. This has led to enhanced institutional surveillance, including AI detectors, which are subsequently circumvented using other technologies. AI humanizers, internet-based services that alter AI-generated text to avoid automated or human detection, are a recent symptom of this performative cycle. Little is known about how these services operate, how they appeal to users, and what they imply for educational assessment and integrity. This paper presents an exploratory, systematic investigation of AI humanizer websites, framed through Goffman’s sociological account of dramaturgy. Using a systematic search and custom rubric, we cataloged 55 humanizer sites, assessed their performance of identity, and conducted an in-depth multimodal critical discourse analysis of a purposive sample of three sites. Findings show that humanizers are readily available, offer free and premium paid services, and appear to perform similar functions. These include the deletion and discursive absence of misconduct, the framing of AI humanization as a rational and defensible response to surveillance and flawed detection, and appeals to mystification through advanced technology and implied endorsement by universities and corporations. We argue that humanizer services should be viewed as a diagnostic signal: a legible node in a feedback loop of performative assessment. Disrupting this cycle requires structural assessment reform rather than technological solutionism.
[HC-10] Evaluating Different Modalities of Behavioral Approach Tests for Spider Phobia in Virtual Reality WWW
【速读】:该论文旨在解决传统行为回避测试(Behavioral Approach Test, BAT)在评估特定恐惧症(如蜘蛛恐惧症)时存在的环境不可控与刺激变异大等问题。传统BAT依赖真实情境中的焦虑诱发刺激(如活体蜘蛛),其行为表现受个体差异和多次实验间变量干扰,影响结果的一致性与可重复性。论文提出以虚拟现实(Virtual Reality, VR)为基础的BAT实现方案作为解决方案,其关键在于通过标准化、可控的VR环境精确再现焦虑诱发刺激(虚拟蜘蛛),从而确保实验条件的一致性,并提升参与者的沉浸感(presence)。研究发现,VR-BAT不仅符合既定的沉浸感标准,且不同设计选项显著影响主观体验,同时生理指标(如皮肤电反应)与沉浸感存在相关性,表明该方法在量化回避行为方面具有较高潜力,是评估蜘蛛恐惧症患者回避行为的有效工具。
链接: https://arxiv.org/abs/2605.02546
作者: Florian Grensing,Vanessa Schmuecker,Anne Hildebrand,Tim Klucken,Maria Maleshkova
机构: Helmut-Schmidt-University (汉堡应用技术大学); University of Siegen (锡根大学); Bielefeld University (比勒费尔德大学); University of the Federal Armed Forces in Hamburg (汉堡联邦国防军大学)
类目: Human-Computer Interaction (cs.HC)
备注: 12 Pages, 4 Figures, Accepted Version in this https URL
Abstract:Behavioral approach tests are a common means of assessing specific phobias. In these tests, participants move towards an anxiety-inducing stimulus as close as they are willing to, with the final distance indicating the severity of the anxiety. In this work, we aim to evaluate a virtual reality implementation of the BAT. For this purpose, four different BATs were designed, consisting of two approach methods, both replicated in vivo and in virtuo. Evaluation of these BATs is done by using a standardised presence questionnaire, application-specific questions, as well as the physiological reactions of the participants. The study focuses on the fear of spiders and uses a real and virtual spider as an anxiety-inducing stimulus. Our results show that the developed VR BAT perform within established presence norms, while the different modalities influenced participants’ subjective impressions. Furthermore, the standardized structure of the VR environment ensured a consistent experience regarding the anxiety-inducing stimulus. This differs from the observation in the real-world setting, where the behavior of the spider might differ between individuals and also between sessions. This highlights one of the key advantages of virtual reality: complete control over the stimulus and environment. Correlations between presence and physiological signals were found. Particularly, tonic electrodermal activity levels are more stable with increased presence. However, more research into this is required, as the effects of anxiety on the physiological signals make the correlations difficult to interpret. The evaluation has revealed, which design choices are particularly promising for increasing presence in VR applications, and some which should be avoided. Overall, these results indicates that our VR-based implementation is a promising tool for assessing avoidance behavior for individuals with spider phobia.
[HC-11] Robotic Affection – Opportunities of AI-based haptic interactions to improve social robotic touch through a multi-deep-learning approach
【速读】:该论文旨在解决人机交互中情感性社会触觉(affective social touch)的难题,例如握手或安慰性抚摸等行为在机器人中的实现问题。当前研究受限于单一模态的机械执行方式,难以模拟人类触觉感知与情感反馈的复杂性,导致出现“触觉恐怖谷效应”(haptic uncanny valley)。解决方案的关键在于提出一种多模态架构,将情感触觉任务分解为若干专业化子任务,并借鉴神经生物学原理,将其视为一个分布式、闭环的感知过程而非单一运动行为;通过同侪对等的状态共享框架,实现跨领域协作与可扩展的仿真到现实(Sim-to-Real)开发路径,从而构建统一且具表现力的社会机器人系统。
链接: https://arxiv.org/abs/2605.02538
作者: Ali Askari,Jens Gerken
机构: TU Dortmund University (多特蒙德工业大学)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: AI for Haptics and Haptics for AI: Challenges and Opportunities Workshop at the 2026 CHI Conference on Human Factors in Computing Systems (CHI 26), April 13 - 17 2026, Barcelona, Spain
Abstract:Despite the advancement in robotic grasping and dexterity through haptic information, affective social touch, such as handshaking or reassuring stroking, remains a major challenge in Human-Robot-Interaction. This position paper examines current progress and limitations across artificial intelligence, haptics and robotics research, and proposes a novel multi-model architecture to address these gaps. Drawing inspiration from neurobiology, we decompose affective touch into distinct, specialized subtasks models. By treating affective touch as a distributed, closed-loop perceptual task rather than a monolithic motoric movement, we aim to overcome the “haptic uncanny valley” through a peer-to-peer, state-sharing framework. Our approach supports scalable and cumulative development within a Sim-to-Real pipeline, fostering interdisciplinary collaboration. By enabling haptics, AI, and robotics researchers to contribute independently yet coherently, we outline a pathway toward a unified, expressive system for social robotics.
[HC-12] Shared Autonomy Assisted by Impedance-Driven Anisotropic Guidance Field
【速读】:该论文旨在解决共享自主(Shared Autonomy, SA)系统中人机意图理解不对称的问题,即现有研究侧重于提升机器人对人类意图的推理能力,却忽视了人类能否有效理解机器人的意图,导致协作效率下降和用户体验不佳。为填补这一空白,论文提出了一种基于阻抗控制(Impedance Control)的新范式——阻抗驱动各向异性引导场增强型共享自主(Impedance-Driven Anisotropic Guidance Field Enhanced Shared Autonomy, IAGF-SA),其关键在于引入一种具身化的、物理基础的通信通道,通过自适应调节机器人对人类输入的动力学响应,实现连续、直观且物理可感知的机器人意图表达,同时自然引导人类行为,从而显著提升任务表现、人机一致性及主观体验。
链接: https://arxiv.org/abs/2605.02410
作者: Sihan Chen,Hang Xu,Yupu Lu,Chen Wang,Benfang Duan,Ruixing Jia,Jia Pan
机构: The University of Hong Kong (香港大学); Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 8 pages, 7 figures. Accepted for publication in IEEE Robotics and Automation Letters
Abstract:Shared autonomy (SA) enables robots to infer human intent and assist in its achievement. While most research focuses on improving intent inference, it overlooks whether humans can understand the robot’s intent in return. Without such mutual understanding, collaboration becomes less effective, degrading user experience and task performance. To address this gap, previous studies have explicitly conveyed the robot intent through additional interfaces, which remain unintuitive and limited in expressiveness. Inspired by impedance control, we propose Impedance-Driven Anisotropic Guidance Field Enhanced Shared Autonomy (IAGF-SA), a novel paradigm that extends SA with an embodied, physically-grounded communication channel. This channel adaptively modulates the robot’s dynamic response to human input, enabling intuitive, continuous, physically-grounded robot intent communication while naturally guiding human actions. User studies across three scenarios and two teleoperation interfaces indicate that IAGF-SA improves task performance, human-robot agreement, and subjective experience, thus demonstrating its effectiveness in enhancing human-robot communication and collaboration.
[HC-13] A Low-Code Approach for the Automatic Personalization of Conversational Agents
【速读】:该论文针对模型驱动工程(Model-Driven Engineering, MDE)领域中用户建模(User Modeling)研究碎片化、维度覆盖不全及缺乏动态演化能力的问题展开系统性文献综述(SLR)。当前研究多集中于易获取的静态特征,且工具支持仅限于用户模型的创建,未能实现基于用户交互的自动更新与个性化应用适配。解决方案的关键在于推动社区达成共识,构建一个统一且可复用的用户模型框架,涵盖现有文献中的全部维度并融合其他领域(如社会学)的用户画像经验;同时,引入基于机器学习(Machine Learning, ML)的新一代方法,实现从用户交互数据中自动、增量式地提取用户特征,并通过自动化流水线将用户信息转化为具体的应用个性化调整,从而提升软件系统的适应性与用户体验。
链接: https://arxiv.org/abs/2605.02384
作者: Aaron Conrardy,Alfredo Capozucca,Jordi Cabot
机构: Luxembourg Institute of Science and Technology (卢森堡科学与技术研究所); University of Luxembourg (卢森堡大学)
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: To appear in the main track of the International Conference on Web Engineering (ICWE 2026)
Abstract:In this paper, we conducted an SLR on the state of user modeling in the MDE domain. Results show a diverse set of disconnected proposals, covering a partial number of dimensions with an emphasis on those characteristics that are easier to profile. Moreover, most dimensions are regarded as fixed instead of allowing their dynamic evolution during the interaction with the software application. It is also worth noting that tool support is also rather limited, mostly limited to enabling the creation of the user models itself. The roadmap we hope to see in this area stems from the discussion points seen above. For instance, we believe the community should agree on a unified and re-usable user model, covering the superset of all dimensions present in the literature. Plus additional ones we could learn from user profiling in other domains (e.g. sociology). On the technical side, we expect to see a new generation of ML-based proposals to automatically and incrementally derive a user profile from the analysis of user interactions and a number of automatic pipelines able to transform the user information in concrete application adaptations that personalize the application to cater to the user’s needs and profile. Comments: To appear in the main track of the International Conference on Web Engineering (ICWE 2026) Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC) Cite as: arXiv:2605.02384 [cs.SE] (or arXiv:2605.02384v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.02384 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-14] From Here to There: Exploring Proximity Semantics in Multimodal Data Exploration
【速读】:该论文旨在解决现代数据探索工具在捕捉用户分析意图方面的局限性,尤其是在用户试图发现难以通过传统查询方法或自然语言单独表达的复杂模式时。其核心问题是现有系统缺乏对多模态输入(如草图、自然语言和视觉标注)之间语义关联的有效建模与整合能力。解决方案的关键在于提出一种融合自由形式草图、自然语言和视觉标注的多模态研究探针(multimodal research probe),并在统一交互空间中利用几何草图匹配与视觉语言模型(Visual Language Models, VLMs)组成的混合架构,实现模式匹配与语义约束的协同推理。研究进一步揭示了“邻近语义”(Proximity Semantics, PS)这一现象,即用户通过多模态元素在共享交互空间中的相对位置关系来消解歧义,从而为未来多模态数据探索系统的交互设计提供了基于实证行为的新视角。
链接: https://arxiv.org/abs/2605.02261
作者: Dennis Bromley,Diana Wang,Vidya Setlur
机构: Tableau Research
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 5 figures
Abstract:Modern data exploration tools often struggle to capture the subtleties of analytical intent, especially when users seek patterns that are difficult to specify using traditional query methods or natural language alone. We introduce a multimodal research probe for querying time-series and geospatial data that integrates free-form sketching, natural language, and visual annotations within a unified interaction space. Users articulate queries by sketching trends or spatial paths and augmenting them with annotations and analytical directives grounded in shared spatial and temporal context. The system employs a hybrid architecture combining geometric sketch matching and visual language models (VLMs) to support queries that interleave pattern matching and semantic constraints. Through a preliminary study with 20 participants, we observed recurring interaction patterns in which participants used spatial, temporal, and visual proximity to relate sketches, annotations, and language. Rather than treating these as isolated inputs, participants relied on their relative placement to disambiguate meaning. We analyze these behaviors as evidence for proximity semantics (PS), a form of deictic disambiguation in which meaning is shaped by the closeness of multimodal elements within a shared interaction space. We present PS as a conceptual lens grounded in observed user behavior, and discuss its implications for the design of future multimodal data exploration systems.
[HC-15] Stochastic Modeling of Human-Machine Authentication Channels under Partial Information Leakage
【速读】:该论文旨在解决物联网(IoT)环境中基于PIN的用户认证机制在部分信息泄露情况下的可靠性退化问题,传统评估方法将此类通道视为全安全或全失效的二元状态,忽略了实际场景中因局部符号暴露导致的渐进式可靠性下降。解决方案的关键在于构建一个上下文条件化的概率推理框架,将缺失的数字视为隐变量,并利用带有回退先验的平滑条件概率分布进行估计;该方法不显式建模隐藏状态转移或发射概率,而是通过上下文驱动的概率推断来近似跨数字位置的潜在依赖关系,从而量化部分暴露下的可靠性损失与服务质量(QoS)下降。
链接: https://arxiv.org/abs/2605.02102
作者: Nilesh Chakraborty,Mohammad Zulkernine,Burak Kantarci
机构: University of Ottawa (渥太华大学); Queen’s University (皇后大学)
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 7 pages, 3 figures, Accepted to 2026 IEEE International Conference on Cyber Security and Resilience (CSR)
Abstract:Reliable and secure human-machine communication is fundamental to IoT and cyber-physical ecosystems, where smartphones and wearables commonly serve as authentication controllers. PIN-based authentication can be viewed as a low-bandwidth communication channel through which users transmit numeric credentials under practical constraints. However, conventional evaluations adopt a binary view of security-treating such channels as either fully secure or fully compromised-thereby overlooking the progressive reliability degradation caused by partial information leakage in real-world IoT settings. In this paper, we model the PIN entry process as a stochastic human-IoT communication system and propose a context-conditioned probabilistic inference framework to quantify reliability loss and Quality-of-Service degradation under partial symbol exposure. The proposed approach treats missing digits as latent variables and estimates them using smoothed conditional probability distributions with fallback priors. Unlike traditional sequential models that assume contiguous positional dependencies, the method does not explicitly parameterize hidden-state transitions or emissions; instead, it performs context-driven probabilistic inference to approximate latent dependencies across digit positions. Using over one million real-world four-digit PIN samples, we evaluate single-, double-, and triple-digit leakage scenarios and derive position-dependent reliability metrics. The proposed model achieves up to 55.31% prediction accuracy for one missing digit and 12.12% for three missing digits, while consistently outperforming a standard sequence-model baseline and classical machine learning models in terms of precision, recall, and F1-score. These results formalize PIN entry as a noisy human–IoT communication channel and demonstrate substantial reliability degradation under realistic partial exposure conditions.
[HC-16] Cripping AI: Reimagining AI Through Lived Disability Experiences
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)研究与开发中忽视残疾人群体 lived experience(生活经验)的问题,尤其批判性地指出现有AI系统往往隐含能力主义(ableist)假设,未能真正将残障者的知识体系和劳动价值纳入设计过程。其解决方案的关键在于提出“cripping AI”(残障化AI)这一理论框架,通过三个核心路径实现:第一,揭示并瓦解AI在想象、设计与评估阶段所嵌入的能力主义偏见;第二,以残障者特有的认知方式(即“cripistemologies”,残障认识论)为中心重构知识生产;第三,尊重残障者在共创可访问实践中的劳动贡献。该框架通过聋人与手语AI、盲人与视觉辅助AI、口吃与语音AI三个案例得到实证应用,为未来从多元身体心智出发、贯穿AI全生命周期并协同其他正义导向AI工作的研究指明方向。
链接: https://arxiv.org/abs/2605.02080
作者: Xinru Tang,Ting-an Lin,Jingjin Li,Shaomei Wu
机构: University of California, Irvine (加州大学欧文分校); AImpower.org (AImpower.org); University of Connecticut (康涅狄格大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Drawing on crip theory, this paper proposes cripping AI as a guiding framework to center lived disability experiences in AI research and development. Moving beyond calls to make AI “accessible” to people with disabilities, cripping AI seeks to: (1) reveal and dismantle ableist assumptions embedded in how AI is imagined, designed, and evaluated; (2) center disabled ways of knowing (i.e., cripistemologies); (3) respect disabled labor in co-creating accessible practices. We demonstrate how to apply our framework with three cases: deafness and sign language AI, blindness and visual assistive AI, and stuttering and speech AI. We end by outlining three directions for future work, including cripping AI with diverse human bodyminds, across the entire AI pipeline and ecosystem, and in collaboration with other justice-oriented AI efforts.
[HC-17] Principles and Guidelines for Randomized Controlled Trials in AI Evaluation
【速读】:该论文旨在解决当前人工智能(AI)评估中缺乏统一、严谨的随机对照试验(RCT)标准的问题,尤其是在人类表现作为核心指标的场景下。其解决方案的关键在于构建一个包含五项原则(内部效度、外部效度、建构效度、统计结论效度及透明性、可重复性和可验证性)的标准化框架,并将其具体化为33条适用于AI评估RCT的指导原则,涵盖设计要求、实施说明与证据基础。该框架不仅强化了因果推断能力(通过RCT方法论),还整合了异质性分析、实际显著性评估以及AI特有的挑战应对机制(如模型版本管理、人机交互动态、污染与溢出效应等),从而为AI评估提供兼具科学严谨性与实践可行性的系统性工具和未来标准制定蓝图。
链接: https://arxiv.org/abs/2605.02050
作者: Christopher Kelly,Angelica Chowdhury,Alexandra Campili,Bimpe Ayoola,Devin Barbour,Thomas Chen Dawson,Ze Shen Chin,Rokas Gipiškis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 27 pages, Technical AI Safety Conference
Abstract:This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Drawing on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the (Shadish et al., 2002) four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. We position the principles and guidelines as serving three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work by centering evaluation on human performance rather than model output alone, formalizing causal inference through RCT methodology for AI contexts, integrating heterogeneity analysis and practical significance assessment, implementing a graded transparency and repeatability framework, and addressing AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.
[HC-18] Privy: From Fine Print to Fair Practice in Privacy Rights Exercise
【速读】:该论文旨在解决用户在实际操作中难以行使隐私权利的问题,尤其是在面对模糊的隐私政策解读和不友好的网页界面设置时。解决方案的关键在于引入Privy——一个基于大语言模型(Large Language Model, LLM)的浏览器助手,其核心能力是自动分析网站隐私政策并以可操作的标签形式呈现可用权利,同时提供分步引导、直接链接、邮件模板生成及表单填写协助,并支持按需获取政策证据与隐私教育内容。该方案将隐私政策理解与具体隐私行动整合为统一交互流程,显著提升了用户对隐私权利的认知与执行效率。
链接: https://arxiv.org/abs/2605.02005
作者: Qi Sun,Ziyang Li,Yinzhi Cao,Yaxing Yao
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Privacy regulations such as the CCPA and GDPR grant individuals rights over their personal data, yet it remains challenging for most users to exercise them in practice due to vague policy interpretation and unapproachable settings on web interfaces. We introduce Privy, an LLM-powered browser assistant that guides users through exercising their privacy rights on websites. Privy automatically analyzes a website’s privacy policy and surfaces the specific rights available as action labels in a side panel. When a user selects a right, Privy provides step-by-step guidance and navigation, presenting direct links, generating email templates, or guiding form completion. Users can also request on-demand policy evidence and rights education to enhance their literacy. A technical evaluation across 14 websites shows that Privy extracts rights with high precision (0.979) and completes 96.3% of privacy tasks in an average of 3.2 steps. A user study (N=15) also demonstrates the overall high-level of perceived helpfulness among users. Our findings suggest that comprehension and usability are not two separate challenges but a single interaction problem, and that effective privacy support requires integration of policy understanding and privacy actions. We offer design suggestions for future privacy assistants.
[HC-19] LLM -Augmented Semantic Steering of Text Embedding Projection Spaces
【速读】:该论文旨在解决文本嵌入(text embeddings)低维投影在文档集合可视化分析中,其空间布局难以反映分析师意图的问题。现有语义交互方法通过几何约束或模型更新间接编码语义意图,导致解释性与灵活性受限。解决方案的关键在于提出LLM增强的语义引导(LLM-augmented semantic steering),即分析师仅需对少量示例文档进行分组,由大语言模型(Large Language Model, LLM)将此意图外化为自然语言表示,并选择性地扩展至相关文档;随后通过文本增强或嵌入级融合(embedding-level blending)将语义信息注入文档表示,无需重新训练底层模型即可实现投影空间的语义重构。该方法使投影空间成为可由显式、可解释的语言媒介驱动的意图依赖型语义工作区。
链接: https://arxiv.org/abs/2605.01957
作者: Wei Liu,Eric Krokos,Kirsten Whitley,Rebecca Faust,Chris North
机构: Virginia Tech (弗吉尼亚理工大学); Department of Defense (美国国防部); Tulane University (杜兰大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Accepted to AVI '26 (International Conference on Advanced Visual Interfaces). Author’s version. 9 pages, 4 figures
Abstract:Low-dimensional projections of text embeddings support visual analysis of document collections, but their spatial organization may not reflect the relationships an analyst intends to examine. Existing semantic interaction approaches encode semantic intent indirectly through geometric constraints or model updates, limiting interpretability and flexibility. We introduce LLM-augmented semantic steering, which enables analysts to express semantic intent by grouping a small set of example documents within the projection. A large language model externalizes this intent as natural-language representations and selectively extends it to related documents; the resulting semantic information is then incorporated into document representations via text augmentation or embedding-level blending, without retraining the underlying models. A case study illustrates how the same corpus can be reorganized from different semantic perspectives, while simulation-based evaluation shows that semantic steering improves global and local alignment with target semantic structures using only minimal interaction. Embedding-level blending further enables continuous and controllable steering of projection layouts. These results position projection spaces as intent-dependent semantic workspaces that can be reshaped through explicit, interpretable, language-mediated interaction.
[HC-20] Phone2Act: A Low-Cost Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection
【速读】:该论文旨在解决Vision-Language-Action (VLA)模型训练中高质量操作数据收集成本高昂的问题,尤其是现有遥操作框架依赖专用硬件或与特定机器人平台强耦合,限制了研究群体的可及性。其解决方案的关键在于提出Phone2Act——一个低成本、硬件无关的遥操作框架,通过Google ARCore将普通智能手机转化为6自由度(6-DoF)机器人控制器,并基于模块化的ROS 2架构实现控制逻辑与硬件细节的解耦,支持从工业协作机器人到低成本双臂机器人的多种平台,无需代码修改;同时集成通用录制器(Universal Recorder),同步多摄像头RGB流与机器人状态反馈并原生导出LeRobot数据集格式,避免后期处理,使VLA模型可立即进行微调。
链接: https://arxiv.org/abs/2605.01948
作者: Om Mandhane,Bipin Yadav,Sangeetha Prasanna Ram,Gopalakrishnan Narayanan
机构: Vivekanand Education Society’s Institute of Technology (VESIT)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 6 pages, 5 figures
Abstract:Collecting diverse, high-quality manipulation data for Vision-Language-Action (VLA) model training remains prohibitively expensive for many research groups, as existing teleoperation frameworks rely on specialized hardware or are tightly coupled to specific robot platforms. We present Phone2Act, a low-cost, hardware-agnostic teleoperation framework that transforms a commodity smartphone into a 6-DoF robot controller via Google ARCore. Built on a modular ROS 2 architecture, Phone2Act decouples control logic from hardware specifics through interchangeable bridge nodes, supporting platforms from industrial cobots to low-cost bimanual arms without code modification. A Universal Recorder synchronizes multi-camera RGB streams with robot state feedback and exports demonstrations natively in the LeRobot dataset format, eliminating post-processing and enabling immediate VLA fine-tuning. We validate the framework by fine-tuning GR00T-N1.5 on 130 collected episodes, achieving a 90% success rate on a real-world multi-stage pick-and-place task deployed on a physical Dobot CR5.
[HC-21] Less Interaction But More Explanation: A Communication Perspective on Agent ic AI Interfaces
【速读】:该论文试图解决的问题是:随着生成式 AI (Generative AI) 向更具代理性的方向发展(即 agentic AI),用户与系统之间的交互模式如何变化,以及这种变化对用户信任和人类控制权的影响。传统 AI 系统主要作为响应性工具,而 agentic AI 则主动执行任务流程,这导致用户不再需要频繁的例行互动,但需更多关于监督和解释的沟通。解决方案的关键在于,agentic AI 必须引入三种类型的解释机制——行动-过程解释(action-process)、不确定性解释(uncertainty)和协调解释(coordination),并通过可定制的呈现机制让用户自主决定何时以及以何种方式接收这些解释,从而在提升 AI 自主性的同时维持人类的主体性(human agency)。
链接: https://arxiv.org/abs/2605.01610
作者: Eunchae Jang,S. Shyam Sundar
机构: Pennsylvania State University (宾夕法尼亚州立大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:AI systems have long been expected to interact with users, answering questions, generating content, and continuing (social) conversations. Agentic AI, however, breaks from this expectation, as its primary objective is workflow execution on behalf of the users. If a system becomes more agentic, do users need less interaction with the system? Our answer is: less routine back-and-forth, but more communication for oversight and explanation, as agentic AI proactively acts, not just responds. Grounded in a communication perspective, we discuss how users perceive the communicative roles of AI systems (whether as the source of actions or merely a channel), and how this can shape trust. Because agentic AI can play multiple communicative roles, it can complicate this source perception and introduce potential risks. To address this, we propose three types of explanations that agentic AI needs to incorporate (action-process, uncertainty, and coordination), and suggest that customization affordances that allow users to decide when and which explanations they see may be key to preserving human agency as AI autonomy increases.
[HC-22] FlowBook: Enforcing Reproducibility in Computational Notebooks
【速读】:该论文旨在解决计算笔记本(computational notebooks)在执行过程中因细胞(cell)非顺序执行而导致的可复现性失败问题。传统方法依赖于精确的依赖分析或反应式数据流模型,但面临表达能力、精度与性能之间的权衡。其解决方案的关键在于提出一种无需精确依赖追踪即可强制可复现性的新思路:只要从空存储状态按自顶向下顺序执行所有细胞,能产生当前记录的全部输出,则该笔记本即为可复现的。作者基于此定义设计了FlowBook系统,通过动态分析跟踪每个细胞边界上的读写集合(read and write sets),识别过期细胞并阻止破坏可复现性的操作,从而在保持高效率的同时实现可靠验证(中位延迟仅70毫秒)。
链接: https://arxiv.org/abs/2605.01560
作者: Stephen N. Freund,Emery D. Berger,Cormac Flanagan,Eunice Jun
机构: Williams College (威廉姆斯学院); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Amazon Web Services (亚马逊网络服务); University of California, Santa Cruz (加州大学圣克鲁兹分校); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Programming Languages (cs.PL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Computational notebooks are notoriously prone to reproducibility failures. By permitting out-of-order cell execution, notebooks accumulate hidden state and implicit dependencies that cause interactive executions to silently diverge from clean top-to-bottom runs. Prior approaches either employ dependency analyses or enforce reactive dataflow models that face fundamental tradeoffs among expressiveness, precision, and performance. This paper exploits the insight that reproducibility can be enforced without precise dependency tracking: a notebook is reproducible if and only if executing its cells in top-to-bottom order from an empty store produces exactly the outputs currently recorded. We formalize this notion of reproducibility and present FlowBook, which implements a dynamic analysis that enforces reproducibility by tracking read and write sets at cell boundaries. FlowBook detects stale cells whose recorded outputs may no longer reflect the current notebook state and prevents operations that would violate reproducibility. FlowBook incurs near-imperceptible latency overhead (median: 70 ms).
[HC-23] Automated Interpretability and Feature Discovery in Language Models with Agents
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)内部机制难以解释的问题,特别是如何自动化地发现和解释模型中的关键内部特征(internal features),以提升可解释性(mechanistic interpretability)的深度与可信度。其解决方案的关键在于提出一个自主的多智能体框架,通过两个耦合的循环实现:一是“解释精炼”循环,由智能体生成竞争性假设并通过针对性提示控制和多指标评估进行迭代验证;二是“特征发现”循环,智能体构建激活空间中的k近邻图并基于统计可分性和语义一致性标准筛选候选特征。该方法显著优于一次性自动解释,并能识别出语言特异性及安全相关的特征,同时生成可审计的解释路径,从而实现更清晰、更具可证伪性的模型解释。
链接: https://arxiv.org/abs/2605.01555
作者: Arnau Marin-Llobet,Javier Ferrando
机构: Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, where an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and MLP neurons in weight-sparse transformers, our agent improves over one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces, showing that agent-driven empirical loops yield sharper and more falsifiable explanations than one-shot labels.
[HC-24] Artificial intelligence language technologies in multilingual healthcare: Grand challenges ahead
【速读】:该论文旨在解决生成式 AI 语言技术(Generative AI Language Technologies, AILTs)在多语种医疗场景中应用时存在的临床安全性与公平性问题,尤其是在翻译、文档生成、口译和沟通等任务中,尽管输出流畅但可能隐藏错误、降低可追溯性并导致责任分配不清。解决方案的关键在于推动“以人为本的 AI 语言技术”(Human-Centered AI Language Technology, HCAILT)范式,强调需结合更可靠的模型开发、可问责的社会技术设计、精准的人类监督机制,并加强机器翻译/自然语言处理(MT/NLP)、翻译研究、人机交互(HCI)、临床实践、实施科学及政策制定等跨学科协作,以实现安全、公平且可持续的部署。
链接: https://arxiv.org/abs/2605.01441
作者: Vicent Briva-Iglesias
机构: Dublin City University (都柏林城市大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:AI language technologies (AILTs), increasingly enabled by large language models (LLMs), are becoming embedded in multilingual healthcare workflows for translation, rewriting, documentation, interpreting, and messaging in language-discordant settings. Yet fluent output is not the same as clinically safe or equitable communication: performance varies across languages, accents, tasks, and workflows, and efficiency gains can hide errors, reduce traceability, and shift responsibility across clinicians, translators, interpreters, and health systems. This narrative review synthesises recent peer-reviewed evidence across written communication, spoken communication, and emerging agentic workflows. Using the Human-Centered AI Language Technology (HCAILT) lens, it examines capabilities, evaluation practices, implementation patterns, and recurrent errors through reliability, safety culture, and trustworthiness. We identify key convergences and contradictions in the literature and propose seven grand challenges for the next phase of research and deployment. Progress, we argue, requires not only better models but also accountable sociotechnical design, calibrated human oversight, and stronger collaboration across MT/NLP, translation studies, HCI, clinical practice, implementation science, and policy.
[HC-25] AI Expert Twin: Capturing Expert Cognition for Human-Centred Practice-Based Learning
【速读】:该论文旨在解决专家实践中隐性知识(tacit knowledge)难以捕获、形式化与规模化的问题,尤其是在基于实践领域的教育场景中,现有AI驱动的教育系统虽能实现个性化学习、学习者建模和自我调节学习等功能,却较少建模支撑专家实践的核心隐性推理与情境敏感判断。其解决方案的关键在于提出“AI专家孪生”(AI Expert Twin)框架,该框架以认知为中心,将专家知识结构化为三层次表示:程序性动作、语义概念与决策过程,并引入价值导向偏好、权衡关系及不确定性对专家判断的影响机制,从而构建可计算、可迁移的专家认知模型,为AI赋能的教育系统提供可集成的知识基础。
链接: https://arxiv.org/abs/2605.01401
作者: Annie Yuan,Xiaohua Chen,Kalina Yacef,Judy Kay
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures
Abstract:Tacit knowledge embedded in expert practice remains difficult to capture, formalise, and scale. While AI-driven educational systems have advanced personalisation, learner modelling, affective support, and self-regulated learning, they less often model the tacit reasoning and context-sensitive judgement that underpin expert practice in practice-based domains. This paper introduces the AI Expert Twin, a cognition-centric framework that models expert knowledge as structured, computable representations of procedural actions, semantic concepts, and decision processes. The framework also considers how value-laden preferences, trade-offs, and uncertainty shape expert judgement in practice. We formalise expert cognition as a three-layer representation and capture knowledge from experts under this model, laying the groundwork for integration into AI-powered educational system. A case study in a cultural heritage workshop demonstrates the feasibility of the approach in a real-world setting. The framework is designed to be transferable across domains such as vocational education and creative industries. By embedding expert heuristics into AI while maintaining transparency and learner agency, the AI Expert Twin offers a novel path towards scalable, practice-based learning and invites further research on ethical, human-centred applications of AI in education.
[HC-26] What Does a Meow Mean? In Search of Intuitively Understandable Communication by a Nonverbal Companion Robot
【速读】:该论文旨在解决独居老年人在日常生活中面临的挑战,如缺乏陪伴、活动不足及记忆障碍等问题,探索通过具备有限辅助功能的猫形机器人提供非语言沟通信号以增强其生活质量。解决方案的关键在于设计并验证一套结合听觉(猫叫声)与视觉(小显示屏上的图标)的非语言通信信号系统,并采用混合方法学和用户中心设计原则进行评估,结果显示:当视听信号同时存在时,老年用户对机器人意图的理解准确率显著提高;而仅依赖听觉信号时效果不佳,除非机器人表达强烈情感状态(如被抚摸时发出呼噜声),此时听觉信号反而能提升理解准确性。
链接: https://arxiv.org/abs/2605.01251
作者: Vivienne Bihe Chi,Claudia B. Rébola,Bertram F. Malle
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: To appear in the Proceedings of the 18th International Conference on Social Robotics (ICSR 2026)
Abstract:Older adults living alone have a number of challenges, and robots can help with some of them–by providing reminders, initiating activity, or offering comfort. As part of developing a cat robot with limited assistive functions, we designed a set of nonverbal communication signals, both auditory (cat sounds) and visual (icons on a small display). To evaluate these signals we used a mixed-methods, user-centered approach. After a pilot study, a focus group with older adults suggested revisions to the initial signal set. A large-sample online experiment then tested whether adults over the age of 65 could accurately infer the robot’s communicative intentions. When both visual and auditory signals were present, accuracy was high. When visual signals were absent, accuracy often decreased; when auditory signals were absent, accuracy sometimes increased. So the auditory signals were less helpful, except when the robot conveyed strong sentiments (e.g., purring while being petted).
[HC-27] he Garden of Forking Paths: Narrative Arc-Conditioned Gameplay Planning
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的程序化游戏生成方法中缺乏对叙事原型(narrative archetypes,如英雄之旅、三幕结构等)显式利用的问题,从而导致生成的游戏叙事缺乏结构一致性与文化普适性。其解决方案的关键在于提出Forking Garden框架,该框架通过两个核心步骤实现:首先生成多样化的独立节点池,每个节点在玩法元素上实现多模态对齐;随后借助基于叙事弧线约束的算法将这些节点组装成一个 dungeon 图结构,确保整个游戏流程符合用户提供的故事主线,从而实现叙事弧线条件下的游戏规划与交互式生成。
链接: https://arxiv.org/abs/2605.01245
作者: Yunge Wen,Chenliang Huang,Hangyu Zhou,Zhuo Zeng,Chun Ming Louis Po,Julian Togelius,Timothy Merino,Sam Earle
机构: New York University(纽约大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Narrative archetypes (e.g., Hero’s Journey, Three-act structure) provide universal story structures that resonate across cultures and media and are important for video game storytelling, yet existing LLM-based methods lack explicit use of these archetypes in procedurally generated games. We propose Forking Garden, a framework for narrative arc-conditioned gameplay planning that generates branching games from user-provided storylines. Our approach first generates a diverse pool of independent nodes, then assembles them into a dungeon graph via arc-guided constraint algorithms, where each node achieves multimodal alignment of gameplay elements. We develop an end-to-end interactive system that instantiates the framework.
[HC-28] EduGage: Methods and Dataset for Sensor-Based Momentary Assessment of Engagement in Self-Guided Video Learning
【速读】:该论文旨在解决在线视频学习环境中学习者参与度(engagement)难以精确测量的问题,尤其是在缺乏外部干预的情况下如何通过传感器数据实现对参与度的细粒度估计。其解决方案的关键在于利用可穿戴设备和摄像头采集多种生理与行为信号(如PPG、ECG、EDA、EEG、IMU、心率、体温及眼动追踪等),构建多模态融合模型来预测学习者的实时参与状态。研究通过16名参与者在视频学习场景下的重复自我报告验证了模型的有效性,在跨被试交叉验证中达到平均绝对误差(MAE)为0.81、83.75%的within-1准确率,并显著优于仅依赖统计特征、深度时序模型或大语言模型(LLM)的基线方法。结果表明,轻量级的行为与生理信号组合比全模态传感更适用于实际系统部署。
链接: https://arxiv.org/abs/2605.01238
作者: Zikang Leng,Edan Eyal,Yingtian Shi,Jiaman He,Yaqi Liu,Thomas Plötz
机构: Georgia Institute of Technology (佐治亚理工学院); RMIT University (皇家墨尔本理工大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Engagement, which links to attentional, emotional, and cognitive dimensions, plays an important role in learning. In online and video-based learning environments, learners often need to regulate their own interactions with instructional materials. Measuring and reflecting on engagement can therefore support both learners and adaptive learning systems. In this study, we use wearable and camera-based sensing devices to collect physiological and motion signals, including PPG, ECG, EDA, EEG, IMU, heart rate, temperature, and eye-tracking data, to estimate learner engagement. We conducted a user study with 16 participants in a video-based learning scenario, where participants completed learning tasks and provided repeated in-situ self-reports of engagement through brief probes. We develop and evaluate a system for engagement estimation, compare different sensing modalities, and further analyze the feasibility and effectiveness of multimodal modeling for characterizing learner engagement. Across participant-based cross-validation, our model achieves an MAE of 0.81, 83.75% within-1 accuracy, 73.93% binary accuracy, and 68.45% binary Macro-F1, outperforming sensor-free, statistical, deep temporal, foundation-model, and LLM-based baselines. Our results suggest that fine-grained engagement estimation is feasible but inherently noisy, and that practical systems should prioritize lightweight combinations of behavioral and physiological signals over full multimodal instrumentation. We release the EduGage dataset, including synchronized multimodal sensor signals, probe-aligned momentary engagement labels, video metadata, quizzes, and study materials, to support reproducible research on fine-grained sensor-based engagement modeling in self-guided learning.
[HC-29] Ideological discrepancy between publishers and news content is linked with audience engagement and consensus on Facebook
【速读】:该论文旨在解决在线政治互动中共识与极化现象的形成机制问题,特别是探讨意识形态差异(ideological discrepancy)如何影响用户参与行为及其情绪反应、毒性言论和话题分布等维度。其解决方案的关键在于通过分析巴西总统选举期间Facebook上政治新闻帖子的五维互动特征——包括发布者与内容间的意识形态差异、情感倾向、受众共识度、言论毒性及内容主题——揭示出意识形态差异与互动模式之间存在非线性关系:极端不一致或高度一致均导致共识下降,而极端不一致显著提升言论毒性;同时发现,在高度党派化的发布者中,高毒性言论反而与更高受众共识相关,表明敌对话语可能在强意识形态群体内强化内部认同。
链接: https://arxiv.org/abs/2605.01180
作者: Thiago Magrin,Jordan Kobellarz,Pedro O.S. Vaz-de-Melo,Thiago H. Silva
机构: 未知
类目: ocial and Information Networks (cs.SI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Political news on social media rarely circulates in isolation: audiences actively engage, react, and clash. Whether these interactions reflect agreement or conflict may depend on the ideological discrepancy between publishers and the news content they share. This study investigates this relationship using Facebook posts linking to political news during a Brazilian presidential election. We analyze five dimensions of engagement: ideological discrepancy between publishers and content, emotional responses, audience consensus, toxicity in posts, and content topics. Our results show that ideological discrepancy is associated with differences in engagement, exhibiting a nonlinear pattern: consensus declines under conditions of very high ideological mismatch and, in our data, also under very high alignment, while toxicity increases primarily under extreme mismatch. A statistical model indicates that emotional valence, toxicity, and ideological discrepancy are the factors most strongly associated with consensus. Among highly partisan publishers, higher toxicity is associated with increased audience consensus, suggesting that hostile discourse may co-occur with in-group agreement in strongly ideological contexts. Overall, these findings highlight how ideological discrepancy, emotional reactions, and interaction dynamics are associated with consensus and polarization in online political engagement.
[HC-30] oward a Unified Framework for Collaborative Design of Human-AI Interaction
【速读】:该论文旨在解决当前人机交互(Human Computer Interaction, HCI)中因多模态接口日益依赖人工智能(Artificial Intelligence, AI)进行用户意图解读,而用户却难以理解AI决策过程所导致的信任缺失与控制权丧失问题。现有方法将多模态对齐、可解释性和人类自主性视为独立议题,造成透明度不足和用户监督缺位。其解决方案的关键在于提出一个“人-AI协作框架”(Human Artificial Intelligence Collaboration Framework),将三者作为相互依存的设计要求:1)多模态对齐以确保意图识别的准确性;2)以交互为中心的可解释性提供实时视觉、文本和音频反馈;3)保留人类自主性的机制使用户可在任意时刻接受、拒绝或修改AI建议。该框架通过两个典型场景(协同设计与扩展现实仓库机器人协作)验证了其有效性,强调协作应被视为持续交互属性,从而在AI日益主动化的趋势下保障用户的理解力与控制权始终作为核心设计原则。
链接: https://arxiv.org/abs/2605.01153
作者: Ankur Bhatt,Sven Mayer
机构: TU Dortmund University (多特蒙德工业大学); Research Center Trustworthy Data Science and Security (可信数据科学与安全研究中心)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Human computer interaction is shifting from screen-based systems to multimodal interfaces where artificial intelligence powered systems increasingly interpret user intent through speech, gesture, and gaze. Yet users rarely understand how these interpretations are made, compromising trust and control. Existing approaches treat multimodal alignment, explainability, and human agency as separate concerns, leaving critical gaps in transparency and user oversight. We propose a Human Artificial Intelligence collaboration framework integrating these three principles as interdependent design requirements: 1) multimodal alignment for accurate intent interpretation, 2) interaction centric explainability delivering real time visual, textual, and audio feedback, and 3) agency preserving mechanisms enabling users to accept, reject, or modify artificial intelligence suggestions at any time. We presented the framework through two scenarios, collaborative design and extended reality warehouse robot collaboration, chosen to span differences in time pressure and error reversibility, with the latter situated in a domain where misinterpretation carries documented safety consequences. This approach reframes collaboration as a continuous interaction property, benefiting designers, researchers, and end users by ensuring that as artificial intelligence systems grow more proactive, user understanding and control remain first class design properties.
[HC-31] What Makes an AI Writing Companion a Good Fit? A Personality-Informed Co-Design Study
【速读】:该论文试图解决的问题是:如何基于用户个性特征设计更有效的AI写作助手(AI writing assistant),以提升人机协作中的参与度和协同效率。解决方案的关键在于通过探索性共同设计工作坊,识别不同人格类型写作者的核心需求与偏好,并据此开发出反映不同写作倾向的对比原型,从而引导用户反思其写作实践与AI适配性,最终实现AI伴侣与个体认知及人际需求的精准匹配,增强人-AI团队协作的感知有效性。
链接: https://arxiv.org/abs/2605.01108
作者: Mengke Wu,Kexin Quan,Weizi Liu,Mike Yao,Jessie Chin
机构: University of Illinois Urbana-Champaign School of Information Sciences; University of Illinois Urbana-Champaign Institute of Communications Research; Texas Christian University Bob Schieffer College of Communication
类目: Human-Computer Interaction (cs.HC)
备注: 21 pages, 11 figures. arXiv admin note: substantial text overlap with arXiv:2509.11115
Abstract:The growing popularity of AI writing assistants creates exciting opportunities to support diverse writers. This study examines how personality shapes expectations for AI writing companions and how personality-informed design can enhance human-AI teaming in writing. Through exploratory co-design workshops with 24 writers representing different personality profiles, we elicited values and design ideas for AI writing companions spanning functionality, interaction dynamics, and visual representation. These insights informed two contrasting prototypes reflecting distinct writing orientations, used as design provocations in review-and-refinement workshops with eight participants to prompt reflection on fit, priorities, and writing practices. Our findings reveal both shared foundational needs across writers and meaningful personality-driven preferences that influence how writers engage with AI. This work underscores the importance of team matching in human-AI collaboration and demonstrates how aligning AI companions with individual cognitive and interpersonal needs can improve engagement and perceived collaboration effectiveness.
[HC-32] RECAP: An End-to-End Platform for Capturing Replaying and Analyzing AI-Assisted Programming Interactions
【速读】:该论文旨在解决如何全面理解开发者与AI编程助手(AI coding assistant)交互行为的问题,现有研究仅依赖聊天记录或Git历史等单一数据源,难以还原完整的交互上下文,如提示词(prompt)与代码修改之间的映射关系、被尝试后舍弃的策略以及开发策略随时间的演变过程。解决方案的关键在于提出RECAP(Replay and Examine Captured AI Programming)平台,其核心能力包括:(1)在不干扰开发者工作流的前提下,被动记录VS Code中的AI对话会话和细粒度代码编辑;(2)将这些异构数据融合为统一的时间线以支持交互式会话回放;(3)提供可扩展的分析层,例如行为分类和AI依赖度测量模块。通过在软件工程课程中部署,RECAP成功捕获了41名学生在多周项目中的2,034条提示和8,239次代码修改,验证了其通过关联数据与回放功能实现开发者-AI交互模式深度分析的能力。
链接: https://arxiv.org/abs/2605.01104
作者: Keyu He,Qianou Ma,Valerie Chen,Wayne Chi,Tongshuang Wu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Understanding how developers interact with AI coding assistants requires more than chat logs or git histories in isolation; it requires reconstructing the full context: which prompt led to which edit, what the developer tried and discarded, and how their strategy evolved over time. We present RECAP (Replay and Examine Captured AI Programming), an open-source platform that (1) passively records AI chat sessions and fine-grained code edits inside VS Code without disrupting the developer’s workflow, (2) merges them into a unified timeline for interactive session replay, and (3) exposes an extensible analysis layer, with example modules for behavioral classification and AI reliance measurement. Deployed in a university software engineering course, RECAP captured 2,034 prompts and 8,239 code edits from 41 students across a multi-week project. We demonstrate how the platform’s linked data and replay capabilities enable analyses of developer-AI interaction patterns that no single data source could support.
[HC-33] Non-Markovian Dynamical Systems Modeling of Electroencephalogram-based Brain Activity for Anticipating the Cognitive Fatigue Level
【速读】:该论文旨在解决高风险环境中认知疲劳(cognitive fatigue)导致的性能退化问题,现有黑箱评估技术因忽略大脑非马尔可夫性(non-Markovian)和时变相互依赖特性,难以实现实时相变检测。其解决方案的关键在于提出一种基于分数阶动力学网络的机器学习框架(fractional dynamical networks-based machine learning, FDNML),通过耦合分数阶微分方程刻画脑信号间的复杂依赖关系,并利用多分形特征提取不同疲劳状态下的广义分形维数差异(Wasserstein距离分别为0.10、0.13和0.08),从而实现93.33%分类准确率与95% AUROC的实时神经状态过渡检测,有效预防性能下降。
链接: https://arxiv.org/abs/2605.01043
作者: Zeinabsadat Saghi,Daria Riabukhina,Olubukola Akinbami,Paul Bogdan,Souti Chattopadhyay
机构: University of Southern California(南加州大学); University of Michigan(密歇根大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Cognitive fatigue, which transitions from focused attention to inexact responses, can cause catastrophic failures in high-stakes environments, yet current black-box assessment techniques ignore the brain’s non-Markovian and time-varying interdependent properties, limiting real-time phase transition detection. We develop a fractional dynamical networks-based machine learning (FDNML) framework using coupled fractional-order differential equations to capture brain signal interdependencies and detect cognitive fatigue transitions in real-time. Multifractal properties of brain activity exhibit distinct generalized fractal dimension signatures across fatigue levels, with Wasserstein distances of 0.10, 0.13, and 0.08 between states 0-1, 1-2, and 0-2, respectively. The framework achieves 93.33% classification accuracy and 95% AUROC, enabling the prevention of performance degradation through early detection of neural state transitions.
[HC-34] mporal Out-of-Distribution Detection for Asynchronous Motor Imagery Brain-Computer Interfaces
【速读】:该论文旨在解决异步脑-机接口(BCI)中因连续脑电(EEG)流包含非目标状态(如静息态或分布外的运动想象任务,即OOD MI)而导致传统封闭集分类器误判的问题。解决方案的关键在于提出一个两级检测框架:第一阶段采用基于EEGNet的“休息/任务门控”机制,通过滑动窗口持续监测EEG信号以识别当前是否处于任务状态;第二阶段仅对确认的任务状态样本进行已知类别(ID)运动想象(MI)分类与分布外(OOD)检测。为提升OOD识别性能,进一步引入TempDens方法,融合分类输出能量、深度特征密度和时序一致性得分,从输出、特征和时序动态三个维度量化分布偏移,从而实现更鲁棒的在线控制决策。
链接: https://arxiv.org/abs/2605.01014
作者: Chenhao Liu,Siyang Li,Luofei Tan,Dongrui Wu
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Real online brain–computer interfaces operate on continuous electroencephalography (EEG) streams, where users are usually at rest and enter motor-imagery task states only intermittently. EEG windows may also arise from OOD MI activity outside the predefined control set. Conventional closed-set motor-imagery classifiers tend to assign such inputs to ID classes, which can cause erroneous control. To address this issue, this paper proposes a two-stage EEG detection framework for asynchronous motor-imagery brain–computer interfaces. A sliding-window mechanism continuously monitors EEG signals. The first stage uses an EEGNet-based rest/task gate to determine whether the current window should enter the control-decision process. The second stage performs ID MI classification and out-of-distribution detection only for task-state samples. To improve OOD rejection, we further propose TempDens, which combines classification-output energy, deep-feature density, and temporal-consistency scores to characterize distributional deviation from output, feature, and temporal-dynamic perspectives. Experimental results show that the proposed method effectively supports task-state detection and OOD MI recognition in continuous EEG streams, outperforming multiple conventional OOD baselines. This study reframes online motor-imagery control as a hierarchical decision problem involving continuous monitoring, state discrimination, ID classification, and OOD rejection.
[HC-35] AnimationDiff: A Visual Comparison Tool for Generated 3D Character Animations
【速读】:该论文旨在解决生成式AI(Generative AI)在创建3D角色动画时带来的新挑战——即如何高效地比较大量生成的动画变体以选择最优方案。由于动画存在时间上的错位(temporal misalignment)以及高维度的空间数据,传统方法难以进行直观、准确的对比。解决方案的关键在于提出AnimationDiff这一可视化比较工具:它通过结合成熟的动画可视化技术,支持在目标场景和摄像机视角下进行上下文感知的比较,并提供叠加与并排两种视图切换机制;同时引入Temporal Lenses(时间透镜)来全局展示动画随时间演变的过程,辅助对齐与比较;此外还集成过滤功能以应对信息过载问题,从而显著提升动画比较的效率与准确性。
链接: https://arxiv.org/abs/2605.01001
作者: Ludwig Sidenmark,Qian Zhou,George Fitzmaurice,Fraser Anderson
机构: University of Toronto (多伦多大学); Autodesk Research (Autodesk 研究院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to ACM DIS 2026
Abstract:Creating 3D character animations traditionally requires significant time and effort from the animator. Advancements in generative methods now enable easy creation of multiple character animation variations for use or further editing. However, this capability introduces a new challenge in comparing character animations to select the best animation, which is challenging due to temporal misalignment and the large amount of spatial data. We present AnimationDiff, a visual comparison tool for generated character animations. AnimationDiff enables contextual comparisons in the intended scene and camera angle, and embedding of spatial information by combining established animation visualization techniques and easy switching between overlaid and side-by-side comparisons. AnimationDiff also supports filtering to handle information overload, and Temporal Lenses that visualize entire animations over time for overview, alignment, and comparison. We evaluated AnimationDiff in a user study, showcasing its efficacy in animation comparison and providing design insights for comparing motion.
[HC-36] 1BT: One-Block Transformer for EEG-Based Cognitive Workload Assessment
【速读】:该论文旨在解决在资源受限环境下实现高精度、连续性认知负荷(Cognitive Workload)评估的难题,尤其针对传统架构在表征能力与计算效率之间难以平衡的问题。其解决方案的关键在于提出一种名为1BT(One-Block Transformer)的紧凑型神经网络结构,该模型通过单一交叉注意力模块结合轻量级自注意力机制,以极小的潜在瓶颈(latent bottleneck)聚合多通道脑电图(EEG)时序数据,在仅需不到0.5百万参数和0.02 GFLOPs的情况下实现了高性能的认知负荷分类,从而为实时脑机交互系统提供了可行的轻量化设计路径。
链接: https://arxiv.org/abs/2605.00856
作者: Stefanos Gkikas,Christian Arzate Cruz,Thomas Kassiotis,Giorgos Giannakakis,Raul Fernandez Rojas,Randy Gomez
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Accurate and continuous estimation of cognitive workload is fundamental to creating adaptive human-machine systems. However, designing architectures that balance representational capacity with computational efficiency has been challenging for practical deployment. This paper introduces 1BT, a One-Block Transformer for compact and efficient EEG-based cognitive workload assessment. The model aggregates multi-channel temporal sequences via a minimal latent bottleneck, using a single cross-attention module followed by lightweight self-attention. A controlled study involving 11 participants performing three cognitively diverse tasks (abstract reasoning, numerical problem-solving, and an interactive video game) was conducted with continuous EEG recordings across two workload levels. Systematic architectural analysis identifies the most compact configuration that preserves high performance, while substantially lowering computational cost. The final model achieves high workload classification performance with under 0.5 million parameters and 0.02 GFLOPs, paving the way for a design direction for real-time cognitive workload monitoring in resource-constrained settings.
计算机视觉
[CV-0] Laplacian Frequency Interaction Network for Rural Thematic Road Extraction
【速读】:该论文旨在解决农业机械运动轨迹图像中农村主题道路网络构建的问题,核心挑战在于现有下采样方法易模糊稀疏的高频道路结构,且密集田间作业产生的噪声常导致提取的道路拓扑碎片化或冗余。解决方案的关键是提出LFINet(Laplacian Frequency Interaction Network),其创新性地引入拉普拉斯多尺度分离器(Laplacian Multi-scale Separator, LMS)将图像解耦为低频语义上下文与高频结构细节;通过交叉频率交互模块(Cross-Frequency Interaction Block, CFIB)中的双路径架构——高频块(High-Frequency Block, HFB)细化局部结构、空间变换器(Spatial Transformer, ST)捕获全局语义——并利用频率门控调制机制(Frequency Gated Modulation, FGM)以语义上下文校准结构细节,最终由渐进式重建解码器融合多尺度特征确保拓扑一致性。
链接: https://arxiv.org/abs/2605.02866
作者: Baiyan Chen,Weixin Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rural thematic road network construction aims to extract topological road structures from movement trajectory images of agricultural machinery. However, this task faces challenges where downsampling methods commonly used in existing studies tend to blur the sparse high-frequency road structures, and the heavy noise from dense field operations often leads to fragmented or redundant topologies in the extracted networks. To address these challenges, we propose LFINet, a Laplacian Frequency Interaction Network. The network begins with a Laplacian Multi-scale Separator (LMS) to decouple the image into low-frequency semantic contexts and high-frequency structural details. These components are then processed by the Cross-Frequency Interaction Block (CFIB) through a dual-pathway architecture in which a High-Frequency Block (HFB) refines local structures while a Spatial Transformer (ST) captures global semantics. Subsequently, a Frequency Gated Modulation (FGM) mechanism integrates the features from pathways by leveraging semantic contexts to calibrate the structural details. Finally, a Progressive Reconstruction Decoder iteratively fuses multi-scale features to ensure topological consistency. Experiments conducted on a real-world agricultural trajectories dataset from Henan Province, China, show that LFINet establishes a new state-of-the-art. Specifically, it achieves an F1-score of 92.54% and an IoU of 86.12%, surpassing the second-ranked method by 0.64% and 1.1%, respectively. This confirms its capability to effectively construct topological road networks from noisy and sparse field data.
[CV-1] Pixel Perfect: Relational Image Quality Assessment with Spatially-Aware Distortions
【速读】:该论文旨在解决传统图像质量评估(Image Quality Assessment, IQA)方法依赖主观平均意见分数(Mean Opinion Score, MOS)所带来的资源消耗大、缺乏可解释性及局部失真定位能力弱的问题。其解决方案的关键在于从绝对质量预测转向关系化与方向性的评估范式:首先利用自监督合成失真引擎生成训练数据,避免人工标注;随后通过对抗对称目标训练失真预测网络,输出空间感知且解耦的失真类型、强度和方向图;最后基于对比学习在有序图像集上训练评分网络,预测相对质量分数,从而实现对图像处理算法的精细化优化。
链接: https://arxiv.org/abs/2605.02863
作者: Fadeel Sher Khan,Long N. Le,Abhinau K. Venkataramanan,Seok-Jun Lee,Hamid R. Sheikh
机构: Samsung Research America(三星研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional image quality assessment (IQA) methods rely on mean opinion scores (MOS), which are resource-intensive to collect and fail to provide interpretable, localized feedback on specific image distortions. We overcome these limitations by shifting from absolute quality prediction to a relational and directional assessment. Our approach utilizes a self-supervised synthetic distortion engine to generate training data, eliminating the need for manual annotation. A distortion prediction network is trained with an anti-symmetric objective to produce spatially-aware, disentangled maps that identify the type, intensity, and direction of distortions relative to a reference image. Subsequently, a scoring network is trained via contrastive learning on ordinally ranked image sets to predict a relational quality score. Our method provides a more granular and interpretable approach to IQA for the targeted optimization of image processing algorithms without requiring any human-labeled quality scores.
[CV-2] Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
【速读】:该论文旨在解决超低码率条件下视频压缩中生成式模型难以有效控制生成过程的问题,即如何在极低比特率下利用生成式先验(generative prior)实现高质量的感知重建。其解决方案的关键在于提出了一种基于扩散模型的视频压缩框架ActDiff-VC,通过将视频划分为可变长度片段,在必要时仅传输关键帧,并使用紧凑的点轨迹集合对时间动态进行建模;随后,条件扩散解码器基于这些稀疏条件信号合成中间帧,从而在严重码率约束下实现感知上真实的重建效果。该方法的核心创新包括内容自适应的关键帧选择机制和预算感知的稀疏轨迹选择机制,共同实现了紧凑且高效的生成式条件控制。
链接: https://arxiv.org/abs/2605.02849
作者: Amirhosein Javadi,Shirin Saeedi Bidokhti,Tara Javidi
机构: University of California San Diego (加州大学圣地亚哥分校); University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 11 figures, 3 tables
Abstract:Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE, improves KID by up to 64.6% and FID by up to 37.7% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate–distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.
[CV-3] VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视频理解任务中对领域特定动作识别能力被忽视的问题。随着大规模通用数据集的普及,VLMs 的评估逐渐偏离了动作识别这一视频理解的核心任务,导致其在细粒度、专业场景下的动作识别性能未得到充分挖掘。解决方案的关键在于:首先构建一个涵盖37个领域共1000种独特动作的基准数据集 VideoNet,用于系统性评估VLMs在领域特定动作识别上的表现;其次通过多层级评估设置(从多项选择到少样本提示)揭示模型对上下文示例利用不足的问题;最终提出并构建首个大规模领域特定动作训练数据集(近50万条视频问答对),通过对 Molmo2-4B 模型进行微调,显著提升其在 VideoNet 上的表现,超越所有开源8B参数规模的模型。
链接: https://arxiv.org/abs/2605.02834
作者: Tanush Yadav,Mohammadreza Salehi,Jae Sung Park,Vivek Ramanujan,Hannaneh Hajishirzi,Yejin Choi,Ali Farhadi,Rohun Tripathi,Ranjay Krishna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: project website at this https URL
Abstract:Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide k\in\1,2,3\ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves +7.0% , while Gemini declines -4.8% . Notably, these gains fall short of the +13.6% improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
[CV-4] IConFace: Identity-Structure Asymmetric Conditioning for Unified Reference-Aware Face Restoration
【速读】:该论文旨在解决严重退化条件下人脸复原(blind face restoration)中存在的身份信息丢失问题,即在输入图像缺乏关键身份特征时,如何有效恢复高质量且身份一致的人脸。其解决方案的关键在于提出一种统一的参考感知与无参考框架 IConFace,采用身份-结构不对称条件控制机制:将参考图像提炼为归一化加权的全局 AdaFace 身份锚点(identity anchor),用于图像级调制;同时利用低秩残差和双路记忆的块级退化交叉注意力强化退化图像的空间结构锚点(structure anchor)。该设计使得模型在有参考时能精准融合身份信息,在无参考时仍可独立完成高质量复原,从而在身份一致性、细节恢复和纯退化图像复原质量上实现统一提升。
链接: https://arxiv.org/abs/2605.02814
作者: Axi Niu,Jinyang Zhang,Senyan Qing
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Blind face restoration is highly ill-posed under severe degradation, where identity-critical details may be missing from the degraded input. Same-identity references reduce this ambiguity, but mismatched pose, expression, illumination, age, makeup, or local facial states can lead to overuse of reference appearance. We propose \textbfIConFace, a unified reference-aware and no-reference framework with identity–structure asymmetric conditioning. References are distilled into a norm-weighted global AdaFace identity anchor for image-only modulation, while the degraded image is reinforced as the spatial structure anchor through low-rank residuals and block-wise degraded cross-attention with two-route memory. The resulting single checkpoint exploits references when available and falls back to no-reference restoration when absent, improving identity consistency, fine-detail recovery, and degraded-only restoration quality in a unified model.
[CV-5] Edge-Efficient Image Restoration: Transformer Distillation into State-Space Models
【速读】:该论文旨在解决图像修复任务中模型推理效率与恢复质量之间的权衡问题,特别是在边缘设备上部署时因Transformer结构自注意力机制带来的高延迟问题。其核心解决方案是提出一种模块化混合框架,通过轻量级状态空间模型(State-Space Model, SSM)块作为特征蒸馏后的Transformer块替代品,构建具有高效性的U-Net类架构,并引入高效的多目标网络搜索策略(Efficient Network Search, ENS),自动发现兼顾任务性能和运行时效率的最优混合结构配置,从而在不依赖硬件反复测试的情况下实现低延迟、高质量的图像修复。
链接: https://arxiv.org/abs/2605.02794
作者: Srinivas Soumitri Miriyala,Sowmya Vajrala,Sravanth Kodavanti,Vikram Nelvoy Rajendiran,Sharan Kumar Allur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:We propose a modular framework for hybrid image restoration that integrates transformer and state-space model (SSM) blocks with a focus on improving runtime efficiency on edge hardware. While transformers provide strong global modeling through self-attention, their attention kernels incur substantial latency on mobile devices, especially for high-resolution inputs. In contrast, SSMs such as Mamba offer lineartime sequence modeling with lower runtime overhead but may underperform on fine grained restoration tasks. To balance accuracy and efficiency, we train lightweight SSM blocks as feature-distilled surrogates of transformer blocks and use them to construct hybrid U-Net-style architectures. To automatically discover effective block combinations, we introduce Efficient Network Search (ENS), a multi-objective search strategy that selects task-specific hybrid configurations from pre-aligned components. ENS optimizes restoration quality while penalizing transformer usage, serving as a lightweight proxy for latency and enabling architecture discovery without repeated hardware profiling. On a Snapdragon 8 Elite CPU, the Restormer baseline requires 10119.52 ms for inference. In contrast, ENS-discovered hybrids significantly reduce runtime: ENS-Deblurring runs in 2973 ms (3.4x faster), ENS-Deraining in 5816 ms (1.74x faster), and ENS-Denoising in 8666 ms (1.17x faster), while maintaining competitive restoration quality.
[CV-6] HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar
【速读】:该论文旨在解决从视频中准确恢复人体姿态与外观的问题,现有方法在三维几何重建上存在不足:基于ViT的方法可靠性不一致且易过拟合二维视角,而基于NeRF或高斯泼溅(Gaussian Splatting)的Avatar模型将姿态与外观分离,限制了新视角和新姿态下的渲染泛化能力。解决方案的关键在于提出HumanSplatHMR,一个联合优化框架,通过闭合几何姿态估计与可微渲染之间的反馈回路,在优化3D人体姿态的同时学习高保真的人体Avatar,从而实现新颖视角和新颖姿态下的高质量合成。该方法不依赖于运动捕捉系统或离线精调,仅使用先进的姿态估计算法输出的初始人体网格,并通过反向传播图像级损失(光度、分割和深度损失)至姿态参数与全局位置,逐步提升姿态精度与渲染质量。
链接: https://arxiv.org/abs/2605.02784
作者: Yeheng Zong,Pou-Chun Kung,Yike Pan,Seth Isaacson,Yizhou Chen,Ram Vasudevan,Katherine A. Skinner
机构: University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.
[CV-7] Linearizing Vision Transformer with Test-Time Training ICML2026
【速读】:该论文旨在解决线性复杂度注意力机制(linear-complexity attention mechanisms)在从头训练时计算成本过高,以及无法有效继承预训练Transformer模型权重的问题。其核心挑战在于Softmax注意力与线性注意力之间存在本质的表征差异(representational gap),导致权重迁移困难。解决方案的关键在于两个层面的对齐:一是架构对齐,识别出Test-Time Training(TTT)这一具有两层动态结构的线性复杂度架构,其形式上与Softmax注意力一致,从而可直接继承预训练权重;二是表征对齐,引入键实例归一化(key instance normalization)和轻量级局部性增强模块,以匹配关键的表征属性如键平移不变性(key shift-invariance)和局部性(locality)。通过该方法,作者将Stable Diffusion 3.5线性化为SD3.5-T⁵,在仅用4块H20 GPU进行1小时微调后,即可达到与Fine-tuned Softmax模型相当的文本到图像质量,同时推理速度提升1.32×至1.47×(在1K和2K分辨率下)。
链接: https://arxiv.org/abs/2605.02772
作者: Yining Li,Dongchen Han,Zeyu Liu,Hanyi Wang,Yulin Wang,Gao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T ^5 (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4 \times H20 GPUs, SD3.5-T ^5 achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32 \times and 1.47 \times at 1K and 2K resolutions.
[CV-8] OC-SR: Task-Optimal Compact diffusion for Image Super Resolution
【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像超分辨率(Super-Resolution, SR)任务中因模型规模庞大和迭代采样过程复杂而导致计算开销过高、难以实际部署的问题。解决方案的关键在于:首先通过特征级生成蒸馏(feature-wise generative distillation)构建参数高效的替代模块,并利用ε约束贝叶斯优化(epsilon-constrained Bayesian Optimization)进行架构搜索,在最小化模型复杂度的同时保持生成保真度,从而获得一个紧凑的扩散骨干网络;随后将该骨干网络适配至超分辨率任务,并将其扩散过程蒸馏为单步生成器,最终实现高效且高质量的图像超分辨率重建。
链接: https://arxiv.org/abs/2605.02767
作者: Sowmya Vajrala,Akshay Bankar,Manjunath Arveti,Shreyas Pandith,Sravanth Kodavanti,Subhajit Sanyal,Amit Unde,Srinivas Soumitri Miriyala
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models have recently demonstrated strong performance for image restoration tasks, including super-resolution. However, their large model size and iterative sampling procedures make them computationally expensive for practical deployment. In this work, we present TOC-SR, a framework for building efficient one-step super-resolution models by first discovering a compact diffusion backbone. Starting from a sixteen-channel latent diffusion model, we construct parameter-efficient surrogate blocks using feature-wise generative distillation and perform architecture discovery using epsilon-constrained Bayesian Optimization to minimize model complexity while preserving generative fidelity. The resulting compact diffusion backbone achieves a 6.6x reduction in parameters and a 2.8x reduction in GMACs compared to the expanded diffusion model. We then adapt this backbone for super-resolution and distill the diffusion process into a single-step generator. Experiments demonstrate that the proposed approach enables efficient super-resolution while maintaining strong reconstruction quality.
[CV-9] FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation
【速读】:该论文旨在解决轻量化语义分割模型在计算资源受限条件下难以有效识别和增强困难区域(如细长结构和物体边界)的问题。解决方案的关键在于提出一种基于区域聚焦的推理机制:通过一个可学习的重要度图(importance map)与Top-K激活机制,选择性地强化具有信息量的难分区域;同时利用多尺度卷积分支实现不同感受野下的空间上下文聚合,从而在不依赖复杂全局建模的前提下提升模型对挑战性区域的一致性和准确性。
链接: https://arxiv.org/abs/2605.02764
作者: Hsin-Jui Pan,Sheng-Wei Chan,Meng-Qian Li,Chun-Po Shen
机构: Taiwan Kongku University (台湾高雄大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, 2 tables. Efficient semantic segmentation under resource-constrained settings. Code will be released
Abstract:We present FoR-Net, a lightweight architecture for semantic segmentation that focuses on identifying and enhancing hard regions. Instead of relying on heavy global modeling, FoR-Net adopts an efficient strategy that selectively emphasizes informative regions through a learned importance map and a Top-K activation mechanism. Specifically, a selector module predicts region-wise importance, enabling the model to focus on challenging areas such as thin structures and object boundaries. Multi-scale reasoning is achieved using convolutional branches with different receptive fields, allowing diverse spatial context aggregation. We evaluate FoR-Net on the Cityscapes benchmark under limited computational resources. Despite its lightweight design and standard training configuration, FoR-Net achieves competitive performance and demonstrates improved consistency in challenging regions. These results suggest that region-focused reasoning provides a simple yet effective inductive bias for efficient semantic segmentation.
[CV-10] Unified Map Prior Encoder for Mapping and Planning ICRA2026
【速读】:该论文旨在解决自动驾驶中在线地图构建与端到端(End-to-End, E2E)规划任务长期依赖传感器数据、忽视多源异构地图先验(包括高精/标准矢量地图、栅格化标准地图及卫星影像)的问题,这些先验因存在异质性、位姿漂移和测试时可用性不一致而难以有效利用。解决方案的核心是提出统一的地图先验编码器(Unified Map Prior Encoder, UMPE),其关键创新在于:1)设计双分支结构,分别对矢量地图进行帧级SE(2)校正与多频正弦特征编码,并通过置信度加权的交叉注意力融合至BEV特征;2)对栅格地图采用FiLM条件化的ResNet-18骨干网络,实现微对齐与零初始化残差融合,确保模型从“无害基线”出发仅学习有益先验信息;3)采用矢量→栅格的融合顺序以体现几何优先、外观次之的归纳偏置,从而在nuScenes和Argoverse2数据集上显著提升地图感知精度(mAP提升达+5.9至+4.1)并降低端到端规划轨迹误差(L2误差减少0.30 m)与碰撞率(下降0.10%),且具备幂集鲁棒性——即使训练时使用全部先验,在测试阶段仅提供单一先验仍能优于单先验模型。
链接: https://arxiv.org/abs/2605.02762
作者: Zongzheng Zhang,Sizhe Zou,Guantian Zheng,Zhenxin Zhu,Yu Gao,Guoxuan Chi,Shuo Wang,Yuwen Heng,Zhigang Sun,Yiru Wang,Hao Sun,Chao Ma,Zhen Li,Anqing Jiang,Hao Zhao
机构: Institute for AI Industry Research (AIR), Tsinghua University; Bosch Corporate Research, China; Shanghai Jiao Tong University; Chinese University of Hong Kong, Shenzhen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accpeted by ICRA 2026
Abstract:Online mapping and end-to-end (E2E) planning in autonomous driving remain largely sensor-centric, leaving rich map priors, including HD/SD vector maps, rasterized SD maps, and satellite imagery, underused because of heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches. The vector encoder pre-aligns HD/SD polylines with a frame-wise SE(2) correction, encodes points via multi-frequency sinusoidal features, and produces polyline tokens with confidence scores. BEV queries then apply cross-attention with confidence bias, followed by normalized channel-wise gating to avoid length imbalance and softly down-weight uncertain sources. The raster encoder shares a ResNet-18 backbone conditioned by FiLM with scaling and shift at every stage, performs SE(2) micro-alignment, and injects priors through zero-initialized residual fusion, so the network starts from a do-no-harm baseline and learns to add only useful prior evidence. A vector-then-raster fusion order reflects the inductive bias of geometry first, appearance second. On nuScenes mapping, UMPE lifts MapTRv2 from 61.5 to 67.4 mAP (+5.9) and MapQR from 66.4 to 71.7 mAP (+5.3). On Argoverse2, UMPE adds +4.1 mAP over strong baselines. UMPE is compositional: when trained with all priors, it outperforms single-prior models even when only one prior is available at test time, demonstrating powerset robustness. For E2E planning with the VAD backbone on nuScenes, UMPE reduces trajectory error from 0.72 to 0.42 m L2 on average (-0.30 m) and collision rate from 0.22% to 0.12% (-0.10%), surpassing recent prior-injection methods. These results show that a unified, alignment-aware treatment of heterogeneous map priors yields better mapping and better planning.
[CV-11] DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation KR
【速读】:该论文旨在解决传统同时定位与建图(SLAM)算法在动态环境中应用受限的问题,特别是当环境中存在行人等移动实体时,因假设环境静态而导致的定位与地图构建失效。其解决方案的关键在于提出一种紧耦合的动态图SLAM架构——DynoSLAM,该架构将社会感知的图神经网络(GNN)直接嵌入因子图优化中,通过训练好的GNN进行蒙特卡洛滚动预测,以捕捉人类交互的多模态认知不确定性,并将其以动态马氏距离因子的形式融入SLAM图中,从而实现对行人未来状态的概率建模与安全避障规划。
链接: https://arxiv.org/abs/2605.02759
作者: Danil Tokhchukov,Veronika Morozova,Gonzalo Ferrer
机构: Skolkovo Institute of Science and Technology (斯科尔科沃科学技术研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Code Project page at this https URL
Abstract:Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model. By utilizing Monte Carlo rollouts from a trained GNN, we capture the multimodal epistemic uncertainty of human interactions and embed it into the SLAM graph via a dynamic Mahalanobis distance factor. We demonstrate through extensive simulated experiments that this stochastic formulation not only maintains highly accurate retrospective tracking but also prevents the optimization failures caused by the deterministic “argmax problem”. Ultimately, extracting the empirical mean and covariance matrices of future pedestrian states provides a mathematically rigorous, probabilistic safety envelope for downstream local planners, enabling anticipatory and collision-free robot navigation in densely crowded environments.
[CV-12] Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation ICML2026
【速读】:该论文旨在解决模拟数据在视觉语言动作(Vision-Language-Action, VLA)模型训练中因视觉域差距(visual domain gap)和环境多样性不足而导致的真实世界泛化能力弱的问题。其核心解决方案是提出一种高效的视频增强框架,通过结构化条件提取(如视频语义分割和视频描述生成)、环境多样化重写文本以及条件视频迁移模型,将模拟VLA视频转化为真实感强且保留任务语义与动作轨迹的训练视频。关键创新在于引入扩散特征复用机制以加速生成过程,并采用coreset采样策略在计算资源受限下选取非冗余子集进行高效增强,从而显著提升模型在多个基准(如Robotwin 2.0、LIBERO系列及真实机器人平台)上的性能表现。
链接: https://arxiv.org/abs/2605.02757
作者: Chenyu Hui,Xiaodi Huang,Siyu Xu,Yunke Wang,Shan You,Fei Wang,Tao Huang,Chang Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICML 2026
Abstract:Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements. For example, our method improves RDT-1B by 8% on Robotwin 2.0, and boosts \pi_0 by 5.1% on the more challenging LIBERO-Plus benchmark. Code is available at: this https URL.
[CV-13] Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
【速读】:该论文旨在解决当前开放世界文本引导的类别无关计数(Class-Agnostic Counting, CAC)模型在实际应用中存在语义理解与视觉表征不一致的问题,即模型难以准确根据自然语言提示(textual prompt)定位应计数的目标对象类别,导致虚假计数响应和可靠性下降。其解决方案的关键在于提出两个核心贡献:一是构建PrACo++(Prompt-Aware Counting++)测试套件,包含负标签测试(negative-label test)和干扰物测试(distractor test)两种新评估协议及专用指标,用于系统性评估模型对提示语义的理解能力;二是发布MUCCA(MUlti-Category Class-Agnostic counting)数据集,该数据集包含每张图像中多个标注类别的真实场景图像,突破了以往CAC基准仅支持单类图像的局限,从而更贴近现实复杂场景。实验表明,尽管现有模型在标准计数指标上表现优异,但在语义对齐层面存在显著缺陷,验证了新框架的有效性和必要性。
链接: https://arxiv.org/abs/2605.02752
作者: Giacomo Pacini,Luca Ciampi,Nicola Messina,Nicola Tonellotto,Giuseppe Amato,Fabrizio Falchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL
Abstract:Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations. This limitation leads to spurious counting responses and reduced reliability in real-world scenarios. To systematically address these limitations, we propose a new evaluation framework focused on model robustness and trustworthiness. Our contribution is two-fold: (i) we introduce PrACo++ (Prompt-Aware Counting++), a novel test suite featuring two dedicated evaluation protocols – the negative-label test and the distractor test – paired with new specialized metrics; and (ii) we present the MUCCA (MUlti-Category Class-Agnostic counting) evaluation dataset, a new collection of real-world images featuring multiple annotated object categories per scene, unlike existing CAC benchmarks that typically include a single category per image. Our extensive experimental evaluation of 10 state-of-the-art methods shows that, despite strong performance under standard counting metrics, current models exhibit significant weaknesses in understanding and grounding object class descriptions. Finally, we provide a quantitative analysis of how semantic similarity between prompts influences these failures. Overall, our results underscore the need for more semantically grounded architectures and offer a reliable framework for future assessment in open-world text-guided CAC methods.
[CV-14] Virtual Scanning for NSCLC Histology: Investigating the Discriminatory Power of Synthetic PET
【速读】:该论文旨在解决非小细胞肺癌(NSCLC)中腺癌(ADC)与鳞状细胞癌(SCC)组织学亚型精准区分的问题,尤其是在物理正电子发射断层扫描(PET)不可用或受限的临床场景下,如何利用合成代谢特征增强影像分类性能。其解决方案的关键在于提出一种基于3D Pix2Pix生成对抗网络(GAN)的“虚拟扫描”策略,通过预训练模型从解剖结构CT图像中合成伪PET(pseudo-PET)体积,并将其与原始CT数据在MINT多阶段中间融合框架中进行融合,从而引入互补的代谢信息以提升深度学习模型对肿瘤组织类型的判别能力。
链接: https://arxiv.org/abs/2605.02746
作者: Fatih Aksu,Laura Ciuffetti,Francesco Di Feola,Filippo Ruffini,Giulia Romoli,Fabrizia Gelardi,Arturo Chiti,Valerio Guarrasi,Paolo Soda
机构: Università Campus Bio-Medico di Roma(罗马生命与健康大学); Umeå University(于默奥大学); Università Vita-Salute San Raffaele(圣拉斐尔生命科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate histological differentiation between adenocarcinoma (ADC) and squamous cell carcinoma (SCC) is critical for personalized treatment in non-small cell lung cancer (NSCLC). While [ ^18 F]FDG PET/CT is a standard tool for the clinical evaluation of lung cancer, its utility is often limited by high costs and radiation exposure. In this paper, we investigate the feasibility of “virtual scanning” as a feature-enhancement strategy by evaluating whether synthetic PET data can provide complementary feature representations to supplement anatomical CT scans in histological subtype classification. We propose a framework that leverages a 3D Pix2Pix Generative Adversarial Network (GAN), pretrained on the FDG-PET/CT Lesions dataset, to synthesize pseudo-PET volumes from anatomical CT scans. These synthetic volumes are integrated with structural CT data within the MINT framework, a multi-stage intermediate fusion architecture. Our experiments, conducted on a multi-center dataset of 714 subjects, demonstrate that the inclusion of synthetic metabolic features significantly improves classification performance over a CT-only baseline. The multimodal approach achieved a statistically significant increase in the Area Under the Curve (AUC) from 0.489 to 0.591 and improved the Geometric Mean (GMean) from 0.305 to 0.524. These results suggest that synthetic PET scans provide discriminatory metabolic cues that enable deep learning models to exploit complementary cross-modal information, offering a potential feature-enhancement strategy for clinical scenarios where physical PET scans are unavailable. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.02746 [cs.CV] (or arXiv:2605.02746v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.02746 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-15] SIAM: Head and Brain MRI Segmentation from Few High-Quality Templates via Synthetic Training
【速读】:该论文旨在解决现有生成式训练方法在脑部磁共振成像(MRI)分割中依赖大量自动标注模板所引入系统性偏差、灵活性不足的问题,尤其难以扩展至新解剖结构的自动化分割。其解决方案的关键在于提出Segment It All Model (SIAM),通过将领域随机化(domain randomization)同时扩展至强度域和形状域:一方面利用合成图像生成实现对比度多样性,另一方面借助高分辨率空间变换建模皮层厚度与深部核团形态的解剖差异;此外,SIAM首次实现了脑内及脑外组织(包括脑脊液、血管、硬膜、颅骨和皮肤)的联合分割,从而支持无需预处理的全自动分析流程。
链接: https://arxiv.org/abs/2605.02737
作者: Romain Valabregue,Ines Khemir,Eric Badinet,François Rousseau,Guillaume Auzias,Reuben Dorent
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Synthetic training has recently advanced brain MRI segmentation by enabling contrast-agnostic models trained entirely on generated data. However, most existing approaches rely on hundreds of automatically labeled templates, introducing systematic biases and limiting their flexibility to incorporate new anatomical structures. We present the Segment It All Model (SIAM), a 3D whole-head segmentation framework for 16 anatomical structures, trained using only six high-quality, manually annotated templates. SIAM extends domain randomization to both intensity and shape domains: synthetic image generation ensures contrast variability, while high-resolution spatial transformations model anatomical differences in cortical thickness and deep nuclei morphology. Unlike prior synthetic models, SIAM simultaneously segments brain as well as extra-cerebral tissues, including cerebrospinal fluid, vessels, dura mater, skull, and skin, enabling fully automated, preprocessing-free analysis. Evaluation across eight heterogeneous datasets (N=301), that include multiple contrasts (T1-weighted, T2-weighted, CT) and span a wide range of ages, demonstrates that SIAM matches or outperforms state-of-the-art methods for brain structures, in addition to extending automated segmentation to non-brain structures. The model also exhibits superior consistency across contrasts and repeated acquisitions, together with improved sensitivity to subtle gray matter atrophy. We openly release the model and the label templates at this https URL.
[CV-16] Perceptual Flow Network for Visually Grounded Reasoning ICML2026
【速读】:该论文旨在解决大型视觉语言模型(Large-Vision Language Models, LVLMs)在训练过程中因采用通用优化目标(如标准最大似然估计,MLE)而导致的视觉轨迹约束不足问题,进而引发的语言偏置(language bias)和幻觉(hallucination)现象。现有方法通常引入来自视觉专家的几何先验作为额外监督信号以缓解此问题,但此类监督往往过于强调几何精度而缺乏推理实用性。论文提出感知流网络(Perceptual Flow Network, PFlowNet),其核心创新在于将感知与推理解耦,构建自条件生成机制,并通过变分强化学习融合多维奖励与邻域几何塑形(vicinal geometric shaping),从而引导以推理为导向的感知行为,同时保持视觉可靠性。该方案不仅具备可证明的性能保障,还在V* Bench和MME-RealWorld-lite等基准上取得了新的SOTA结果。
链接: https://arxiv.org/abs/2605.02730
作者: Yangfu Li,Yuning Gong,Hongjian Zhan,Teng Li,Yuanhuiyi Lyu,Tianyi Chen,Qi Liu,Ziyuan Huang,Zhihang Zhong,Dandan Zheng,Yue Lu
机构: ECNU(华东师范大学); SCU(四川大学); HKUST(香港科技大学); SJTU(上海交通大学); Ant Group(蚂蚁集团); Shanghai AI Laboratory(上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 36 pages with 17 figures, Accepted at ICML 2026
Abstract:Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
[CV-17] OphMAE: Bridging Volumetric and Planar Imaging with a Foundation Model for Adaptive Ophthalmological Diagnosis
【速读】:该论文旨在解决当前眼科人工智能(AI)模型在临床实践中存在的两大关键问题:一是现有模型多局限于单模态推理,难以模拟医生通过融合多种影像模态进行综合诊断的临床实际;二是高精度AI模型在资源有限环境中部署受限,常因缺乏先进的三维成像硬件而无法应用。解决方案的核心在于提出一种名为OphMAE的多模态基础模型,其创新性地整合了三维光学相干断层扫描(OCT)的深度信息与二维横断面OCT的平面上下文,并采用新颖的跨模态融合架构和自适应推理机制,在包含183,875对配对OCT图像的大规模数据集上预训练,最终在17项眼科诊断任务中展现出卓越性能,且具备强鲁棒性和低样本依赖性,显著提升了眼科AI在多样化场景下的可扩展性与实用性。
链接: https://arxiv.org/abs/2605.02714
作者: Tienyu Chang,Zhen Chen,Renjie Liang,Jinyu Ding,Jie Xu,Sunu Mathew,Amir Reza Hajrasouliha,Andrew J. Saykin,Ruogu Fang,Yu Huang,Jiang Bian,Qingyu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages, 10 figures, 1 table
Abstract:The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware. Here, we present the Ophthalmic multimodal Masked Autoencoder (OphMAE), a multi-imaging foundation model engineered to synergize the volumetric depth of 3D Optical Coherence Tomography (OCT) with the planar context of 2D en face OCT. By implementing a novel cross-modal fusion architecture and a unique adaptive inference mechanism, OphMAE was pre-trained on a massive dataset with of 183,875 paired OCT images derived from 32,765 patients. In a rigorous benchmark encompassing 17 diverse diagnostic tasks with 48,340 paired OCT images from 8,191 patients, the model demonstrated state-of-the-art performance, achieving an Area Under the Curve (AUC) of 96.9% for Age-related Macular Degeneration (AMD) and 97.2% for Diabetic Macular Edema (DME), consistently surpassing existing single-modal and multimodal foundation models. Crucially, OphMAE exhibits robust engineering adaptability: it maintains high diagnostic accuracy, such as 93.7% AUC for AMD, even when restricted to single-modality 2D inputs, and demonstrates exceptional data efficiency by retaining 95.7% AUC with as few as 500 labeled samples. This work establishes a scalable and adaptable framework for ophthalmic AI, ensuring robust performance across different tasks.
[CV-18] mporally Consistent Object 6D Pose Estimation for Robot Control
【速读】:该论文旨在解决单视角RGB图像下物体位姿估计方法在机器人视觉反馈控制中缺乏时序一致性和鲁棒性的问题。现有商用方法虽具备较高的精度和效率,但在动态环境中易受噪声和异常值影响,难以满足稳定闭环控制的需求。其解决方案的关键在于提出一种基于因子图(factor graph)的在线优化框架,通过引入物体运动模型、显式估计位姿测量不确定性,并将二者融合于实时优化过程中,从而实现时序一致性约束与误差平滑,显著提升了位姿估计的稳定性与准确性,实验证明该方法在标准基准测试和基于力矩控制机械臂的跟踪任务中均表现出优越性能。
链接: https://arxiv.org/abs/2605.02708
作者: Kateryna Zorina,Vojtech Priban,Mederic Fourmy,Josef Sivic,Vladimir Petrik
机构: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague (捷克技术大学布拉格信息学、机器人学与控制论研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Single-view RGB object pose estimators have reached a level of precision and efficiency that makes them good candidates for vision-based robot control. However, off-the-shelf methods lack temporal consistency and robustness that are mandatory for a stable feedback control. In this work, we develop a factor graph approach to enforce temporal consistency of the object pose estimates. In particular, the proposed approach: (i) incorporates object motion models, (ii) explicitly estimates the object pose measurement uncertainty, and (iii) integrates the above two components in an online optimization-based estimator. We demonstrate that with appropriate outlier rejection and smoothing using the proposed factor graph approach, we can significantly improve the results on standardized pose estimation benchmarks. We experimentally validate the stability of the proposed approach for a feedback-based robot control task in which the object is tracked by the camera attached to a torque controlled manipulator.
[CV-19] SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT
【速读】:该论文旨在解决深度学习(Deep Learning, DL)在光学相干断层扫描(Optical Coherence Tomography, OCT)图像分析中“黑箱”特性带来的临床信任与监管审批难题,特别是现有事后可解释人工智能(Explainable AI, XAI)方法在细粒度病灶结构划分、解剖边界尊重及噪声抑制方面的不足。解决方案的关键在于提出一种结构感知可解释学习(Structure-Aware Interpretable Learning, SAIL)框架,该框架在表示层面融合视网膜解剖先验信息,并通过特征融合设计将其与语义特征耦合,从而在不修改标准后处理XAI方法的前提下,生成更清晰且符合解剖结构的归因图,显著提升解释的临床意义和可信度。
链接: https://arxiv.org/abs/2605.02707
作者: Tienyu Chang,Tianhao Li,Ruogu Fang,Jiang Bian,Yu Huang
机构: Indiana University (印第安纳大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Florida (佛罗里达大学); Indiana University School of Medicine (印第安纳大学医学院); Regenstreif Institute (雷根斯特里夫研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures, 9 tables
Abstract:Optical coherence tomography (OCT), a commonly used retinal imaging modality, plays a central role in retinal disease diagnosis by providing high-resolution visualization of retinal layers. While deep learning (DL) has achieved expert-level accuracy in OCT-based retinal disease detection, its “black box” nature poses challenges for clinical adoption, where explainability is essential for clinical trust and regulatory approval. Existing post-hoc explainable AI (XAI) methods often struggle to delineate fine-grained lesion structures, respect anatomical boundaries, or suppress noise, limiting the trustworthiness of their explanations. To bridge these gaps, we propose a Structure-Aware Interpretable Learning (SAIL) framework that integrates retinal anatomical priors at the representation level and couples them with semantic features via a fusion design. Without modifying standard post-hoc explainability methods, this representation yields sharper and more anatomically aligned attribution maps. Comprehensive experiments on diverse OCT datasets demonstrate that our structure-aware method consistently enhances interpretability, producing clinically meaningful and anatomy-aware explanations. Ablation studies further show that strong interpretability requires both structural priors and semantic features, and that properly fusing the two is critical to achieve the best explanation quality. Together, these results highlight structure-aware representations as a key step toward reliable explainability in OCT.
[CV-20] Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions
【速读】:该论文旨在解决机器人操作中基于数据驱动方法学习物体动力学模型时存在的效率低和物理可行性差的问题,尤其是在处理柔性物体(如绳子、布料和充气玩具)时更为显著。其关键解决方案在于提出了一种名为PIEGraph的新方法,该方法通过结合解析物理模型与数据驱动的图神经网络来实现高精度的动力学预测:一方面利用弹簧-质量系统构建物理信息感知的粒子模型以确保运动的物理合理性;另一方面设计了一个具有新颖动作表示的等变图神经网络,以利用粒子间相互作用的对称性来引导解析模型参数更新。这种协同机制使得PIEGraph仅需少量真实交互数据即可实现对刚体和柔体对象的有效动力学建模,并在仿真和机器人硬件上验证了其在重定向和重新定位任务中的优越性能。
链接: https://arxiv.org/abs/2605.02699
作者: Sergio Orozco,Tushar Kusnur,Brandon May,George Konidaris,Laura Herlant
机构: Brown University (布朗大学); Robotics and AI (RAI) Institute (机器人与人工智能研究所); General Motors (通用汽车)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 8 figures
Abstract:Learning data-efficient object dynamics models for robotic manipulation remains challenging, especially for deformable objects. A popular approach is to model objects as sets of 3D particles and learn their motion using graph neural networks. In practice, this is not enough to maintain physical feasibility over long horizons and may require large amounts of interaction data to learn. We introduce PIEGraph, a novel approach to combining analytical physics and data-driven models to capture object dynamics for both rigid and deformable bodies using limited real-world interaction data. PIEGraph consists of two components: (1) a \textbfPhysically \textbfInformed particle-based analytical model (implemented as a spring–mass system) to enforce physically feasible motion, and (2) an \textbfEquivariant \textbfGraph Neural Network with a novel action representation that exploits symmetries in particle interactions to guide the analytical model. We evaluate PIEGraph in simulation and on robot hardware for reorientation and repositioning tasks with ropes, cloth, stuffed animals and rigid objects. We show that our method enables accurate dynamics prediction and reliable downstream robotic manipulation planning, which outperforms state of the art baselines.
[CV-21] AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
【速读】:该论文旨在解决现有单目深度估计方法在透明、镜面及一般非朗伯(non-Lambertian)表面场景中因传感器误差导致的深度预测不准确问题,以及大尺度单目深度模型虽具备强结构先验但缺乏真实度量尺度(metric scale)限制其在机器人操作中的直接应用问题。解决方案的关键在于提出一种无需训练的深度锚定(depth grounding)框架,通过因子图优化将来自深度基础模型(depth foundation model)的单目深度预测与原始传感器深度进行逐块仿射对齐(patch-wise affine alignment),从而在保持精细几何结构和边界连续性的前提下,实现局部度量空间下的深度校准,显著提升深度精度且无需重新训练。
链接: https://arxiv.org/abs/2605.02667
作者: Simon Dorer,Martin Büchner,Nick Heppert,Abhinav Valada
机构: University of Freiburg (弗莱堡大学); Zuse School ELIZA (祖塞学校 ELIZA); Carl Zeiss Foundation (卡尔·蔡司基金会); Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) (康拉德·祖塞卓越学习与智能系统学校 (ELIZA)); DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence (DAAD 康拉德·祖塞人工智能卓越学校项目); Federal Ministry of Education and Research (联邦教育与研究部)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 Figures, 3 Tables
Abstract:Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be potentially skewed or mis-scaled in metric units, limiting their direct use in robotics. Thus, in this work, we propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization. Our method performs a patch-wise affine alignment, locally grounding monocular predictions in metric real-world depth while preserving fine-grained geometric structure and discontinuities. To facilitate evaluation in challenging real-world conditions, we introduce a benchmark dataset with dense scene-wide ground truth depth in the presence of non-Lambertian objects. Ground truth is obtained via matte reflection spray and multi-camera fusion, overcoming the reliance on object-only CAD-based annotations used in prior datasets. Extensive evaluations across diverse sensors and domains demonstrate consistent improvements in depth performance without any (re-)training. We make our implementation publicly available at this https URL.
[CV-22] Human Activity Recognition Method for Moderate Violence Detection
【速读】:该论文旨在解决公共空间中轻微身体暴力(如推搡)的实时检测问题,此类行为常为更严重暴力事件的前兆。解决方案的关键在于融合先进的计算机视觉技术:利用YOLO11进行人体检测、YOLO11-Pose提取骨骼关键点,并通过计算躯干倾斜角度与肩髋关节夹角,构建随机森林分类器以区分正常行为与攻击性肢体接触。该方法在不同难度场景下均表现出良好性能,尤其在复杂现实监控视角下仍保持较高精度(0.72),验证了基于骨骼分析实现早期暴力干预的可行性。
链接: https://arxiv.org/abs/2605.02659
作者: Luis Angel Aparicio Borjas,Victor Elias Nieto,Juan Irving Vasquez,Alfonso Fernandez-Vazquez,Gerardo Antonio Alvarez Hernandez
机构: ESCOM-IPN, Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Physical violence in public spaces is a significant public health concern, with minor incidents such as pushing often serving as precursors to more severe escalations. This research develops an automated system for the real-time detection of moderate physical violence, specifically pushing, in surveillance camera footage. The proposed solution integrates state-of-the-art computer vision models, utilizing YOLO11 and YOLO11-Pose for human detection and skeletal keypoint extraction. By calculating body inclination and joint angles between shoulders and hips, a Random Forest classifier was trained to distinguish between normal behavior and aggressive physical contact. The system’s performance was evaluated through three progressive case studies representing increasing levels of difficulty. In controlled environments with frontal views, the model achieved a precision of 0.98. In the most challenging scenario, featuring high-altitude, steep-angle recordings from real-world surveillance infrastructure, the system maintained a precision of 0.72 despite significant perspective distortion and visual noise. These results demonstrate the feasibility of using skeletal analysis for early violence intervention in urban security contexts.
[CV-23] Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
【速读】:该论文旨在解决多模态理解与生成任务中模型参数效率低、推理速度慢以及视频编辑质量难以兼顾的问题。其核心解决方案是提出Mamoda2.5,一个统一的AR-Diffusion框架,通过在扩散Transformer(Diffusion Transformer)骨干网络中引入细粒度的专家混合(Mixture-of-Experts, MoE)结构(128个专家,Top-8路由机制),实现仅激活3B参数即可达到25B参数模型的生成能力,从而显著降低训练成本并提升模型容量;同时,结合联合少步数蒸馏与强化学习(reinforcement learning)方法,将原30步视频编辑模型压缩为4步模型,使推理速度相比开源基线提升最高达95.9倍,在保持顶级生成性能的同时大幅加速实际应用部署。
链接: https://arxiv.org/abs/2605.02641
作者: Yangming Shi,Shixiang Zhu,Tao Shen,Zhimiao Yu,Dengsheng Chen,Taicai Chen,Yunfei Yang,Juan Zhou,Chen Cheng,Liang Ma,Xibin Wu,Benxuan Yan,Ge Li,Tuoyu Zhang,Dan Li,Chang Liu,Zhenbang Sun
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model’s generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including the Kling O1 on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to 95.9\times faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in internal advertising video editing scenario.
[CV-24] ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking
【速读】:该论文旨在解决跨视角指代表多目标跟踪(Cross-view Referring Multi-Object Tracking, CRMOT)任务中对昂贵的帧级空间标注和跨视角身份监督的强依赖问题,从而实现弱监督下的高效跟踪。其解决方案的关键在于:首先利用基础模型(如SAM3)生成可靠的伪标签(pseudo-labels),并通过设计一种基于亲和力引导的跨视角重提示策略来优化和关联跨相机的轨迹片段,以获得高质量的伪监督信号;其次提出ViewSAM模型,该模型基于SAM2构建,通过将视图诱导的差异建模为可学习条件,显式地融合视图感知的跨模态语义,从而在仅需约10%额外参数的情况下实现鲁棒的跨视角指代表跟踪,显著提升了弱监督场景下的性能表现。
链接: https://arxiv.org/abs/2605.02638
作者: Jiawei Ge,Xintian Zhang,Jiuxin Cao,Bo Liu,Fabian Deuser,Chang Liu,Gong Wenkang,Siyou Li,Juexi Shao,Wenqing Wu,Chen Feng,Ioannis Patras
机构: Southeast University (东南大学); Universität der Bundeswehr München (慕尼黑国防大学); Queen Mary, University of London (伦敦玛丽女王大学); Nanjing University of Science and Technology (南京理工大学); Queen’s University Belfast (贝尔法斯特女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity supervision. To reduce such reliance, we explore CRMOT under weak supervision by leveraging the capabilities of foundation models. However, our empirical study shows that directly applying foundation models such as SAM2 and SAM3, even with task-specific modifications, fails to accurately understand referring expressions and maintain consistent identities across views. Yet, they remain effective at producing reliable object tracklets that can serve as pseudo supervision. We therefore repurpose foundation models as pseudo-label generators and propose a two-stage framework for weakly supervised CRMOT, using only object category labels as coarse-grained supervision. In the first stage, we design an Affinity-guided Cross-view Re-prompting strategy to refine and associate SAM3-generated tracklets across cameras, producing reliable cross-view pseudo labels for subsequent training. In the second stage, we introduce ViewSAM, a CRMOT model built upon SAM2 that explicitly models view-aware cross-modal semantics. By formulating view-induced variations as learnable conditions, ViewSAM bridges the gap between view-variant visual observations and view-invariant textual expressions, enabling robust cross-view referring tracking with only approximately 10% additional parameters. Extensive experiments demonstrate that ViewSAM achieves SOTA performance under weak supervision and remains competitive with fully supervised methods.
[CV-25] AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
【速读】:该论文旨在解决高分辨率图形用户界面(GUI)中视觉-语言模型(VLMs)的接地(grounding)性能下降问题,即在密集布局和小交互元素场景下,由于现代显示器分辨率与模型输入约束之间的分辨率差距导致的空间定位不准确。解决方案的关键在于提出一种无需训练、基于不确定性的主动视觉搜索框架AutoFocus:其核心创新是利用坐标生成过程中token级困惑度(perplexity)来表征空间不确定性,并通过采样多个坐标假设将其轴向困惑度转化为各向异性高斯空间概率场,从而显式建模方向性不确定性;在此基础上,生成全局与局部区域建议,并引入形状感知缩放(Shape-Aware Zooming)策略,在精确定位与上下文保留之间取得平衡,最终通过基于视觉提示的聚合步骤实现结构化比较以选择最一致的预测结果。
链接: https://arxiv.org/abs/2605.02630
作者: Ruilin Yao,Shegnwu Xiong,Tianyu Zou,Shili Xiong,Yi Rong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 4 figures
Abstract:Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.
[CV-26] Rethinking Low-Light Image Enhancement: A Log-Domain Intensity–Chromaticity Decoupling Perspective
【速读】:该论文旨在解决低光照图像增强中异常通道放大和色噪声问题,这些问题会导致增强结果出现色彩失真或细节劣化。解决方案的关键在于引入基于解耦表示(decoupled representation)的显式重建约束,通过该约束有效抑制异常通道增益和色噪声,从而提升图像质量与下游任务性能。实验表明,该方法在多个基准数据集上均取得优异的定量指标(如在LOLv2-Real上达到29.71 dB PSNR和0.89 SSIM),且DarkFace实验进一步验证了其在低光条件下提升人脸检测效果的能力。
链接: https://arxiv.org/abs/2605.02627
作者: Guangrui Bai,Yifan Mei,Yahui Deng,Yuhan Chen,Yuze Qiu,Wenhai Liu,Erbao Dong
机构: University of Science and Technology of China (中国科学技术大学); Chongqing University (重庆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Knowledge-Based Systems
Abstract:Explicit reconstruction constraints derived from the decoupled representation are further imposed to suppress abnormal channel amplification and chromatic noise. Experiments on LOLv2-Real, MIT-Adobe FiveK, and LSRW show that the proposed method achieves competitive or superior quantitative and visual performance, reaching 29.71 dB PSNR and 0.89 SSIM on LOLv2-Real. DarkFace experiments further indicate improved downstream face detection under low-light conditions. Code and pretrained models are available at: this https URL.
[CV-27] Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval
【速读】:该论文旨在解决视频时刻检索(Video Moment Retrieval, VMR)中假设每个自然语言查询仅对应单一匹配时刻的局限性,这一假设在真实场景中往往不成立——查询可能对应多个时刻或无匹配时刻。为此,作者提出广义时刻检索(Generalized Moment Retrieval, GMR),一个统一的设定,要求模型能准确检索出全部相关时刻或预测空集(null-set)。解决方案的关键在于:1)构建大规模、高质量的 Soccer-GMR 基准数据集,基于挑战性的足球视频并采用灵活的半自动化标注流程结合人工验证,确保数据的真实性与多样性;2)设计一套综合评估协议,包含针对空集拒绝、正向查询定位和端到端 GMR 性能的互补指标;3)提出两种建模范式下的强基线方法:一种是轻量级插件式 GMR 适配器,用于增强判别式 VMR 模型;另一种是面向多模态大语言模型(MLLMs)微调的 GMR 专用 GRPO 奖励机制,从而系统性提升 GMR 任务表现。
链接: https://arxiv.org/abs/2605.02623
作者: Yiming Ding,Siyu Cao,Luyuan Jiao,Yixuan Li,Zitong Wang,Zhiyong Liu,Lu Zhang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Beijing University of Posts and Telecommunications (北京邮电大学); Wuhan University (武汉大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Code and dataset: this https URL . Keywords: video moment retrieval, temporal grounding, benchmark, multi-modal learning
Abstract:Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol with complementary metrics tailored for null-set rejection, positive-query localization, and end-to-end GMR performance. Finally, we establish strong baselines across two modeling paradigms: a lightweight plug-and-play GMR adapter for discriminative VMR models, and a GMR-tailored GRPO reward for fine-tuning multimodal large language models (MLLMs). Extensive experiments show consistent gains across all metrics and expose key limitations of current methods, positioning GMR as a more realistic and challenging benchmark for video-language understanding.
[CV-28] Global-Local Feature Decoding with Adapter-Guided SAMv2 for Salient Object Detection
【速读】:该论文旨在解决生成式 AI(Generative AI)时代下显著性目标检测(Salient Object Detection, SOD)任务中基础模型利用不足、训练成本高及小样本易过拟合的问题。其解决方案的关键在于提出GLASSNet框架,该框架采用冻结的SAMv2作为编码器,并引入轻量级、空间感知的卷积适配器(adapter),将可学习参数减少超过97%;同时设计双解码器架构——一个捕获全局长程语义信息以增强一致性,另一个聚焦局部细节如边缘与纹理以提升精度,最终通过融合互补特征生成兼具全局连贯性与局部精确性的显著图,从而在标准SOD和伪装目标检测基准上超越现有方法。
链接: https://arxiv.org/abs/2605.02616
作者: Morteza Moradi,Mohammad Moradi,Simone Palazzo,Ali Borji,Concetto Spampinato
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Salient Object Detection (SOD) remains an essential yet underexplored task in the era of large-scale vision models. Although foundation models like SAM exhibit strong generalization, their potential for SOD is not fully realized, and training or fully fine-tuning them is computationally expensive and prone to overfitting under limited data. To overcome these challenges, we introduce GLASSNet, a Global-Local feature decoding framework that uses SAMv2 as a frozen encoder paired with a lightweight, spatially aware convolutional adapter-reducing learnable encoder parameters by over 97%. To enhance saliency quality, GLASSNet employs a dual-decoder architecture: one decoder captures global, long-range semantics with an expanded receptive field, while the other captures fine local details such as edges and textures. Fusing these complementary cues yields saliency maps that combine global coherence with local precision, producing accurate final masks. Extensive experiments on standard SOD and camouflaged object detection benchmarks show that GLASSNet surpasses state-of-the-art methods, demonstrating the power of frozen foundation models combined with targeted adaptation and global-local decoding.
[CV-29] Validation of an AI-based end-to-end model for prostate pathology using long-term archived routine samples
【速读】:该论文旨在解决生成式 AI (Generative AI) 在前列腺病理诊断中跨样本制备和长期保存条件下泛化能力不足的问题。其解决方案的关键在于提出并验证了GleasonAI模型——一种基于注意力机制的多实例学习(Multiple Instance Learning, MIL)模型,该模型在包含10,366个活检组织核心、来自14个瑞典地区且采集时间跨度达17年(1998–2015)的独立验证队列中展现出优异的ISUP分级一致性(整体加权kappa值为0.86),且性能稳定不受时间变化影响,显著优于基础模型方法,并揭示了AI评分等级与前列腺癌特异性死亡率之间存在显著预后梯度,从而证明了该模型具备良好的临床泛化潜力与病理档案作为大规模AI开发及回顾性预后研究资源的价值。
链接: https://arxiv.org/abs/2605.02614
作者: Xiaoyi Ji,Renata Zelic,Oskar Aspegren,Nita Mulliqi,Michelangelo Fiorentino,Francesca Giunchi,Luca Molinaro,Sol Erika Boman,Lorenzo Richiardi,Andreas Pettersson,Per Henrik Vincent,Martin Eklund,Olof Akre,Kimmo Kartasalo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 47 pages, 10 figures, 9 tables
Abstract:Artificial intelligence (AI) is becoming a clinical tool for prostate pathology, but generalization across variations in sample preparation and preservation over prolonged time periods remains poorly understood. We evaluated GleasonAI, an end-to-end attention-based multiple instance learning model, on an independent validation cohort comprising 10,366 biopsy cores from 1,028 patients across 14 Swedish regions, using archival diagnostic specimens from the ProMort cohorts collected between 1998-2015. The model achieved an overall quadratic-weighted kappa of 0.86 for core-level ISUP grading, comparable to several experienced pathologists and consistent across geographic regions. Notably, performance remained stable across the 17-year collection period, demonstrating robustness to time-related variation in archival material, a property not consistently observed with foundation model-based approaches, with exploratory analysis demonstrating a significant prognostic gradient across AI-assigned grade groups for prostate cancer-specific mortality. These findings support the generalizability of the AI grading model and demonstrate the potential of pathology archives as a large-scale resource for AI development, validation, and retrospective prognostic research.
[CV-30] Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
【速读】:该论文旨在解决源域数据不可获取的无源域自适应(Source-Free Domain Adaptation, SFDA)问题,其核心挑战在于现有方法仍依赖预训练的源模型,未能真正实现“无源”特性。为此,作者提出更严格的设定——仅使用随机初始化模型、视觉-语言(Vision-Language, ViL)模型和未标注目标域数据的ViL-Only Domain Adaptation(VODA)框架。解决方案的关键在于引入两阶段去噪区域蒸馏(Two-Stage Denoised-Region Distillation, TS-DRD):第一阶段利用ViL模型引导模型预热,第二阶段挖掘ViL模型与待适应模型共有的去噪区域特征,从而提供更干净的监督信号用于知识蒸馏,显著提升了在Office-Home、VisDA和DomainNet-126等基准上的性能表现。
链接: https://arxiv.org/abs/2605.02604
作者: Zhou Bingtao,Xiang Mian,Ning Qian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have introduced Vision-Language (ViL) models to guide the adaptation process, in these methods, we observe that for the same target domain, different source models yield minimal variation in final results, indicating the source model itself has limited impact. Motivated by this, we propose ViL-Only Domain Adaptation (VODA) , a stricter setting that eliminates all dependencies on source domain, relying solely on a randomly initialized model, a ViL model, and unlabeled target data. We analyze the adaptation dynamics of VODA and introduce Two-Stage Denoised-Region Distillation (TS-DRD) , a two-stage framework that first warms up the model with ViL guidance, then seek a Denoised-Region inherent in both the ViL and adapting model, yielding cleaner supervision for distillation. Experiments on Office-Home, VisDA, and DomainNet-126 show that under VODA, TS-DRD achieves competitive or superior performance to existing SFDA methods that still use source models, demonstrating its effectiveness and the potential of the VODA setting.
[CV-31] Representation learning from OCT images
【速读】:该论文旨在解决视网膜光学相干断层扫描(Optical Coherence Tomography, OCT)图像自动化分析中面临的挑战,特别是如何通过表示学习(representation learning)减少对专家标注的依赖、提升跨设备与人群的诊断一致性,并应对大规模数据处理需求。其解决方案的关键在于系统性地梳理和分类从早期深度学习到最新基础模型(foundation models)及视觉-语言系统等各类表示学习范式,包括监督学习(CNN与Transformer架构)、自监督与半监督方法、生成式模型、3D体素建模、多模态表示学习以及大规模预训练模型;同时构建统一的问题形式化框架,明确各方法的数学基础,并指出当前研究中的局限性与未来方向,如体积基础模型预训练、不确定性感知学习、联邦与隐私保护训练、公平性与偏见缓解等,从而为该领域提供结构化发展路径与前沿洞察。
链接: https://arxiv.org/abs/2605.02589
作者: Hedi Tabia,Désiré Sidibé,Nawres Khlifa,Ahmed Tabia,Ines Rahmany,Noura Aboudi,Zainab Haddad,Hajer Khachnaoui,Hsouna Zgolli
机构: IBISC Univ. Evry Université Paris-Saclay(伊维西大学巴黎-萨克雷大学); University of Tunis El Manar(突尼斯大学); ESIEE Paris - Université Gustave Eiffel(巴黎高等电子工程学院-居斯塔夫·埃菲尔大学); FST Sidi-Bouzid - University of Kairouan(西迪布祖德理工学院-凯鲁安大学); Department A - Hedi Raies of Ophthalmology Institut(眼科Hedi Raies部门研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Optical Coherence Tomography (OCT) has become one of the most used imaging modality in ophthalmology. It provides high-resolution, non-invasive visualization of retinal microarchitecture. The automated analysis of OCT images through representation learning has emerged as a central research frontier. This has mainly been driven by the clinical need to process large acquisition volumes. The objective is to reduce the reliance on expert annotation, and improve diagnostic consistency across devices and populations. This survey provides a comprehensive and structured review of representation learning methods for retinal OCT image analysis. It covers the period from early deep learning approaches to the most recent developments in foundation models and vision-language systems. We organize the literature along a principled taxonomy of learning paradigms, encompassing supervised learning with CNN-based and transformer-based architectures, self-supervised and semi-supervised methods, generative approaches, as well as 3D volumetric modeling, multimodal representation learning, and large-scale pretrained foundation models. For each paradigm, we analyze the core methodological contributions, identify persistent limitations, and trace the connections between successive approaches. We further provide a structured overview of publicly available OCT datasets, discuss evaluation protocol considerations, and present a unified problem formulation that situates each learning paradigm within a common mathematical framework. Building on this analysis, we identify and discuss the most pressing open research directions emerging in the literature. This includes volumetric foundation model pretraining, uncertainty-aware representation learning, federated and privacy-preserving training, fairness and bias mitigation, concept-based interpretability,…
[CV-32] StableMind: Source-Free Cross-Subject fMRI Decoding with Regularized Adaptation
【速读】:该论文旨在解决跨被试功能性磁共振成像(fMRI)解码中因新被试数据有限且历史数据不可获取而导致的性能下降问题。核心挑战在于两个方面:一是大脑侧不稳定,即不同被试间fMRI响应差异大;二是图像侧监督不可靠,即细粒度视觉细节难以被有限的fMRI信号稳定支持。解决方案的关键在于提出StableMind框架,通过两类正则化策略提升适应性能:(1) 利用预训练模型的岭投影作为先验约束来稳定大脑表征,并采用基于傅里叶变换的特征级脑增强以提高个体差异鲁棒性;(2) 引入难度感知的图像模糊机制进行脑-图像对齐,削弱弱支持的细粒度视觉细节影响,同时保留稳定的视觉结构,从而增强图像监督的可靠性。
链接: https://arxiv.org/abs/2605.02586
作者: Jintao Guo,Lin Wang,Shumeng Li,Jian Zhang,Yulin Zhou,Luyang Cao,Hairong Zheng,Yinghuan Shi
机构: Nanjing University (南京大学); School of Intelligence Science and Technology, Nanjing University (南京大学智能科学与技术学院); School of Electrical and Electronic Engineering, Nanyang Technological University (南洋理工大学电气与电子工程学院); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures
Abstract:Existing cross-subject fMRI decoding methods typically train a model on multiple scanned subjects and then adapt it to a new subject using substantial paired fMRI-image data. However, in realistic scenarios, new-subject fMRI data are often limited due to costly data acquisition, and raw data from previous subjects may be inaccessible, leading existing methods to suffer performance degradation during new-subject adaptation. In this paper, we identify that this degradation stems from two key issues: brain-side instability caused by large subject differences in fMRI responses, and image-side supervision unreliability caused by fine-grained visual details that are not reliably supported by limited fMRI signals. To address these challenges, we propose StableMind, a regularized adaptation framework designed to improve brain-side representation stability and image-side supervision reliability. (1) To stabilize brain representations, StableMind reuses ridge projections from the pretrained model as adaptation priors to constrain limited-data new-subject adaptation, and applies Fourier-based feature-level brain augmentation to improve robustness to individual variability. (2) To improve image supervision reliability, StableMind introduces difficulty-aware image blur for brain-image alignment, reducing the influence of fine-grained visual details that are weakly supported by limited fMRI signals while preserving stable visual structure. Experiments on the Natural Scenes Dataset under a unified 1-hour adaptation protocol demonstrate that StableMind achieves 84.02% image retrieval accuracy and 81.66% brain retrieval accuracy averaged over four subjects, surpassing the state-of-the-art method by 5.71% brain retrieval accuracy with fewer trainable adaptation parameters. Our code is available at this https URL.
[CV-33] Stylistic Attribute Control in Latent Diffusion Models
【速读】:该论文旨在解决文本到图像扩散模型(text-to-image diffusion models)在风格属性控制方面的精确性问题,即现有方法难以实现对图像风格特征的细粒度调节,常导致内容语义的 unintended 修改。其解决方案的关键在于:首先通过合成数据集学习解耦的编辑方向(disentangled editing directions),以实现对特定风格属性(如轮廓、局部对比度、水彩化效果等)的参数化控制;其次利用指导组合(guidance composition)缩小风格微调模型与基础模型之间的领域差距,从而在保持原始图像语义不变的前提下进行风格调整;最后引入训练正则化损失和优化的零条件嵌入(null-conditional embeddings)提升真实图像编辑的一致性和可控性。
链接: https://arxiv.org/abs/2605.02583
作者: Max Reimann,Benito Buchheim,Jürgen Döllner
机构: Hasso-Plattner-Institute, University of Potsdam, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models have revolutionized image synthesis and editing, but precise control over stylistic attributes remains a challenge, often causing unintended content modifications. We propose an approach for fine-grained parametric control of stylistic attributes in latent diffusion models by learning disentangled editing directions from synthetic datasets. We use guidance composition to close the domain gap between stylistically finetuned and foundation models, preserving the original image semantics while applying stylistic adjustments. To ensure consistent edits, we introduce a training regularization loss and enhance DDIM inversion with optimized null-conditional embeddings for real image editing. We validate our approach by learning from stylistically filtered synthetic datasets varying a range of stylistic attributes, including outlines, local contrast, watercolorization effects, and geometric patterns. Our evaluations demonstrate that compared to current text-based editing techniques, our method offers well-integrated, more precise and continuously adjustable stylistic modifications.
[CV-34] Hyp2Former: Hierarchy-Aware Hyperbolic Embeddings for Open-Set Panoptic Segmentation
【速读】:该论文旨在解决开放集全景分割(Open-Set Panoptic Segmentation, OPS)中未知物体识别的挑战,即在不依赖显式未知类别标注的情况下,如何有效区分未知对象与已知类别的分布内样本。现有方法通常将已知类别视为扁平标签集合,忽略了类别间的语义层次结构,导致对未知对象的判别能力受限。解决方案的关键在于提出Hyp2Former框架,其通过在双曲空间(hyperbolic space)中持续学习已知类别的层次语义相似性,构建一个具有多层语义抽象能力的结构化嵌入空间;这种设计使未知对象即使无法被精确归类为特定已知类别,仍能因靠近高层语义概念(如“动物”或“物体”)而被可靠检测,从而实现未知对象发现与分布内鲁棒性的最优平衡。
链接: https://arxiv.org/abs/2605.02580
作者: Yao Lu,Rohit Mohan,Florian Drews,Yakov Miron,Abhinav Valada
机构: University of Freiburg (弗莱堡大学); Bosch Research, Robert Bosch GmbH (罗伯特·博世研发部)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Recognizing unknown objects is crucial for safety-critical applications such as autonomous driving and robotics. Open-Set Panoptic Segmentation (OPS) aims to segment known thing and stuff classes while identifying valid unknown objects as separate instances. Prior OPS approaches largely treat known categories as a flat label set, ignoring the semantic hierarchy that provides valuable structural priors for distinguishing unknown objects from in-distribution classes. In this work, we propose Hyp2Former, an end-to-end framework for OPS that does not require explicit modeling of unknowns during training, and instead learns hierarchical semantic similarities continuously in hyperbolic space. By explicitly encoding hierarchical relationships among known categories, the model learns a structured embedding space that captures multiple levels of semantic abstraction. As a result, unknown objects that cannot be confidently classified as known categories still remain in close proximity to higher-level concepts (e.g., an unknown animal remains closer to “animal” or “object” than to unrelated concepts such as “electronics” or “stuff”) and can therefore be reliably detected, even if their fine-grained category was not represented during training. Empirical evaluations across multiple public datasets such as MS COCO, Cityscapes, and LostFound demonstrate that Hyp2Former outperforms existing methods on OPS, achieving the best balance between unknown object discovery and in-distribution robustness.
[CV-35] Self-Supervised Spatial And Zero-Shot Angular Super-Resolution by Spatial-Angular Implicit Representation For Rotating-View SNR-Efficient Diffusion MRI
【速读】:该论文旨在解决介观尺度扩散磁共振成像(mesoscale diffusion MRI, dMRI)中因旋转视角厚切片采集方式所需大量旋转视图以满足奈奎斯特采样而导致扫描时间过长的问题。其解决方案的关键在于提出一种自监督的空间-角度隐式神经表示(Spatial-Angular Implicit Neural Representation, SA-INR),该模型通过一个条件为b=0结构先验和b方向的FiLM(Feature-wise Linear Modulation)机制的多层感知机(MLP)实现,仅需每个扩散方向一个视图即可重建高分辨率dMRI数据,从而大幅加速成像过程。该方法不仅能够准确重构训练过的b方向(空间超分辨率,spatial SR),还能学习连续q空间表示,实现对未见b方向的“零样本”高保真合成(角度超分辨率,angular SR),突破了传统采样限制,同时提升了下游各向异性张量成像(DTI)建模的定量准确性。
链接: https://arxiv.org/abs/2605.02575
作者: Yinzhe Wu,Hongyu Rui,Fanwen Wang,Jiahao Huang,Zi Wang,Guang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:Rotating-view thick-slice acquisition is highly SNR-efficient for mesoscale diffusion MRI (dMRI) but requires numerous rotating views to satisfy Nyquist sampling, resulting in long scan time. We propose a self-supervised Spatial-Angular Implicit Neural Representation (SA-INR) that reconstructs high-resolution dMRI from a single view per diffusion direction, representing a massive acceleration. Our model, an MLP conditioned on a b=0 structural prior and the b-direction via FiLM, is trained end-to-end on the anisotropic input. The framework not only accurately reconstructs the trained b-directions (spatial SR) but also learns a continuous q-space representation, enabling high-fidelity “zero-shot” synthesis of unseen b-directions (angular SR). On simulated data, our method achieved high fidelity for both trained (34.82 dB) and unseen (33.08 dB) directions. Most importantly, the synthesized angular data also improved the quantitative accuracy of downstream DTI model fitting. Our SA-INR framework breaks the classical sampling limits, paving the way for fast, quantitative high-resolution dMRI.
[CV-36] Automated In-the-Wild Data Collection for Continual AI Generated Image Detection
【速读】:该论文旨在解决生成式 AI(Generative AI)图像检测器在面对分布偏移(distribution shifts)和新兴生成模型时性能下降的问题。其解决方案的关键在于提出一种以数据为中心的持续适应框架,通过结合真实世界数据(in-the-wild data)与生成器驱动数据(generator-driven data)来实现检测器的动态更新:一方面利用弱监督管道自动构建真实场景下的训练数据集,另一方面引入少量生成器驱动数据即可有效适应新模型,并在持续学习框架中融合两类数据以增强鲁棒性并缓解灾难性遗忘(catastrophic forgetting)。实验表明,该方法显著提升了两个先进检测器的平均准确率,分别达到+9.14%和+8%。
链接: https://arxiv.org/abs/2605.02567
作者: Thanasis Pantsios,Dimitrios Karageorgiou,Christos Koutlis,George Karantaidis,Olga Papadopoulou,Symeon Papadopoulos
机构: Information Technology Institute, CERTH, Thessaloniki, Greece
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of generative Artificial Intelligence (AI) has introduced significant challenges for reliable AI-generated image detection. Existing detectors often suffer from performance degradation under distribution shifts and when encountering newly emerging generative models. In this work, we propose a data-centric continual adaptation framework for updating detectors in evolving environments. We show that both in-the-wild data and generator-driven data are essential for adapting detectors. We introduce an automated, weakly supervised pipeline for constructing in-the-wild datasets through fact-check article retrieval. Additionally, we demonstrate that incorporating even a small amount of generator-driven data during training enables effective adaptation to newly emerging models, while combining it with in-the-wild data within a continual learning framework enables robust adaptation and mitigates catastrophic forgetting. Extensive experiments on two state-of-the-art detectors show significant improvements of +9.14% and +8% in average accuracy, respectively.
[CV-37] Low-Latency Embedded Driver Monitoring System with a Multi-Task Neural Network
【速读】:该论文旨在解决道路交通事故中由驾驶员分心和疲劳导致的严重问题,提出了一种基于摄像头的实时评估方法,用于量化驾驶员的注意力、警觉性和分心行为。解决方案的关键在于设计了一个轻量级多任务神经网络模型,能够在单次前向传播中同时预测面部区域的多个指标,并结合完整的执行流程实现对驾驶员状态的实时估计,从而在满足严格实时性要求的同时,显著降低计算资源消耗,适用于计算预算受限的嵌入式系统部署。
链接: https://arxiv.org/abs/2605.02563
作者: Carmelo Scribano,Giovanni Cappelletti,Elia Giacobazzi,Giorgia Franchini,Paolo Burgio,Marko Bertogna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at: RAGE 2026@CPS-IoT Week 2026
Abstract:Road traffic accidents remain a significant global concern, with the majority attributed to human factors such as driver distraction and fatigue. This study proposes a camera-based approach to derive useful indicators to assess driver attentiveness and alertness. The proposed pipeline jointly satisfies the stringent real-time requirements imposed by the critical application and minimizes the computational requirements to allow for deployment on a tight computational budget. To this end, we develop a lightweight multi-task neural network that predicts multiple indicators for the face region in a single forward pass. The developed model is integrated into a complete execution workflow to produce a real-time estimate of attentiveness, fatigue, and engagement in distracting activities.
[CV-38] mPose-TF-ASF: Two-Stage Bidirectional Stroke Context Fusion for Badminton Stroke Classification
【速读】:该论文旨在解决羽毛球击球类型预测中难以建模丰富时序上下文信息的问题,从而提升精细化运动分析与战术决策支持的准确性。其解决方案的关键在于提出了一种名为TemPose-TF-ASF(Adjacent-Stroke Fusion)的上下文感知扩展方法,通过引入前后击球类型的双向时序依赖关系来增强击球识别能力;具体而言,该方法采用两阶段训练与推理策略,利用基线模型的初步预测作为估计的时序上下文,并以此指导ASF模块与分类器的联合优化,从而实现对击球动作更精准的时序建模与泛化性能提升。
链接: https://arxiv.org/abs/2605.02558
作者: Tzu-Yu Liu,Duan-Shin Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate badminton stroke prediction is crucial for fine-grained sports analysis and tactical decision support. However, existing methods struggle to model rich temporal context. This paper introduces \emphTemPose-TF-ASF (Adjacent-Stroke Fusion), a context-aware extension of \emphTemPose. It enhances stroke recognition by incorporating stroke-type information from both preceding and subsequent strokes. A two-stage training and inference strategy is adopted. Preliminary predictions from the baseline model are reused as estimated temporal context. These predictions guide the joint optimization of the \emphASF module and the classifier. By explicitly modeling bidirectional temporal stroke dependencies, the proposed method can be seamlessly integrated into existing state-of-the-art models. Experiments on a large-scale badminton match dataset show consistent improvements over the baseline and its variants in terms of Accuracy and Macro-F1. Moreover, integrating \emphASF into other advanced methods yields notable performance gains. These results demonstrate strong transferability and generalization capability.
[CV-39] Improving Model Safety by Targeted Error Correction ICPR
【速读】:该论文旨在解决机器学习模型在关键应用场景中因高后果错误(如误诊或误判)导致的安全性问题,尤其关注如何识别并纠正非人类主导的高风险错误,而非仅处理常规的人类级错误。其解决方案的关键在于提出一种双分类器梯度提升决策树(GBDT)流水线架构:第一个分类器用于区分“常规人类类错误”与“高风险非人类误分类”,第二个分类器则基于此判断执行保守修正策略。该方法在动物品种识别、皮肤病变诊断(ISIC 2018)和前列腺组织病理学(SICAPv2)三个领域验证有效,显著降低了危险误分类率(如ISIC中减少34.1%),同时保持极低的推理延迟(平均增加1.7%),实现了无需重新训练模型即可大幅提升系统安全性,体现了后处理校正(post-hoc correction)在可信人工智能(Trustworthy AI)中的价值。
链接: https://arxiv.org/abs/2605.02544
作者: Abolfazl Mohammadi-Seif,Ricardo Baeza-Yates
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted for publication in the Proceedings of International Conference on Pattern Recognition (ICPR) 2026. The final published version should be cited
Abstract:The widespread adoption of machine learning in critical applications demands techniques to mitigate high-consequence errors. Our method utilizes a dual-classifier GBDT pipeline to distinguish routine human-like errors from high-risk non-human misclassifications. Evaluated across three domains, animal breed classification, skin lesion diagnosis (ISIC 2018), and prostate histopathology (SICAPv2), our framework demonstrates robust safety improvements. To address real-world deployment concerns, our results confirm the pipeline introduces negligible inference latency (1.60% overhead for the animal dataset, 1.84% for ISIC, and 1.70% for SICAPv2) while outperforming traditional Maximum Class Probability (MCP) baselines in correction precision. Our conservative correction strategy successfully reduced dangerous non-human errors by 34.1% in ISIC and 12.57% in SICAPv2, improving super-class diagnostic safety to 90.41% and 92.13% respectively. This proves that safety-critical reliability can be substantially enhanced post-hoc without expensive model retraining. keywords: Error Analysis, Post-hoc Correction, Trustworthy AI. Comments: This work has been accepted for publication in the Proceedings of International Conference on Pattern Recognition (ICPR) 2026. The final published version should be cited Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.02544 [cs.AI] (or arXiv:2605.02544v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.02544 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-40] MooD: An Efficient VA-Driven Affective Image Editing Framework via Fine-Grained Semantic Control
【速读】:该论文旨在解决现有情感图像编辑(Affective Image Editing, AIE)方法在推理效率和情感表示方式上的局限性,特别是其依赖离散情感标签而难以捕捉复杂细微情绪的问题。解决方案的关键在于提出MooD框架,首次直接利用连续的效价-唤醒度(Valence-Arousal, VA)值进行细粒度且高效的AIE;其核心创新包括:1)引入VA感知检索策略以连接模糊的情感数值与具体的视觉语义;2)结合视觉迁移与语义引导机制实现可控的情感编辑;3)构建AffectSet数据集,提供VA标注以支持模型优化与评估。实验表明,MooD在情感可控性和视觉保真度方面均优于现有方法,并具备高计算效率。
链接: https://arxiv.org/abs/2605.02521
作者: Xinyi Yin,Yiduo Wang,Tingqi Hu,Meicong Si,Yunyun Shi,Shi Chen,Hao Wang,Junxiao Xue,Xuecheng Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Affective image editing (AIE) aims to edit visual content to evoke target emotions. However, existing methods often overlook inference efficiency and predominantly depend on discrete emotion representations, which to some extent limits their practical applicability and makes it challenging to capture complex and subtle human emotions. To tackle these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values for fine-grained and efficient AIE. Specifically, we first introduce a VA-Aware retrieval strategy to bridge vague affective values and concrete visual semantics. Building upon this, MooD integrates visual transfer and semantic guidance to achieve controllable AIE. Furthermore, we construct AffectSet, a VA-annotated dataset to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that our MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveal the crucial factors of our design. Our code and data will be made publicly open soon.
[CV-41] Multispectral Blind Image Super-Resolution for Standing Dead Tree Segmentation
【速读】:该论文旨在解决利用低分辨率航空影像进行林木死亡状态(standing dead trees)分割时面临的挑战,尤其是受限于传感器性能和标注数据稀缺的问题。其核心解决方案是提出一种通用的盲超分辨率框架,结合注意力引导的域自适应网络(Attention-Guided Domain Adaptation Networks, ADA-Nets),在仅使用未配对样本的情况下学习从低分辨率到高分辨率多光谱图像域的映射关系,从而提升图像质量并支持下游分割任务。该方法不依赖于合成的低分辨率图像,而是模拟真实场景中低端传感器采集的数据(如饱和、噪声和低对比度等退化类型),显著提升了无高分辨率标注条件下的分割性能(Dice分数达54%),且可泛化应用于多种图像退化问题。
链接: https://arxiv.org/abs/2605.02471
作者: Mete Ahishali,Anis Ur Rahman,Einari Heinaro,Aysen Degerli,Samuli Junttila
机构: University of Helsinki (赫尔辛基大学); CSC – IT Center for Science Ltd. (CSC - 科学技术中心有限公司); KOKO Forest Ltd. (KOKO森林有限公司); VTT Technical Research Centre of Finland (芬兰技术研究中心); University of Eastern Finland (东芬兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Mapping standing dead trees is crucial for acquiring information on the effects of climate change on forests and forest biodiversity. However, leveraging high-quality aerial imagery for dead tree segmentation poses challenges due to limitations in sensor availability and the scarcity of annotated data. In this study, we propose a generic blind super-resolution framework that incorporates Attention-Guided Domain Adaptation Networks (ADA-Nets) to learn the mapping from low-resolution to high-resolution multispectral image domains. Our approach operates solely on unpaired samples, mimicking real-world conditions, i.e., low-resolution images are not synthetically obtained by downsampling the high-resolution images. Moreover, the proposed method serves as a general-purpose restorer addressing several image degradation types, including saturation, noise, and low contrast that typically occur in low-resolution images acquired by low-end sensors. To the best of our knowledge, this is the first study to perform real-world and generic super-resolution for multispectral data in the scope of standing dead tree segmentation. Experimental evaluations demonstrate segmentation performances of 54% and 64% in Dice scores. Notably, the first result is obtained without using any high-resolution annotations; the segmentation network is trained on super-resolved low-resolution images, while evaluation is performed on the high-resolution data. We publicly share the aerial multispectral dataset with manually annotated labels at this https URL.
[CV-42] ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction CVPR2026
【速读】:该论文旨在解决单图像高动态范围(High Dynamic Range, HDR)重建中因过曝区域细节饱和和欠曝区域噪声放大而导致的病态问题,同时克服现有基于扩散模型的方法在暴露依赖性退化建模上的不足及迭代采样带来的高计算开销。其解决方案的关键在于提出一种新颖的一步式生成框架ExpoCM,将HDR重建重新建模为概率流常微分方程(Probability Flow ODE, PF-ODE),并通过暴露依赖扰动构建曝光感知的一致性轨迹;具体而言,首先利用软曝光掩码分离图像中的过曝、欠曝与正常曝光区域,并设计区域条件一致性轨迹以分别恢复饱和细节、抑制暗区噪声并保留可靠结构,且无需蒸馏过程即可实现单步推理;此外,引入曝光引导的亮度-色度损失(Exposure-guided Luminance-Chromaticity Loss)在CIE L*a*b*空间中对亮度与色度分量赋予曝光感知权重,有效缓解亮度偏差与色彩漂移问题,从而在多个基准数据集上实现最优保真度与感知质量,同时显著提升推理速度(相比DDPM快400倍以上,相比DDIM快20倍)。
链接: https://arxiv.org/abs/2605.02464
作者: Aoyu Liu,Zhen Liu,Ziyi Wang,Dian Chen,Bing Zeng,Shuaicheng Liu
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026
Abstract:Single-image HDR reconstruction aims to recover high dynamic range radiance from a single low dynamic range (LDR) input, but remains highly ill-posed due to detail saturation in over-exposed regions and noise amplification in under-exposed areas. While recent diffusion-based approaches offer powerful generative priors, they often overlook the exposure-dependent nature of the degradation and incur substantial computational costs from iterative sampling. To address these challenges, we propose ExpoCM, a novel one-step generative HDR reconstruction framework that reformulates HDR reconstruction as a Probability Flow ODE (PF-ODE) and constructs exposure-aware consistency trajectories via exposure-dependent perturbations. Specifically, a soft exposure mask is first constructed to separate the LDR image into over-, under-, and well-exposed regions. Based on this partition, region-conditioned consistency trajectories are designed to hallucinate saturated details, suppress noise in dark regions, and preserve reliable structures within a single, distillation-free inference step. To further enhance perceptual quality, we introduce an Exposure-guided Luminance-Chromaticity Loss in the CIE~ \textL^\texta^\textb^* space, which assigns exposure-aware weights to luminance and chromaticity components, effectively mitigating brightness bias and color drift. Extensive experiments on the HDR-REAL, HDR-EYE, and AIM2025 benchmarks demonstrate that ExpoCM achieves state-of-the-art fidelity and perceptual accuracy, while enabling over 400 \times and 20 \times faster inference compared to DDPM (1000 steps) and DDIM (50 steps), respectively.
[CV-43] Mtextsuperscript4Fuse: Lightweight State-Space MoE with a Cross-Scale Gating Bridge for Brain Tumor Segmentation CVPR2026
【速读】:该论文旨在解决3D脑肿瘤分割模型中存在的编码器-解码器能力失衡以及对大量输入数据依赖导致的计算密集和鲁棒性差的问题。其核心解决方案是提出一种轻量级网络M⁴Fuse,通过优化结构设计实现高效特征提取与跨域适应:关键创新在于以线性复杂度的分组状态空间混合器(grouped state space mixer)传播长程上下文信息,利用跨尺度双阶段门控桥(cross-scale dual-stage gating bridge)进行跳接特征去噪与对齐,并引入样本级混合专家(sample-level mixture-of-experts)机制以缓解多中心采集差异带来的影响,从而在显著减少参数量的同时提升分割性能和跨数据集稳定性。
链接: https://arxiv.org/abs/2605.02444
作者: Meihua Zhou,Xinyu Tong,Li Yang
机构: University of Chinese Academy of Sciences (中国科学院大学); Wannan Medical University (皖南医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages,3 figures,CVPR 2026 findings
Abstract:Encoder-decoder imbalance and the reliance on large input volumes make many 3D brain tumor segmentation models both compute-heavy and brittle. We present M\textsuperscript4Fuse, a lightweight network that prioritizes discriminative brain tumor cues over exhaustive appearance reconstruction. Our method balances encoder and decoder capacity and replaces depth expansion with a synergistic design: it propagates long-range context with linear complexity via a grouped state space mixer, denoises and aligns skip features using a cross-scale dual-stage gating bridge, and absorbs cross-site acquisition shifts with a sample-level mixture-of-experts. On the BraTS2019 and BraTS2021 benchmarks, M\textsuperscript4Fuse outperforms other lightweight excellent methods in both parameter count and performance. Even at a challenging input resolution of (64\times128\times128) (half that of existing excellent models), M\textsuperscript4Fuse reduces parameters by 62.63% and improves average performance by 0.09%. Ablations of key components validate the method’s exceptional parameter-to-accuracy efficiency and robustness across diverse data centers.
[CV-44] Anomaly-Preference Image Generation ICML2026
【速读】:该论文旨在解决有限异常样本下生成真实且多样异常样本的难题,以提升模型的泛化能力。现有方法常因分布错位(distribution misalignment)和过拟合问题,难以同时保证生成样本的保真度(fidelity)与多样性(diversity)。其解决方案的关键在于提出一种名为异常偏好优化(Anomaly Preference Optimization)的新范式,将异常生成重构为偏好学习任务,通过利用真实异常作为正向参考,从去噪轨迹偏差中直接提取优化信号,无需昂贵的人工标注;同时引入时间感知容量分配模块(Time-Aware Capacity Allocation),动态调整扩散过程中的模型容量分配,在高噪声阶段优先保障结构多样性,在低噪声阶段增强细粒度保真度,并在推理时采用分层采样策略调节一致性与对齐性的权衡,从而实现对生成质量的精确控制。
链接: https://arxiv.org/abs/2605.02439
作者: Fuyun Wang,Yuanzhi Wang,Xu Guo,Sujia Huang,Tong Zhang,Dan Wang,Hui Yan,Xin Liu,Zhen Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICML 2026
Abstract:Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, this http URL mitigate this, we introduce Anomaly Preference Optimization,a novel paradigm that reformulates anomaly generation as a preference learning this http URL to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline,prioritizing structural diversity during highnoise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherencealignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that significantly outperforms existing baselines,achieving state-of-the-art performance in both realism and diversity.
[CV-45] Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection ICML2026
【速读】:该论文旨在解决开放集监督异常检测(Open-set Supervised Anomaly Detection, OSAD)中因传统基于原型的方法采用单模态高斯先验建模正常数据,导致无法捕捉数据内在多模态特性、进而造成决策边界模糊的问题。其解决方案的关键在于提出混合原型流匹配(Mixture Prototype Flow Matching, MPFM)框架,该框架通过学习从正常特征分布到结构化高斯混合原型空间的连续变换来实现更精确的分布迁移;与传统流模型依赖单一速度向量不同,MPFM显式地将速度场建模为高斯混合先验,每个成分对应一个独立的正常类别,从而实现模式感知且语义一致的分布传输;此外,引入互信息最大化正则化器(Mutual Information Maximization Regularizer, MIMR)以防止原型坍缩并增强正常与异常样本间的可分性。
链接: https://arxiv.org/abs/2605.02438
作者: Fuyun Wang,Yuanzhi Wang,Xu Guo,Sujia Huang,Tong Zhang,Dan Wang,Hui Yan,Xin Liu,Zhen Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICML 2026
Abstract:Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.
[CV-46] Multi-Rater Calibrated Segmentation Models
【速读】:该论文旨在解决医学图像分割模型在临床决策中缺乏可靠概率估计的问题,特别是当多个专家标注存在显著差异时,现有深度分割网络通常校准不足。其关键解决方案是将多标注监督重新建模为序数学习(ordinal learning)问题,将体素级别的标注者一致性视为有序目标,从而将预测置信度与训练数据中的实际标注变异性关联起来;通过引入序数感知的评分规则(如Ranked Probability Score损失),结合标准二分类目标,在不牺牲分割性能的前提下显著提升模型校准性。
链接: https://arxiv.org/abs/2605.02437
作者: Meritxell Riera-Marín,Javier García López,Júlia Rodríguez-Comas,Miguel A. González Ballester,Adrian Galdran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Objective: Accurate probability estimates are essential for the safe deployment of medical image segmentation models in clinical decision-making. However, modern deep segmentation networks are often poorly calibrated, a problem exacerbated when multiple expert annotations exhibit substantial disagreement. While inter-rater variability is typically treated as noise, it provides valuable information about intrinsic annotation ambiguity that must be reflected in model confidence. Methods: We improve the probabilistic calibration of medical image segmentation models by reformulating multi-rater supervision as an ordinal learning problem. Voxel-wise annotator agreement is treated as an ordered target, linking predictive confidence to the empirical variability in training data. This formulation allows the use of ordinal-aware scoring rules, such as the Ranked Probability Score ordinal loss, combined with a standard binary objective to preserve discriminative performance. Results: We evaluated the proposed approach across four public segmentation benchmarks spanning ophthalmology, histopathology, and thoracic imaging. Calibration was assessed using a multi-rater extension of expected calibration error. Results consistently show that ordinal-aware training yields substantially improved calibration with respect to inter-rater agreement without degrading segmentation accuracy. Conclusions: Treating multi-rater annotations as ordered information provides a principled and architecture-agnostic route to more reliable probabilistic segmentation models.
[CV-47] DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
【速读】:该论文旨在解决训练-free图像编辑方法中因重建路径使用不匹配时间步的噪声潜变量而导致的累积漂移问题,从而限制重建保真度。解决方案的关键在于提出DirectEdit方法,其核心思想是不再试图修正反演路径,而是直接对齐前向传播路径,实现精确重建与可靠特征共享;同时引入基于注意力特征注入和多分支掩码引导噪声混合的保持机制,有效平衡图像保真度与编辑灵活性。
链接: https://arxiv.org/abs/2605.02417
作者: Desong Yang,Mang Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:With recent advancements in large-scale pre-trained text-to-image (T2I) models, training-free image editing methods have demonstrated remarkable success. Typically, these methods involve adding noise to a clean image via an inversion process, followed by separate denoising steps for the reconstruction and editing paths during the forward process. However, since the reconstruction path is approximated using noisy latents from mismatched timesteps, existing methods inevitably suffer from accumulated drift, which fundamentally limits reconstruction fidelity. To address this challenge, we systematically analyze the inversion process within the flow transformer and propose DirectEdit, a simple yet effective editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior works that attempt to rectify the inversion path, DirectEdit focuses on directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing. Furthermore, we introduce a preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending, which effectively balances fidelity and editability. Extensive experiments across diverse scenarios demonstrate that DirectEdit achieves efficient and accurate image editing, delivering superior performance that outperforms state-of-the-art methods. Code and examples are available at this https URL.
[CV-48] FEAT: Fashion Editing and Try-On from Any Design
【速读】:该论文旨在解决当前时尚设计与虚拟试穿方法在设计来源受限和完整穿搭支持不足的问题:现有方法仅依赖服装相关的图像输入,无法利用艺术作品、抽象图像或自然照片等多样化创意来源,且难以实现包含配饰的完整服饰组合的编辑与试穿。其解决方案的关键在于提出FEAT(Fashion Editing And Try-On from Any Design)框架,核心创新为两个模块:一是解耦双注入机制(Disentangled Dual Injection, DDI),能够从服装与非服装设计源中提取并选择性注入内容与风格特征;二是正交引导噪声融合机制(Orthogonal-Guided Noise Fusion, OGNF),通过正交投影去除残留衣物并采用区域特定噪声策略,实现无需训练即可对服装与配饰进行高质量虚拟试穿。
链接: https://arxiv.org/abs/2605.02393
作者: Soye Kwon,Keonyoung Lee,Dahuin Jung,Jaekoo Lee
机构: Kookmin University (酷克敏大学); Soongsil University (中央大学); Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 9 figures, 2 tables
Abstract:Fashion design aims to express a designer’s creative intent and to depict how garments interact with the human body. Recent methods condition on multimodal inputs to support garment editing and virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing And Try-On from Any Design), a method that enables editing and try-on across garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.
[CV-49] UnGAP: Uncertainty-Guided Affine Prompting for Real-Time Crack Segmentation
【速读】:该论文旨在解决结构健康监测中实时裂缝分割面临的随机不确定性(aleatoric uncertainty)问题,这类不确定性由光照变化、模糊和纹理歧义等因素引起,传统方法通常将不确定性估计视为后处理阶段的被动输出,未能将其反馈至特征学习过程中以优化模型表现。解决方案的关键在于提出一种名为UnGAP的新框架,其核心创新是引入不确定性提示特征调制器(Uncertainty-Prompted Feature Modulator, UPFM),将随机不确定性从被动指标转变为激活的视觉提示,通过像素级仿射变换动态校准特征分布;特别地,UPFM能将高方差区域(本应导致梯度抑制)转化为增强特征修正的正向信号,从而缓解异方差性导致的优化病理,同时结合边界感知检测头进一步提升预测精度,实现了高准确率与实时推理速度的平衡。
链接: https://arxiv.org/abs/2605.02380
作者: Conghui Li,Huanyu He,Xin Wang,Weiyao Lin,Chern Hong Lim
机构: Monash University (莫纳什大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-time crack segmentation is vital for structural health monitoring but is plagued by aleatoric uncertainties arising from varying lighting, blur, and texture ambiguity. Current uncertainty-aware approaches typically treat uncertainty estimation as a passive endpoint for post-hoc analysis, failing to close the loop by feeding this information back to refine feature representations. We contend that independent pixel-wise heteroscedastic modeling is uniquely suited for crack segmentation, as cracks are defined by fine-grained local gradients rather than the global semantic coherence relied upon in general object segmentation. However, this approach suffers from a structural optimization pathology: high predicted variance attenuates loss gradients, effectively causing the model to ignore difficult samples and under-fit complex boundaries. To address these challenges, we propose UnGAP, a novel framework that establishes a closed-loop mechanism between uncertainty estimation and feature learning. Central to our approach is the Uncertainty-Prompted Feature Modulator (UPFM), which treats aleatoric uncertainty as an active visual prompt rather than a mere output. UPFM dynamically calibrates feature distributions through pixel-wise affine transformations. Crucially, this mechanism mitigates the heteroscedastic pathology by transforming high variance, which would otherwise indicate gradient suppression, into a constructive signal for stronger feature rectification in ambiguous regions. Additionally, a boundary-aware detection head is introduced to further constrain prediction precision. Extensive experiments demonstrate that UnGAP balances superior segmentation accuracy with real-time inference speed, effectively validating the benefit of transforming uncertainty from a passive metric into an active calibration tool.
[CV-50] Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在上下文学习(In-Context Learning, ICL)中表现脆弱的问题,其核心挑战在于存在一个“归纳鸿沟”(inductive gap):模型常基于错误的推理得出正确答案,却难以从示范样本中提取一致且可泛化的规则。此外,视觉层面还存在两个障碍:冗余视觉标记过多干扰文本线索,以及注意力分布偏向初始图像而忽视后续上下文。解决方案的关键在于提出一种结构化的归纳-演绎(inductive-deductive)框架,包含三个核心组件:基于相似性的视觉标记压缩模块以过滤冗余图像块、动态注意力重平衡机制以均衡各图像的关注度,以及链式思维(chain-of-thought)范式引导模型逐例分析、提炼规则并应用于查询;同时引入辅助学习管道,结合监督微调与基于可验证奖励的强化学习,强化忠实引用和噪声过滤能力。该方法在八个基准测试中显著优于标准ICL基线,证明了提升VLM真正归纳能力的有效性。
链接: https://arxiv.org/abs/2605.02378
作者: Haoyu Wang,Haonan Wang,Yuyan Chen,Jun Chen,Gang Liu,Qian Wang,Jiahong Yan,Yanghua Xiao
机构: Tencent QQ; Fudan University; Cornell University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap, models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
[CV-51] Graph-Augmented Topological Internalization with Dual-Stream Classifiers for Medical Report Generation
【速读】:该论文旨在解决当前自动化医学报告生成(MRG)模型在处理胸部异常时存在的两大核心问题:一是主流方法将不同疾病视为孤立分类目标,忽视了疾病间的共现关系(co-occurrence),导致难以建模复杂的病理拓扑结构;二是缺乏有效的机制将诊断逻辑与视觉特征对齐,易产生特征幻觉(feature hallucination)。解决方案的关键在于提出一种图增强的双流医学报告生成框架(GDMRG),其核心创新包括:1)设计拓扑知识内化模块(TKI),利用图卷积网络(GCN)从全局疾病共现先验中生成显式的参数化权重矩阵,实现无需外部检索的拓扑知识注入;2)构建双流分类系统,主分支在拓扑约束下生成诊断提示,辅分支采用非对称优化策略动态校准高度不平衡样本的决策边界;3)引入诊断驱动的空间注意力机制(DGSA),通过高维临床语义重新校准视觉编码器,建立诊断与视觉定位之间的逻辑闭环,从而提升模型对复杂或细微病变的推理能力与可解释性。
链接: https://arxiv.org/abs/2605.02376
作者: Moyu Tang,Chupei Tang,Junxiao Kong,Di Wang,Tianchi Lu
机构: Lanzhou University (兰州大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated medical report generation, MRG, holds substantial value for alleviating radiologist workload and enhancing diagnostic efficiency. However, mainstream approaches typically treat diverse chest abnormalities as isolated classification targets. This paradigm often overlooks inherent disease co-occurrences and struggles to translate medical topological structures into explicit data correlations, constraining the model’s reasoning capacity on complex or subtle lesions. To address this, we propose a Graph-Augmented Dual-Stream Medical Report Generation with Topological Internalization, GDMRG. Our framework introduces a Topological Knowledge Internalization module, TKI, which leverages a Graph Convolutional Network, GCN, to generate an explicit parameterized weight matrix based on global disease co-occurrence priors. This facilitates efficient topological knowledge injection without relying on external retrieval mechanisms. Building upon this, we construct a dual-stream classification system: the main branch generates discrete diagnostic prompts under topological constraints, while the auxiliary branch employs an asymmetric optimization strategy to dynamically calibrate decision boundaries for highly imbalanced samples. Concurrently, to establish a logical closed loop between diagnosis and visual grounding, we design a diagnostic-driven Diagnosis-Guided Spatial Attention, DGSA, that utilizes high-dimensional clinical semantics to recalibrate the visual encoder, mitigating feature hallucinations. Comprehensive experiments on the MIMIC-CXR dataset demonstrate that GDMRG achieves competitive clinical efficacy, CE, while maintaining natural language fluency. Furthermore, our model exhibits robust zero-shot generalization on the IU X-Ray dataset. In summary, this work presents an integrated and interpretable paradigm for medical report generation.
[CV-52] Channel-Level Relation to Attentive Aggregation with Neighborhood-Homogeneity Constraint for Point Cloud Analysis
【速读】:该论文旨在解决3D点云理解中因现有特征判别机制局限于点级空间分布或通道响应,导致在多尺度点云网络深层中出现显著信息损失的问题。其解决方案的关键在于提出一种基于通道级度量增强机制的PointCRA网络,通过引入时间趋势变化作为新的评估维度,避免传统空间和通道注意力机制中权重维度坍塌引发的信息丢失;同时构建以邻域同质性为导向的多层级校准框架,并设计专用损失函数以提升通道判别能力,从而在低参数开销下实现对特征聚合过程的自适应修正,具备强可解释性和迁移能力。
链接: https://arxiv.org/abs/2605.02357
作者: Jiaqi Shi,Jin Xiao,Xiaoguang Hu,Wenxuan Ji,Zichong Jia,Zifan Long,Tianyou Chen
机构: 1st Jiaqi Shi1,
2nd Jin Xiao*1,
3rd Xiaoguang Hu1,
4th Wenxuan Ji1,
5th Zichong Jia1,
6th Zifan Long1,
7th Tianyou Chen1
-
- Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University (南京大学计算机科学与技术学院人工智能研究所); 2. Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University (南京大学计算机科学与技术学院人工智能研究所); 3. Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University (南京大学计算机科学与技术学院人工智能研究所); 4. Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University (南京大学计算机科学与技术学院人工智能研究所); 5. Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University (南京大学计算机科学与技术学院人工智能研究所); 6. Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University (南京大学计算机科学与技术学院人工智能研究所); 7. Institute of Artificial Intelligence, School of Computer Science and Technology, Nanjing University (南京大学计算机科学与技术学院人工智能研究所)
Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In 3D point cloud understanding, the core challenge lies in accurately capturing discriminative features within complex neighborhoods, which directly affects the execution precision of downstream tasks such as embodied AI and autonomous driving. Existing methods explore feature correlation discrimination but are limited to point-level spatial distribution or channel responses, enabling only coarse-grained level evaluation. For modern multi-scale point cloud networks, such coarse-grained metrics inevitably incur significant information loss in deeper layers. To address this issue, we propose a novel network equipped with a channel-level metric-based enhancement mechanism, termed the PointCRA network. Our core idea is to introduce temporal trend variation as a new evaluation dimension to avoid the information loss caused by weight dimension collapse in existing spatial and channel attention mechanisms. On this basis, we construct a multi-level calibration framework guided by neighborhood homogeneity for weight calibration, and design a dedicated loss function to enhance channel discriminability. The module effectively leverages the intrinsic feature priors of deep networks to adaptively correct the feature aggregation process, offering strong interpretability with low parameter overhead. Furthermore, our proposed method exhibits strong transferability, interpretability, and parameter efficiency. We validate the proposed method effectiveness on diverse datasets and benchmark models, and further demonstrate its rationality through extensive analytical experiments. Our PointCRA achieves 77.5% mIoU on the S3DIS dataset, 90.4% OA on the ScanObjectNN dataset, and 87.4% instance mIoU on the ShapeNetPart dataset. The code and pretrained weights are publicly available on GitHub:
[CV-53] Improving Imbalanced Multi-Label Chest X-Ray Diagnosis via CBAM-Enhanced CNN Backbones
【速读】:该论文旨在解决胸部X光影像诊断中传统人工解读效率低、依赖专家经验,以及深度学习方法在多标签分类任务中面临的类别不平衡和多种病灶共存定位难题。其解决方案的关键在于将卷积块注意力模块(Convolutional Block Attention Module, CBAM)嵌入到传统卷积神经网络(CNN)模块中,通过引入通道与空间双重注意力机制来增强特征表示能力,从而提升对多病灶共存场景下的分类性能,在ChestXray14数据集上实现了0.8695的平均AUC值,优于多个现有先进方法。
链接: https://arxiv.org/abs/2605.02328
作者: Duy Nguyen Huu,Duy Hoang Khuong,Ngu Huynh Cong Viet
机构: FPT University (FPT大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and Presented at FETC 2025
Abstract:Chest radiography is a widely used imaging modality for thoracic disease diagnosis, yet its conventional interpretation remains time-consuming and heavily dependent on expert knowledge. While deep learning has improved diagnostic efficiency through automated feature extraction, challenges such as class imbalance and the localization of multiple co-existing pathologies remain unsolved. In this paper, inspired by the strength of Convolutional Block Attention Module (CBAM) in feature refinement and the capability of CNN blocks in feature extraction, we propose a strategy to integrate CBAM into traditional CNN blocks to enhance performance in multi-label classification tasks. Our method achieves a mean AUC of 0.8695 on ChestXray14 dataset, outperforming several state-of-the-art this http URL source code is available at: this https URL
[CV-54] Open-access model for detecting openly dumped dispersed municipal solid waste from crowdsourced UAV imagery in Sub-Saharan Africa
【速读】:该论文旨在解决快速城市化背景下撒哈拉以南非洲地区市政固体废物(Municipal Solid Waste, MSW)管理难题,特别是由分散的非正式倾倒导致的空间监测困难问题。其解决方案的关键在于开发并验证了一个开源的深度学习模型,该模型基于众包无人机(UAV)影像进行训练和评估,在10个国家29个区域的多样化环境背景下实现了对开放式分散固废的高精度自动化检测。模型性能优异,揭示了废物积累模式的异质性,包括沿水道分布的热点区域(易加剧洪涝与公共卫生风险)及城市范围内的广泛散落,并表明废物堆积与人口密度及本地基础设施获取不足显著相关,而与区域发展指标关联较弱,凸显细粒度数据的重要性。通过开放模型工具,该研究使地方政府和社区测绘团队无需复杂技术背景即可将UAV影像转化为可操作的洞察,从而支持有针对性的干预措施和更有效的MSW管理。
链接: https://arxiv.org/abs/2605.02316
作者: Steffen Knoblauch,Ram Kumar Muthusamy,Luis M. A. Bettencourt,Costas Velis,Pierre Chrzanowski,Edward Charles Anderson,Pete Masters,Innocent Maholi,Antonio Inguane,Levi Szamek,Alexander Zipf
机构: HeiGIT at Heidelberg University (海德堡大学HeiGIT); Interdisciplinary Centre of Scientific Computing (IWR) at Heidelberg University (海德堡大学跨学科科学计算中心); GIScience Research Group at Heidelberg University (海德堡大学地理信息科学研究组); Urban Science Laboratory at The University of Chicago (芝加哥大学城市科学实验室); Santa Fe Institute (圣达菲研究所); Complexity Science Hub (复杂性科学 hub); Imperial College London (帝国理工学院); Global Facility for Disaster Reduction and Recovery (GFDRR) at World Bank (世界银行全球灾害减灾与恢复设施); Humanitarian OpenStreetMap Team (人道主义开放街图团队); Open Map Development Tanzania (坦桑尼亚开放地图开发); Data4Moz (数据4莫桑比克)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Managing municipal solid waste in rapidly urbanizing Sub-Saharan Africa remains challenging due to dispersed informal dumping and limited high-resolution datasets for spatial monitoring. We present an open-access deep learning model for automated detection of openly dumped dispersed solid waste via crowdsourced UAV imagery, trained and evaluated across 29 regions in 10 countries, encompassing diverse environmental contexts. A deep learning model trained on manually annotated image tiles achieved excellent performance in detecting openly dumped dispersed solid waste across all study regions. Predicted distributions reveal heterogeneous accumulation patterns, ranging from localized hotspots - often along waterways, where waste can exacerbate flood and public health risks - to more dispersed litter across urban areas. Waste accumulation is most strongly associated with population density and indicators of lack of local infrastructure access, whereas its relationship with broader measures of regional development is weaker, highlighting the importance of fine-scale data for understanding localized waste dynamics. By releasing the model, this study provides a ready-to-use tool for UAV imagery collected by municipalities and local mapping communities, enabling openly dumped dispersed solid waste monitoring without extensive technical expertise. This approach empowers local practitioners to convert UAV imagery into actionable insights, supporting targeted interventions and improved municipal solid waste management across Sub-Saharan Africa.
[CV-55] Momentum-Anchored Multi-Scale Fusion Model for Long-Tailed Chest X-Ray Classification
【速读】:该论文旨在解决胸部X光图像分类中因类别分布严重不均衡导致的梯度更新偏向多数类、特征漂移以及罕见但关键病灶识别性能差的问题。解决方案的关键在于提出一种基于指数移动平均(Exponential Moving Average, EMA)的动量锚定机制,通过在EfficientNet主干网络的最终扩展模块上实施选择性动量更新,构建一个缓慢演化的参考分支,以抵抗梯度引起的特征漂移,同时保留少数类的判别性特征;该策略与多尺度空间融合(1×1、3×3、5×5卷积)相结合,在长尾分布下有效维持了特征表示的稳定性,显著提升了罕见病灶如疝气(Hernia)和肺炎(Pneumonia)的分类性能。
链接: https://arxiv.org/abs/2605.02292
作者: Duy Hoang Khuong,Duy Nguyen Huu,Ngu Huynh Cong Viet
机构: FPT University (FPT大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and presented at FETC 2025
Abstract:Chest X-ray classification suffers from severe class imbalance where gradient updates bias toward majority classes, causing feature drift and poor performance on rare but critical pathologies. We propose a Momentum-Anchored Multi-Scale Fusion Network that uses exponential moving averages (EMA) as a temporal anchoring mechanism to stabilize feature representations under long-tailed distributions. Our approach applies selective momentum updates to the final expansion block of an EfficientNet backbone, creating a slowly-evolving reference branch that resists gradient-induced drift while preserving discriminative patterns for minority classes. Combined with multi-scale spatial fusion ( 1\times 1 , 3 \times 3 , 5 \times 5 convolutions), this anchoring strategy maintains representational stability throughout training. On ChestX-ray14, our method achieves 0.8682 average AUC, outperforming state-of-the-art approaches and showing particular improvements on rare pathologies like Hernia (0.9470) and Pneumonia (0.8165). The results demonstrate that momentum anchoring effectively counters feature instability in long-tailed medical image classification.
[CV-56] A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets
【速读】:该论文旨在解决合成数据与真实世界图像之间的sim2real外观差异问题,该差异限制了合成数据在实际计算机视觉任务中的应用效果。解决方案的关键在于提出一种混合方法,将基于扩散模型(如FLUX.2-4B Klein)强大的几何与材质变换能力与传统图像到图像翻译模型(如REGEN)在分布匹配方面的优势相结合,从而在保持语义一致性的前提下显著提升合成数据的视觉真实性。实验表明,REGEN在单一模型中表现优于FLUX.2-4B Klein,而两者联合使用时能进一步增强真实感。
链接: https://arxiv.org/abs/2605.02291
作者: Stefanos Pasios
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages
Abstract:Video game engines have been an important source for generating large volumes of visual synthetic datasets for training and evaluating computer vision algorithms that are to be deployed in the real world. While the visual fidelity of modern game engines has been significantly improved with technologies such as ray-tracing, a notable sim2real appearance gap between the synthetic and the real-world images still remains, which limits the utilization of synthetic datasets in real-world applications. In this letter, we investigate the ability of a state-of-the-art image generation and editing diffusion model (FLUX.2-4B Klein) to enhance the photorealism of synthetic datasets and compare its performance against a traditional image-to-image translation model (REGEN). Furthermore, we propose a hybrid approach that combines the strong geometry and material transformations of diffusion-based methods with the distribution-matching capabilities of image-to-image translation techniques. Through experiments, it is demonstrated that REGEN outperforms FLUX.2-4B Klein and that by combining both FLUX.2-4B Klein and REGEN models, better visual realism can be achieved compared to using each model individually, while maintaining semantic consistency. The code is available at: this https URL
[CV-57] LabBuilder: Protocol-Grounded 3D Layout Generation for Interactable and Safe Laboratory ICML2026
【速读】:该论文旨在解决自动化实验室部署中因环境设计困难而导致的瓶颈问题,特别是现有3D场景生成方法多面向家庭场景,侧重视觉真实性而忽视科学实验所需的严格功能语义与安全约束。其解决方案的关键在于提出一个端到端系统LabBuilder,包含三个紧密耦合的组件:首先通过LabForge构建包含标注资产和化学知识的元数据集,并将自然语言规范转化为结构化协议;随后基于这些协议,LabGen采用迭代式、约束感知的优化策略合成实验室布局;最后由LabTouchstone作为统一基准对生成布局进行验证。该方法显著优于现有最先进方法,在保证场景真实性的基础上实现了功能有效性和安全性,适用于复杂实验流程。
链接: https://arxiv.org/abs/2605.02288
作者: Jianbao Cao,Zhangrui Zhao,Bohan Feng,Zixuan Hu,Rui Li,Haiyuan Wan,Chenxi Li,Jingyuan Li,Wenzhe Cai,Lei Bai,Wanli Ouyang,Lingyu Duan,Di Huang,Mingting Pan,Sha Zhang,Xinzhu Ma,Shixiang Tang,Dongzhan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:Automated laboratories hold the promise of accelerating scientific discovery, yet their deployment is bottlenecked by the difficulty of designing safe and executable environments. While simulator-based design offers scalability, existing 3D scene generation methods are primarily tailored for household settings, optimizing for visual plausibility while neglecting the rigorous functional semantics and safety constraints essential for scientific experimentation. We present LabBuilder, an end-to-end system that generates and verifies 3D laboratory layouts from concise textual specifications. It operates through three tightly coupled components: LabForge first curates a meta-dataset of annotated assets and chemical knowledge, translating natural language specifications into structured protocols; building on these protocols, LabGen synthesizes laboratory layouts via an iterative, constraint-aware optimization strategy; finally, LabTouchstone evaluates the resulting layouts as a unified benchmark. Extensive experiments demonstrate that LabBuilder significantly outperforms existing state-of-the-art methods, producing laboratory environments that are not only realistic but also functionally valid and safe for complex experimental workflows.
[CV-58] Beyond Known Objects: A Novel Framework for Open-Set Object Detection using Negative-Aware Norm
【速读】:该论文旨在解决开放集目标检测(Open-Set Object Detection, OSOD)中未知物体识别的难题,即在复杂动态环境中,感知系统需同时识别和定位已知类别与训练阶段未见的物体。传统方法通常依赖对基础检测器进行大规模重训练以学习“对象性”(objectness),即边界框包含有效物体的可能性,而不论其类别是否在训练集中出现过。本文提出一种轻量级框架NAN-SPOT,其核心创新在于不重新训练基础检测器,而是利用一个名为负向感知范数(Negative-Aware Norm, NAN)的隐藏层度量来估计对象性,仅需数百张图像、几分钟训练即可实现高效建模。此方案显著降低了训练成本并保持了对已知类别的检测性能,为自动驾驶中的开放世界感知提供了高效率且鲁棒的解决方案。
链接: https://arxiv.org/abs/2605.02284
作者: Yuchen Zhang,Yao Lu,Johannes Betz
机构: Technical University of Munich (慕尼黑工业大学); Munich Institute of Robotics and Machine Intelligence (慕尼黑机器人与机器智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to the IEEE Intelligent Vehicles Symposium (IV 2026), Detroit, MI, United States
Abstract:Open-Set Object Detection (OSOD) is crucial for autonomous driving, where perception systems must recognize and localize both known and previously unseen objects in complex, dynamic environments. While recent approaches deliver promising results, they often require retraining the detector extensively to learn objectness, which describes the likelihood that a bounding box tightly encloses a valid object, regardless of whether its category was learned during training. Deviating from existing work, we hypothesize that standard off-the-shelf detectors may already contain helpful cues for objectness, owing to their training on numerous and diverse known categories. Building on this idea, we propose NAN-SPOT, a training-light framework that does not require to retrain the base object detector and estimates objectness by leveraging a hidden layer metric called Negative-Aware Norm (NAN), requiring only minutes of training on just hundreds of images. To support comprehensive evaluation, we introduce COCO-Open, an expanded version of the existing COCO-Mixed dataset, increasing unknown object annotations from 433 to 1853, making it the most exhaustively labeled dataset for OSOD to the best of our knowledge. Experimental results demonstrate that NAN-SPOT achieves even better performance on unknown object detection than methods requiring heavy training, without compromising performance on known objects. This efficiency and robustness make NAN-SPOT a promising step towards open-world perception in autonomous driving.
[CV-59] Rethinking Electro-Optical Vision Foundation Models for Remote Sensing Retrieval: A Controlled Comparison with Generalist VFM
【速读】:该论文旨在解决遥感图像检索任务中领域特定(EO-specific)与通用视觉基础模型(generalist vision foundation models)性能对比不清晰的问题,尤其关注EO预训练是否能带来更强的检索能力。其解决方案的关键在于设计了一个受控实验,使用相同的遥感数据集、检索协议和评估指标,系统比较了代表性EO特定模型与强通用模型在域内性能和跨场景泛化能力上的表现。结果表明,通用模型在多数情况下可媲美甚至超越EO特定模型,且具备更稳定的跨场景迁移能力,揭示了当前EO预训练策略的局限性,并强调未来模型需更好地利用遥感影像的物理、空间、光谱及地理特性。
链接: https://arxiv.org/abs/2605.02283
作者: Hyobin Park,Minseok Seo,Dong-Geol Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent electro-optical vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are competitive with, and in some cases outperform, existing EO-specific models. Moreover, EO-specific models often suffer from substantial degradation under cross-scene evaluation, while generalist models show more stable transfer. These findings suggest that EO pretraining alone does not guarantee stronger retrieval-oriented remote sensing representations. We discuss the limitations of current EO-specific pretraining strategies and highlight the need for future EO vision foundation models to better exploit the physical, spatial, spectral, and geographic characteristics of remote sensing imagery.
[CV-60] EdgeLPR: On the Deep Neural Network trade-off between Precision and Performance in LiDAR Place Recognition
【速读】:该论文旨在解决在资源受限平台(如边缘计算设备)上部署高效LiDAR-based place recognition模型的挑战,以支持长期自主导航中的回环检测与一致建图。其关键解决方案是利用鸟瞰图(Bird’s Eye View, BEV)表示将LiDAR数据转化为适合轻量级图像神经网络处理的形式,并采用统一的全局池化与线性投影描述符方案对代表性网络架构进行基准测试,同时在FP32、FP16和INT8量化级别下评估精度、鲁棒性与效率的权衡,结果表明FP16可实现与FP32相当的性能且成本更低,而INT8则表现出依赖于架构的性能下降,为面向特定应用场景的神经网络量化提供了重要依据。
链接: https://arxiv.org/abs/2605.02275
作者: Pierpaolo Serio,Hetian Wang,Zixiang Wei,Vincenzo Infantino,Lorenzo Gentilini,Lorenzo Pollini,Valentina Donzella
机构: University of Pisa (比萨大学); Queen Mary University of London (伦敦玛丽女王大学); Toyota Material Handling Manufacturing Italy (丰田物料搬运制造意大利)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to CoDIT 2026
Abstract:Place recognition is essential for long-term autonomous navigation, enabling loop closure and consistent mapping. Although deep learning has improved performance, deploying such models on resource-constrained platforms remains challenging. This work explores efficient LiDAR-based place recognition for EdgeAI by leveraging Bird’s Eye View representations to enable lightweight image-based networks. We benchmark representative architectures without aggregation heads using a unified descriptor scheme based on global pooling and linear projection, and evaluate performance under FP32, FP16, and INT8 quantization. Experiments reveal trade-offs between accuracy, robustness, and efficiency: FP16 matches FP32 with lower cost, while INT8 introduces architecture-dependent degradation. Overall, the presented results are a strong basis for future research on ‘use-case’-aware quantisation of Neural Networks for Edge deployment.
[CV-61] SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters
【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models, VFM)在多光谱成像领域(涵盖近红外NIR、短波红外SWIR和长波红外LWIR)应用时存在的谱段差距问题,即现有基于RGB数据预训练的模型难以有效迁移到非可见光谱模态。解决方案的关键在于提出SpectraDINO,通过轻量级的每模态瓶颈适配器(bottleneck adapters)将DINOv2 ViT骨干网络扩展至超越可见光的光谱范围,同时冻结原始RGB骨干以保留其丰富表征;并设计多阶段教师-学生训练协议,利用余弦蒸馏、对称对比损失、patch级对齐以及一种新颖的邻域结构保持损失,在不遗忘RGB先验的前提下实现跨模态强对齐,从而构建一个适用于多光谱泛化的通用骨干模型。
链接: https://arxiv.org/abs/2605.02258
作者: Yagiz Nalcakan,Hyeongjin Ju,Incheol Park,Sanghyeop Yeo,Youngwan Jin,Shiho Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplored. These spectral modalities offer complementary sensing capabilities critical for robust perception in adverse conditions, but present a fundamental domain gap relative to RGB-centric pretrained models. We present SpectraDINO, a multispectral VFM that bridges this spectral gap by extending DINOv2 ViT backbones to beyond-visible modalities through lightweight, per-modality bottleneck adapters, while preserving the rich representations of the frozen RGB backbone. We introduce a multi-stage teacher-student training protocol in which a frozen DINOv2 teacher guides a spectral student via cosine distillation, symmetric contrastive loss, patch-level alignment, and a novel neighborhood-structure-preservation loss. This staged curriculum enables strong cross-modal alignment without catastrophic forgetting of RGB priors. We evaluate SpectraDINO on multispectral object detection and semantic segmentation across challenging NIR, SWIR, and LWIR benchmarks using widely adopted fusion strategies. SpectraDINO achieves state-of-the-art performance across most benchmarks, validating its effectiveness as a general-purpose backbone for spectral generalization. The code and weights for model variants are available at this https URL.
[CV-62] Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning CVPR2026
【速读】:该论文旨在解决长尾分布下的个性化联邦学习(Personalized Federated Learning, PFL)中两个核心问题:一是传统微调方法因破坏基础模型(foundation model)固有的类别平衡性而导致性能低于零样本(zero-shot)基线;二是现有个性化技术在参数或特征层面融合时,将全局模型中的类别偏差传递至本地模型,造成偏倚累积。解决方案的关键在于提出FedPuReL框架,其核心创新为:通过零样本预测对本地梯度进行净化(gradient purification),以维持全局模型的类别平衡;并将个性化建模视为在冻结的全局模型基础上的残差修正(residual learning),从而实现无偏的本地适应。此设计有效分离了通用知识与个体差异,显著提升了长尾场景下全局与个性化模型的性能表现。
链接: https://arxiv.org/abs/2605.02247
作者: Shihao Hou,Chikai Shang,Zhiheng Yang,Jiacheng Yang,Xinyi Shang,Junlong Gao,Yiqun Zhang,Yang Lu
机构: Xiamen University (厦门大学); University College London (伦敦大学学院); Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Personalized federated learning (PFL) with foundation models has emerged as a promising paradigm enabling clients to adapt to heterogeneous data distributions. However, real-world scenarios often face the co-occurrence of non-IID data and long-tailed class distributions, presenting unique challenges that remain underexplored in PFL. In this paper, we investigate this long-tailed personalized federated learning and observe that current methods suffer from two limitations: (i) fine-tuning degrades performance below zero-shot baselines due to the erosion of inherent class balance in foundation models; (ii) conventional personalization techniques further transfer this bias to local models through parameter or feature-level fusion. To address these challenges, we propose Federated Learning via Gradient Purification and Residual Learning (FedPuReL), which preserves balanced knowledge in the global model while enabling unbiased personalization. Specifically, we purify local gradients using zero-shot predictions to maintain a class-balanced global model, and model personalization as residual correction atop the frozen global model. Extensive experiments demonstrate that FedPuReL consistently outperforms state-of-the-art methods, achieving superior performance on both global and personalized models across diverse long-tailed scenarios. The code is available at this https URL.
[CV-63] InfiltrNet: Dual-Branch CNN-Transformer Architecture for Brain Tumor Infiltration Risk Prediction
【速读】:该论文旨在解决胶质瘤(Gliomas)在磁共振成像(MRI)中可见边界之外的浸润范围预测问题,这是影响手术规划和放疗精度的关键挑战。现有深度学习方法主要聚焦于分割可见肿瘤区域,而忽视了对周围组织浸润风险的定量评估。解决方案的核心在于提出InfiltrNet——一种双分支架构,通过交叉注意力融合模块将卷积神经网络(CNN)编码器与Swin Transformer编码器相结合,从而从多模态MRI数据中生成三个区域的浸润风险图。此外,作者设计了一种基于距离变换的标签生成策略,从标准脑肿瘤分割(BraTS)标注中自动推导可重复的浸润风险分区,并引入Dice-CrossEntropy与边界感知损失的组合优化方案,辅以中间解码层的辅助监督头,显著提升了模型性能与可解释性。
链接: https://arxiv.org/abs/2605.02230
作者: S M Asif Hossain,Shruti Kshirsagar
机构: Wichita State University (威奇托州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review at IEEE SMC 2026
Abstract:Gliomas are aggressive brain tumors that infiltrate surrounding tissue beyond the visible tumor margins observed on Magnetic Resonance Imaging (MRI). Predicting the spatial extent of this infiltration is essential for surgical planning and radiation therapy, yet existing deep learning approaches focus on segmenting the visible tumor rather than estimating infiltration risk in the surrounding tissue. This paper presents InfiltrNet, a novel dual-branch architecture that combines a convolutional neural network (CNN) encoder with a Swin Transformer encoder through cross-attention fusion modules to predict three-zone infiltration risk maps from multimodal MRI. A label generation strategy based on distance transforms is proposed to derive reproducible infiltration risk zones from standard Brain Tumor Segmentation (BraTS) annotations. InfiltrNet is trained with a combined Dice-CrossEntropy and boundary-aware loss augmented by auxiliary supervision heads at intermediate decoder levels. Extensive experiments on BraTS 2020 and BraTS 2025 demonstrate that InfiltrNet outperforms five established baselines. Explainability analysis using GradCAM++ and Occlusion sensitivity confirms that the model attends to clinically relevant peritumoral regions.
[CV-64] oward Fine-Grained Speech Inpainting Forensics:A Dataset Method and Metric for Multi-Region Tampering Localization
【速读】:该论文旨在解决生成式语音伪造检测中一个关键但未被充分研究的问题:即在部分语音篡改(partial speech manipulation)场景下,如何准确检测并定位多个未知数量的、以词为单位的插入篡改区域(inpainting segments),而现有基准测试主要集中在全句级二分类或单一区域篡改任务上。解决方案的关键在于三个方面:首先,构建了MIST数据集,涵盖6种语言、每条语音包含1-3个独立词级篡改段,且篡改内容仅占2-7%,模拟真实世界中隐蔽性更强的攻击;其次,提出ISA(Iterative Segment Analysis)框架,通过粗粒度到细粒度的滑动窗口分类、容忍间隔的区域提案与边界精修机制,在无需预先知道篡改区域数量的情况下实现多区域恢复;最后,定义SF1@tau指标,基于时间IoU匹配评估区域计数准确性与定位精度的联合性能,推动该领域从“是否为假”向“哪里被篡改”的精细化检测演进。
链接: https://arxiv.org/abs/2605.02223
作者: Tung Vu,Yen Nguyen,Hai Nguyen,Cuong Pham,Cong Tran
机构: Posts and Telecommunications Institute of Technology (电信技术研究所)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker’s identity - an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
[CV-65] Generative Modeling with Orbit-Space Particle Flow Matching
【速读】:该论文旨在解决粒子系统生成建模中因粒子排列对称性导致的流匹配学习困难以及物理空间中几何信息难以有效编码的问题。其核心解决方案是提出Orbit-Space Geometric Probability Paths(OGPP),关键在于:(1)通过轨道空间规范化的概率路径终点,消除粒子索引带来的冗余方差;(2)引入粒子索引嵌入实现角色专业化;(3)设计弧长感知的终端速度的几何概率路径,使表面法向量成为流过程的副产物,从而在保持物理意义的同时提升生成质量与效率。
链接: https://arxiv.org/abs/2605.02222
作者: Sinan Wang,Jinjin He,Shenyifan Lu,Ruicheng Wang,Greg Turk,Bo Zhu
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Orbit-Space Geometric Probability Paths (OGPP), a particle-native flow-matching framework for generative modeling of particle systems. OGPP is motivated by two insights: (i) particles are defined up to permutation symmetries, so anonymous indexing inflates per-index target variance and yields curved, hard-to-learn flows; and (ii) particles live in physical space, so the flow terminal velocity has physical meaning and can encode geometric attributes, e.g., surface normals. OGPP instantiates three key components: (1) orbit-space canonicalization of the probability-path terminal endpoint, (2) particle index embeddings for role specialization, and (3) geometric probability paths with arc-length-aware terminal velocities that generate normals as a byproduct of the flow. We evaluate OGPP on minimal-surface benchmarks, where it reduces metric error by up to two orders of magnitude in a single inference step; on ShapeNet, where it matches the state of the art with 5x fewer steps and reaches airplane EMD comparable to DiT-3D with 26x fewer parameters and 5x fewer steps; and on single-shape encoding, where it produces normals and reconstructions competitive with 6D generators while operating entirely in 3D.
[CV-66] NTIRE 2026 Challenge on Efficient Low Light Image Enhancement: Methods and Results
【速读】:该论文旨在解决移动设备在低光照条件下图像增强的难题,核心挑战在于设计轻量级网络,在有限计算资源下实现高质量的图像增强效果。解决方案的关键在于通过NITRE 2026高效低光图像增强(Efficient Low Light Image Enhancement, E-LLIE)挑战赛收集并系统评估多种前沿方法,从而推动生成式AI(Generative AI)与轻量化模型架构的融合创新,显著提升低光图像增强在性能和效率上的平衡表现。
链接: https://arxiv.org/abs/2605.02212
作者: Jiebin Yan,Chenyu Tu,Weixia Zhang,Zhihua Wang,Peibei Cao,Qinghua Lin,Yuming Fang,Xiaoning Liu,Zongwei Wu,Zhuyun Zhou,Radu Timofte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a comprehensive review of the NITRE 2026 Efficient Low Light Image Enhancement (E-LLIE) Challenge, highlighting the proposed solutions and final outcomes. This challenge focuses on mobile image enhancement under low-light conditions, aiming to design lightweight networks that improve enhancement quality while ensuring practical deployability under limited computational resources. A total of 207 participants registered, 27 teams submitted valid entries, and 17 teams ultimately provided valid factsheet. Based on these submissions, this paper provides a systematic evaluation of recent methods for E-LLIE, offering a comprehensive overview of state-of-the-art progress and demonstrating significant improvements in both performance and efficiency.
[CV-67] MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings
【速读】:该论文旨在解决资源匮乏地区肺炎筛查与分诊支持不足的问题,尤其是在影像学、实验室检测和专科医疗资源有限的情况下。现有计算方法多为单模态,主要依赖胸部X光片(chest radiographs),难以全面反映临床复杂性。其解决方案的关键在于提出MultiSense-Pneumo框架,这是一个多模态(multimodal)集成系统,融合结构化症状描述、咳嗽音频、语音语言特征和胸部X光片,通过标准化风险信号生成统一的筛查估计值,并采用可解释的多模态融合机制,实现透明且模块化的决策支持。该系统可在标准笔记本电脑硬件上离线运行,适用于基层卫生工作者、农村诊所及应急场景,具备良好的实用性与部署可行性。
链接: https://arxiv.org/abs/2605.02207
作者: Dineth Jayakody,Pasindu Thenahandi,Chameli Dommanige
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Pneumonia remains a leading global cause of morbidity and mortality, particularly in low resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, and chest imaging, making screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal framework for pneumonia oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM based acoustic classification, domain adversarial radiograph analysis using ResNet 18, transformer based speech recognition, and an interpretable multimodal fusion operator. Each modality is transformed into a normalized risk signal and aggregated into a unified screening estimate, enabling transparent and modular decision support. MultiSense-Pneumo is designed for real world deployment under modest computational constraints and can operate fully offline on standard laptop class hardware, making it suitable for community health workers, rural clinics, and emergency response settings. Experimental results demonstrate robustness of the radiograph pathway under domain shifts, while highlighting limitations in minority class recall for acoustic signals. MultiSense-Pneumo is intended as a research prototype for screening and triage support rather than a clinically validated diagnostic system.
[CV-68] Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score NEURIPS2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在满足《通用数据保护条例》(GDPR)要求时,因缺乏一致可靠的评估指标而导致的“机器遗忘”(Machine Unlearning)效果难以量化的问题。现有五种标准指标(Forget Accuracy、Retain Accuracy、Membership Inference Attack、Activation Distance 和 JS 散度)在多个多模态问答基准上产生相互矛盾的方法排序,且在多模态任务中一致性显著低于单模态分类任务。解决方案的关键在于提出一种统一质量评分(Unified Quality Score, UQS),其权重由各指标与理想模型(仅保留目标数据重新训练的参考模型 M_star)之间距离 d(M_hat, M_star) 的 Spearman 相关系数决定;其中 Retain Accuracy 表现出最强可靠性(rho = 0.484),而 Forget Accuracy 呈负相关(rho = -0.418),UQS 在随机权重扰动下仍保持稳定排名(tau = 0.647 ± 0.262),从而提供了一个更稳健、可复现的多模态遗忘评估框架。
链接: https://arxiv.org/abs/2605.02206
作者: Abdullah Ahmad Khan,Hamid Laga,Ferdous Sohel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 Pages , 6 figures, Neurips 2026
Abstract:Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, FA, RA, MIA and AD, JS, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric’s Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 ± 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at this https URL.
[CV-69] Super-resolution of airborne laser scanning point clouds for forest inventory
【速读】:该论文旨在解决机载激光扫描(Airborne Laser Scanning, ALS)点云因稀疏性和噪声问题导致的个体树木水平森林清查不准确的问题,如树干定位和树体尺寸估计误差较大。解决方案的关键在于提出一种基于体素的卷积神经网络(CNN)模型——3D Forest Super Resolution (3DFSR),其采用U-Net架构,能够同时提升ALS点云的密度并降低噪声。该方法在北美温带森林和德国北方森林的多场景数据上验证有效,显著优于现有超分辨率算法,并且增强后的点云可直接用于TLS/MLS点云中开发的树干检测、胸径(DBH)测量与树干重建算法,从而大幅提升森林清查精度。
链接: https://arxiv.org/abs/2605.02201
作者: Jinyuan Shao,Sangyoong Park,Chunxi Zhao,Ayman Habib,Songlin Fei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Airborne Laser Scanning (ALS) can collect point clouds across large areas, enabling large-scale forest inventory. However, ALS point clouds are sparse and noisy, resulting in inaccurate individual-tree-level forest inventory, such as stem localization and tree size estimation. To overcome this problem, we propose a deep learning model, 3D Forest Super Resolution (3DFSR), to simultaneously improve point density and reduce noise for ALS forest point cloud. 3DFSR is a voxel-based CNN with a U-Net architecture. The proposed 3DFSR is evaluated on ALS point clouds collected in both temperate forests in the U.S. and boreal forests in Germany. Experimental results demonstrate that 3DFSR can generate finer point clouds of tree structure than other state-of-the-art point cloud super-resolution algorithms, achieving 0.249 m Chamfer Distance and 2.711 m Hausdorff Distance. Furthermore, to verify the effectiveness of 3DFSR point clouds in forest inventory, we conduct stem detection, DBH measurements, and stem reconstruction on both original ALS point clouds and 3DFSR enhanced point clouds. We find that stem detection and reconstruction algorithms developed for TLS/MLS point clouds can directly work on our 3DFSR point clouds, and DBH can be derived with circle-fitting method. F1 score of stem detection is improved from 0.71 on original ALS point clouds to 0.97 on 3DFSR point clouds; DBH estimation improves from 13.45 cm RMSE using allometric equations to 6.43 cm using circle fitting; comparing to stems reconstruction from MLS point clouds, stem reconstructed from 3DFSR point clouds has 0.170 m of Chamfer Distance and 0.377 m of Hausdorff Distance, and 0.95 R2 volume estimation. Finally, we find that the proposed 3DFSR is applicable to process point densities from 10 to 1700 points/m2; it also can be generalized across data collected from different LiDAR platforms without transfer learning.
[CV-70] SlimDiffSR: Toward Lightweight and Efficient Remote Sensing Image Super-Resolution via Diffusion Model Distillation
【速读】:该论文旨在解决扩散模型在遥感图像超分辨率(Remote Sensing Image Super-Resolution, RSISR)应用中因计算成本高而导致的实际部署困难问题。解决方案的关键在于提出了一种轻量化且高效的扩散框架SlimDiffSR,其核心创新包括:(1) 引入不确定性引导的 timestep 分配策略构建更强的单步教师模型,使重建难度与扩散时间步显式关联,实现生成强度的自适应调节;(2) 设计面向遥感图像特性的结构化剪枝策略,系统性移除冗余语义模块,并以频域可分离卷积(frequency-separable convolution)、方向可分离卷积(direction-separable convolution)和查询驱动全局聚合模块(query-driven global aggregation module)替代标准操作,从而充分利用遥感数据中稀疏高频细节、强方向性模式及长程空间依赖等特性;(3) 在蒸馏过程中引入最大均值差异(Maximum Mean Discrepancy, MMD)以对齐教师与学生模型的特征分布,提升知识迁移效率。实验表明,该方法在保持竞争性感知质量的同时,相较多步扩散模型实现了最高达200倍的推理加速和20倍的参数压缩。
链接: https://arxiv.org/abs/2605.02198
作者: Ce Wang,Zhenyu Hu,Wanjie Sun
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have recently achieved remarkable performance in image super-resolution (SR), but their high computational cost limits practical deployment in remote sensing applications. To address this issue, we propose SlimDiffSR, a lightweight and efficient diffusion-based framework for real-world remote sensing image super-resolution. Unlike existing single-step diffusion methods that rely on fixed timesteps, we first introduce an uncertainty-guided timestep assignment strategy to construct a stronger single-step teacher model, where reconstruction difficulty is explicitly linked to diffusion timesteps, enabling adaptive generative strength. Building upon this teacher, we further present a structured pruning strategy tailored to remote sensing imagery, which systematically removes redundant semantic modules and replaces standard operations with lightweight designs, including frequency-separable convolution, direction-separable convolution, and a query-driven global aggregation module. These components explicitly exploit the unique characteristics of remote sensing data, such as sparse high-frequency details, strong directional patterns, and long-range spatial dependencies. To enhance knowledge transfer, we incorporate Maximum Mean Discrepancy (MMD) into the distillation process to align feature distributions between the teacher and student models. Extensive experiments on multiple remote sensing benchmarks demonstrate that SlimDiffSR achieves a favorable balance between efficiency and reconstruction quality. In particular, it attains up to 200\times inference acceleration and a 20\times reduction in model parameters compared with multi-step diffusion models, while achieving competitive perceptual quality and clearly outperforming existing lightweight diffusion baselines in efficiency. The code is available at: this https URL.
[CV-71] RAFNet: Region-Aware Fusion Network for Pansharpening
【速读】:该论文旨在解决多光谱图像(Multispectral, MS)与全色图像(Panchromatic, PAN)融合过程中存在的两大瓶颈问题:一是传统基于频率域的方法依赖标准缩放点积注意力机制,导致计算复杂度呈二次增长,且无法利用遥感影像固有的区域稀疏性;二是现有空间增强策略采用静态卷积核,难以适应PAN和MS图像在频域与空间区域上的复杂变化。解决方案的关键在于提出一种区域感知融合网络(Region-Aware Fusion, RAFNet),其核心创新包括两个模块:一是空间自适应精炼模块(Spatial Adaptive Refinement, SAR),通过离散小波变换(Discrete Wavelet Transform, DWT)实现方向性频域分离,并结合K-means聚类进行区域划分,从而动态构建区域特异的自适应卷积核,实现空间与频域协同的特征增强;二是聚类引导的频域聚合模块(Clustered Frequency Aggregation, CFA),基于语义聚类引导的稀疏注意力机制,执行区域感知的稀疏注意力策略,在显著降低计算冗余的同时保障高质量频域特征提取。两者共同嵌入到渐进式多层级空间-频率网络架构中,促进鲁棒交互与精准重建。
链接: https://arxiv.org/abs/2605.02184
作者: Jianing Zhang,Zijian Zhou,Kai Sun
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 10 figures
Abstract:Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images. Although deep learning has advanced this field, mainstream frequency-based methods relying on standard scaled dot-product attention suffer from quadratic computational complexity and fail to exploit the inherent regional sparsity of remote sensing imagery. Furthermore, existing spatial enhancement strategies typically employ static convolution kernels, which struggle to adapt to the complex frequency and regional variations of PAN and MS images. To address these bottlenecks, we propose a Region-Aware Fusion (RAFNet) Network that synergistically models spatial and frequency information. Specifically, we design a Spatial Adaptive Refinement (SAR) module that leverages the discrete wavelet transform (DWT) for directional frequency separation and K-means clustering for regional partitioning, which enables the dynamic construction of region-specific adaptive convolution kernels, achieving spatially and frequency-adaptive feature enhancement. Moreover, we introduce a Clustered Frequency Aggregation (CFA) module based on a sparse attention mechanism guided by the semantic clusters, which executes a region-aware sparse attention strategy that drastically reduces computational redundancy while ensuring high-quality frequency feature extraction. In addition we integrated these modules into a progressive, multi-level spatial-frequency network architecture to facilitate robust interaction and accurate image reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that the proposed RAFNet significantly outperforms state-of-the-art pansharpening methods in both reduced- and full-resolution assessments. The code is available at this https URL.
[CV-72] Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
【速读】:该论文旨在解决多摄像头场景下跨域目标检测中的隐私保护、类别不平衡及异构架构兼容性问题(privacy-preserving, class imbalance, and heterogeneous architectures)。其核心解决方案为HeroCrystal框架,包含三个关键阶段:首先,在生成阶段引入基于扩散模型的一次性、目标感知生成模块,通过单张目标域图像学习视觉风格并结合提示控制合成特定对象实例,实现隐私友好的数据增强与罕见类别的可控生成;其次,在联邦阶段采用概率化Faster R-CNN和动态模型对比策略,提升定位精度并抑制域特异性偏差,服务器端在不接触原始数据的情况下融合异构模型;最后,在蒸馏阶段提出不一致类别集成算法,有效解决客户端间标签不一致与结构异构问题。实验表明,该方法在多个跨域检测基准上显著优于现有联邦学习与多源域适应基线,mAP提升2.1%,达到33.4%的新SOTA水平。
链接: https://arxiv.org/abs/2605.02169
作者: Peggy Joy Lu,Wei-Yu Chen,Yao-Tsung Huang,Vincent Shin-Mu Tseng
机构: National Taiwan University of Science and Technology (台湾科技大学); National Chengchi University (国立政治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 42 pages, 13 figures. Published in Information Fusion (Elsevier). DOI: https://doi.org/10.1016/j.inffus.2026.104413
Abstract:We propose HeroCrystal, a novel privacy-preserving framework for multi-camera domain-adaptive object detection, addressing challenges such as data privacy, class imbalance, and heterogeneous architectures. Our framework consists of three key stages. In the Generated Stage, we introduce a one-shot, target-aware diffusion-based generation module that learns visual style from a single target-domain image while leveraging prompt-based control to synthesize specific object instances. Unlike conventional style transfer-based methods that require large target datasets and ignore semantic-level discrepancies, our approach enables privacy-preserving augmentation to reduce ethical concerns, and introduces controllable rare object generation to mitigate long-tailed category degradation. In the Federated Stage, we employ probabilistic Faster R-CNN on the client side to improve localization accuracy, and a dynamic model contrastive strategy to suppress domain-specific bias. The server side performs model fusion across heterogeneous architectures without accessing raw data. Finally, in the Distilled Stage, we propose an inconsistent categories integration algorithm to resolve label inconsistency and architecture heterogeneity across clients. Extensive experiments on multiple cross-domain detection benchmarks demonstrate that our method outperforms existing multi-source domain adaptation and federated learning baselines under multi-class, privacy-preserving settings. Our method improves mAP by +2.1% over prior privacy-preserving approaches and achieves a new state-of-the-art mAP of 33.4%, highlighting the effectiveness of HeroCrystal in enabling practical multi-camera AI surveillance systems.
[CV-73] Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution ICML2026
【速读】:该论文旨在解决集成梯度(Integrated Gradients, IG)在解释深度神经网络时因积分路径穿越梯度噪声区域而导致解释不可靠的问题。现有方法如引导集成梯度(Guided Integrated Gradients)虽通过自适应更新低梯度幅值特征缓解了部分敏感性,但仍存在中间输入偏离数据流形(data manifold)的问题。其解决方案的关键在于提出流形对齐引导集成梯度(Manifold-Aligned Guided Integrated Gradients, MA-GIG),该方法在预训练变分自编码器(Variational Autoencoder, VAE)的潜在空间中构建归因路径,并通过解码中间潜在状态将路径导向学习到的生成流形,从而减少对不合理输入空间区域的暴露。MA-GIG通过聚合靠近输入的路径特征上的梯度,提升了归因的忠实性,有效降低了离流形噪声,在多个数据集和分类器上优于现有基于路径的归因方法。
链接: https://arxiv.org/abs/2605.02167
作者: Soyeon Kim,Seongwoo Lim,Kyowoon Lee,Jaesik Choi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 13 figures, 12 tables. Accepted to ICML 2026; includes appendix
Abstract:Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low-gradient-magnitude features, input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose \emphManifold-Aligned Guided Integrated Gradients (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By decoding intermediate latent states, MA-GIG biases the path toward the learned generative manifold and reduces exposure to implausible input-space regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers. Our code is available at this https URL.
[CV-74] Cross-Polarization Fusion of VV AND VH SAR Observations for Improved Flood Mapping
【速读】:该论文旨在解决复杂环境中利用单极化合成孔径雷达(Synthetic Aperture Radar, SAR)数据进行洪水制图的挑战,特别是在地表散射与体积散射共存的情况下,单一极化信息难以准确识别洪水边界的问题。解决方案的关键在于采用深度学习驱动的分割框架,联合利用VV和VH交叉极化SAR观测中的互补信息,通过融合VV与VH通道输入实现更鲁棒的洪水区域识别,实验表明该方法在植被覆盖和异质性较强的区域显著提升了洪水边界的刻画精度,验证了交叉极化融合对提升SAR洪水制图可靠性的有效性。
链接: https://arxiv.org/abs/2605.02153
作者: Jagrati Talreja,Tewodros Syum Gebre,Leila Hashemi Beni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Copyright 2026 IEEE. Published in the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
Abstract:Synthetic Aperture Radar (SAR) imagery is widely used for flood monitoring due to its all-weather and day-night imaging capability. However, flood mapping using single-polarization SAR data remains challenging in complex environments where surface and volume scattering coexist. In this paper, we investigate the effectiveness of cross-polarization fusion of VV and VH SAR observations for improved flood mapping. A deep learning-based segmentation framework is employed to jointly exploit complementary information from VV and VH polarizations. To ensure a fair evaluation, three configurations are compared under identical training conditions: VV only, VH only, and fused VV-VH input. Performance is assessed using standard flood mapping metrics, including Intersection over Union (IoU) and F1-score, along with qualitative visual analysis. Experimental results demonstrate that VV-VH fusion consistently outperforms single-polarization models, particularly in vegetated and heterogeneous flood regions, leading to more accurate flood boundary delineation. The findings highlight the importance of cross-polarization SAR fusion for enhancing the reliability of SAR-based flood mapping in disaster monitoring applications.
[CV-75] SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
【速读】:该论文旨在解决基于扩散模型(diffusion model)的图像编辑任务中计算成本高昂的问题,特别是由于在所有空间标记(spatial tokens)上进行迭代高分辨率去噪所导致的效率瓶颈。现有动态分辨率采样方法依赖于低级启发式策略(如边缘检测或通道方差)进行上采样,这些策略与编辑语义对齐度弱,易引发结构不一致;且未验证是否真正需要语义修改即对区域进行上采样,造成冗余计算和误差累积。其解决方案的关键在于提出一种无需训练的动态分辨率框架 SpecEdit,采用“草稿-验证”(draft-and-verify)机制:首先在低分辨率下生成语义预测,再通过标记级别的差异识别出与编辑相关的标记进行高分辨率去噪,其余标记保持粗粒度表示,从而显著降低计算开销并保持高质量输出。
链接: https://arxiv.org/abs/2605.02152
作者: Zhengan Yan,Shikang Zheng,Haoran Qin,Xiaobing Tu,Yinggui Wang,Jiacheng Liu,Jiaxuan Ren,Yuqi Lin,Peiliang Cai,Jinkui Ren,Xiantao Zhang,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group; Shandong University; UESTC; Jilin University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Main paper with supplementary material; figures and tables included
Abstract:Diffusion-based image editing offers strong semantic controllability, but remains computationally expensive due to iterative high-resolution denoising over all spatial tokens. Dynamic-resolution sampling reduces this cost by performing early steps at reduced resolution. However, existing approaches prioritize upsampling using low-level heuristics such as edge detection or channel variance, which are weakly aligned with editing semantics and may lead to structural inconsistency. Moreover, spatial regions are often upsampled without verifying whether semantic modification is actually required, resulting in redundant high-resolution computation and accumulated errors. Therefore, we propose SpecEdit, a training-free dynamic-resolution framework tailored for diffusion-based image editing. SpecEdit follows a draft-and-verify scheme: a low-resolution draft first estimates the semantic outcome, after which token-level discrepancies are used to identify edit-relevant tokens for high-resolution denoising, while the remaining tokens stay at a coarse resolution. Experiments on Qwen-Image-Edit and FLUX.1-Kontext-dev demonstrate up to 10x and 7x acceleration, while maintaining strong quality. SpecEdit is complementary to step distillation and other acceleration techniques, achieving up to 13x speedup when combined with existing methods. Our code is in supplementary material and will be released on GitHub.
[CV-76] FLoRA: Fusion-Latent for Optical Reconstruction and Flood Area Segmentation via Cross-Modal Multi-Task Distillation Network
【速读】:该论文旨在解决洪水水体制图中因单一模态遥感数据局限性导致的精度不足问题,即光学影像虽具高可解释性但受天气条件限制,而合成孔径雷达(SAR)虽具备全天候观测能力却存在视觉可解释性差的问题。解决方案的关键在于提出FLoRA框架——一种基于教师引导潜在空间的跨模态多任务学习方法,通过轻量级光学教师模型(利用RGB和归一化植被指数NDVI先验)提供分层特征,借助多尺度窗口交叉注意力与FiLM条件控制将SAR表征映射至融合潜在空间,并引入门控残差机制防止过校正;该设计实现了两个互补目标的联合优化:(a) SAR到光学图像的翻译以实现精细RGB重建,(b) 洪水区域分割以支持水文解释,其中双解码器分别采用Charbonnier SSIM、FFT边缘幅度损失及Dice BCE水文感知边对齐损失进行优化,最终显著提升了洪水制图的语义一致性和物理合理性。
链接: https://arxiv.org/abs/2605.02137
作者: Jagrati Talreja,Tewodros Syum Gebre,Leila Hashemi-Beni
机构: North Carolina AT Technical State University (北卡罗来纳州立技术大学); United Nations University (联合国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to the IEEE Journal
Abstract:Accurate flood water mapping is critical for disaster management, yet current methods struggle to fully exploit the potential of spaceborne imagery. Optical data offers high interpretability but is limited by environmental conditions, whereas SAR provides reliable all-weather coverage with reduced visual interpretability. FLoRA (Fusion Latent for Optical Reconstruction and Area Segmentation) is a cross-modal multi-task framework that jointly reconstructs high-fidelity optical imagery and segments flood water regions from Sentinel 1 SAR by fusing the complementary strengths of optical and SAR data. During training, a lightweight optical teacher (driven by RGB and NDVI priors) provides pyramidal features that guide SAR representations into a fusion latent space via multiscale windowed cross attention and FiLM conditioning, with gated residuals preventing overcorrection. This design enables multi-task learning across two complementary objectives: (a) SAR-to-optical translation for fine-grained RGB reconstruction and (b) flood water region segmentation for hydrologic interpretation. The dual decoders are optimized using Charbonnier SSIM for structural fidelity, edge FFT magnitude losses for spectral realism, and Dice BCE hydrology-aware edge alignment for precise flood water delineation. A feature distillation constraint further aligns fused SAR features with the optical teacher’s manifold. Evaluations on SEN1FLOODS11, DEEPFLOOD, and SEN12MS demonstrate that FLoRA surpasses fusion baselines in PSNR, SSIM, and LPIPS, demonstrating that multi-modal fusion within a teacher-guided latent space yields semantically faithful and physically consistent flood-water intelligence from spaceborne observations.
[CV-77] Video Generation with Predictive Latents
【速读】:该论文旨在解决视频变分自编码器(Video Variational Autoencoder, Video VAE)中潜在空间扩散性(diffusability)不足的问题,即现有方法在提升重建质量的同时,并未显著改善生成性能。其核心挑战在于如何使潜在空间更好地捕捉视频的时序动态结构,从而增强生成质量与下游任务表现。解决方案的关键在于引入一种简洁有效的预测重构目标(predictive reconstruction objective),通过随机丢弃未来帧并仅编码部分历史观测,训练解码器同时完成已知帧的重构与未来帧的预测。这一设计促使潜在空间学习到具有时间预测能力的结构,从而建立对视频动态更连贯的理解,显著提升了生成质量和可扩展性。
链接: https://arxiv.org/abs/2605.02134
作者: Yian Zhao,Feng Wang,Qiushan Guo,Chang Liu,Xiangyang Ji,Jian Zhang,Jie Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve the video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
[CV-78] From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLM s CVPR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在评估中仅聚焦于低阶几何感知能力,而缺乏对更高阶认知能力——即具身智能(grounded intelligence)所需的空间功能推理能力的系统性测评问题。其解决方案的关键在于提出一个基于视频的基准测试工具Spatial-Functional Intelligence Benchmark (SFI-Bench),该基准包含超过1500个专家标注的问题,源自多样化的第一人称室内场景视频扫描,能够系统评估两类互补的高级推理维度:结构化空间推理(Structured Spatial Reasoning)与功能推理(Functional Reasoning)。通过引入条件计数、多跳关系推理、功能配对及知识驱动故障排除等任务,SFI-Bench直接检验模型整合感知、记忆与推理的能力,从而为识别和诊断MLLMs在实现真正具身智能过程中的瓶颈提供量化依据。
链接: https://arxiv.org/abs/2605.02130
作者: Le Zhang,Jihan Yang,Soundarya Krishnan,Jimit Majmudar,Xiou Ge,Prasoon Puri,Prathamesh Nandkishor Saraf,Shruti Bhargava,Dhivya Piraviperumal,Yinan Ling,Cindy Pan,Hong Yu,Aishwarya Agrawal,Bo-Hsiang Tseng
机构: Mila - Québec AI Institute, UdeM; NYU; Apple
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenging models to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.
[CV-79] Ultrasound Vision-Language Alignment via Contrastive Learning
【速读】:该论文旨在解决超声基础模型(Ultrasound foundation models)当前仅限于视觉任务、难以在缺乏特定任务标注数据的情况下实现零样本(zero-shot)或少样本(few-shot)迁移的问题。其关键解决方案是提出EchoCare-CLIP,一种类CLIP的双编码器对比学习框架,通过将超声图像与临床文本对齐于共享嵌入空间,从而建立跨模态语义关联。该方法利用包含超过16,000对图像-文本的多器官数据集(涵盖乳腺、肝脏、肺和甲状腺),其中78%以上文本来自专家标注报告,并结合三阶段模板生成与大语言模型(LLM)生成策略补充剩余标注,最终验证了从公开数据中实现高质量超声视觉-语言对齐的可行性,同时强调了在领域适应性与表征通用性之间权衡的重要性。
链接: https://arxiv.org/abs/2605.02126
作者: Zhuoyang Lyu,Yiyang Zhang,Tongxin Wang,Ruirui Lan
机构: Harvard Medical School (哈佛医学院); Harvard T. H. Chan School of Public Health (哈佛大学陈曾熙公共卫生学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently improve cross-modal alignment over baselines, with the best configuration achieving a paired alignment score of 0.682. However, stronger alignment does not guarantee better downstream performance: CLIP-based variants with partial fine-tuning achieve the strongest zero-shot classification on external held-out datasets (0.709 on BUSI; 0.626 on AULI), while full end-to-end fine-tuning degrades transfer due to overfitting. On linear probing and few-shot adaptation, model rankings are dataset-dependent, reflecting a trade-off between domain adaptation and representational generalizability. We further show that template-based captions match or outperform LLM-generated captions, suggesting lexical diversity is not a proxy for caption quality. Taken together, our results demonstrate that ultrasound vision-language alignment is achievable from public data alone, but robust clinical transfer requires careful balancing of domain adaptation, encoder capacity, and caption supervision quality.
[CV-80] From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments
【速读】:该论文旨在解决大规模3D点云数据在处理过程中因传统球形裁剪(spherical cropping)导致周围几何上下文丢失的问题,从而影响场景语义理解的准确性。其解决方案的关键在于提出三种替代性裁剪策略——指数型(exponential)、高斯型(Gaussian)和线性型(linear)裁剪方法,这些方法能够在保持与球形裁剪相近点数的前提下,生成更大尺寸的子点云,从而保留更丰富的空间上下文信息。实验表明,这种裁剪策略的调整可显著提升模型性能,尤其在大规模室外场景中表现突出,实现了新的SOTA(state-of-the-art)结果。
链接: https://arxiv.org/abs/2605.02098
作者: Maximilian Kellner,Dominik Merkle,Michael Brunklaus,Alexander Reiterer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale 3D point clouds can consist of billions of points. Even after downsampling, these point clouds are too large for modern 3D neural networks. In order to develop a semantic understanding of the scene, the point clouds are divided into smaller subclouds that can be processed. Typically, this division is done using spherical crops, resulting in a loss of surrounding geometric context. To address this issue, we propose alternative methods that produce subclouds with larger crop sizes while maintaining a similar number of points. Specifically, we compare exponential, Gaussian, and linear cropping methods with the spherical method. We evaluated two 3D deep learning model architectures using multiple indoor and outdoor environment datasets. Our results demonstrate that altering the cropping strategy can enhance model performance, especially for large-scale outdoor scenes, yielding new state-of-the-art results. Code is available at this https URL
[CV-81] SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition ICPR2026
【速读】:该论文旨在解决手语识别中因手势细微差异导致的识别挑战问题,现有方法通常依赖于在通用动作数据集上预训练的编码器,难以捕捉此类细粒度特征。其解决方案的关键在于提出一种基于分割掩码(segmentation-based masking)的自监督预训练方法,该方法通过建模关键身体部位的存在与运动而非将手部姿态视为静态视觉标记,从而优化掩码重建目标,提升细粒度手语表征学习能力。
链接: https://arxiv.org/abs/2605.02094
作者: Kunyuan Xie,Zhixi Cai,Kalin Stefanov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICPR 2026
Abstract:Subtle hand differences make sign language recognition challenging, yet many existing methods rely on encoders pretrained on generic action datasets that poorly capture such fine-grained cues. We propose a self-supervised pretraining method for sign language recognition that uses segmentation-based masking to adapt to the presence and motion of key body parts, rather than treating hand poses as static visual tokens. The resulting mask-and-reconstruct objective improves fine-grained sign representation learning. On WLASL, NMFs-CSL, and Slovo, our encoder achieves state-of-the-art performance, improving per-instance and per-class Top-1 accuracy while using fewer input frames and modalities than comparable encoders.
[CV-82] Cross-Language Learning within Arabic Script for Low-Resource HTR
【速读】:该论文旨在解决低资源环境下阿拉伯文系手写文本识别(Handwritten Text Recognition, HTR)中因标注数据稀缺导致模型性能显著下降的问题。其解决方案的关键在于利用阿拉伯文系语言(如阿拉伯语、乌尔都语和波斯语)在书写系统上的高度字符重叠特性,通过跨脚本联合训练(cross-script joint training)实现知识迁移,从而缓解数据不足问题。实验表明,这种迁移效果主要由脚本间共享字符的结构化增益驱动,而语言特有字符则可能带来有限甚至负向迁移,这为低资源脚本家族中的迁移学习机制提供了重要洞见。
链接: https://arxiv.org/abs/2605.02089
作者: Sana Al-azzawi,Elisa Barney,Marcus Liwicki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Handwritten Text Recognition (HTR) under limited labeled data remains a challenging problem, particularly for Arabic-script languages. Although modern sequence-based recognizers perform well in high-resource settings, their accuracy degrades sharply as training data becomes scarce. Arabic-script languages share a common writing system with substantial character overlap, motivating cross-script training as a strategy to mitigate data scarcity. We performed experiments on Arabic, Urdu, and Persian scripts and achieved improvements over single-script baselines (new SotA especially for low-resource settings). A key finding of our experiments is that cross-script transfer is largely driven by script-level overlap rather than uniform accuracy improvements. Through a statistical character-level analysis we show that gains are structurally concentrated on characters shared across scripts, while language-specific characters exhibit limited or negative transfer. These findings provide insight into transfer dynamics in low-resource script families. Detailed results include: We conduct a controlled line-level study of cross-script joint training for Arabic-script HTR under low-resource regimes (number of samples K \in 100, 500, 1000 labeled lines) on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD). A CRNN model is trained on the union of multiple related Arabic-script datasets and evaluated on individual target languages. On Persian (PHTD), joint training achieves a Character Error Rate (CER) of 9.99, surpassing previously reported results despite not using the full available training data. On an Urdu dataset (UNHD), joint training reduces CER from 17.20 to 14.45. Code and data splits are released to ensure reproducibility.1
[CV-83] Observability Conditions and Filter Design for Visual Pose Estimation via Dual Quaternions
【速读】:该论文旨在解决视角-三点(Perspective-n-Point, PnP)求解器在6自由度(6-DOF)视觉目标跟踪中的两大关键问题:对噪声和异常值敏感,以及无法在测量丢失时传播估计状态。解决方案的核心在于构建一个基于对偶四元数(Dual Quaternion)的框架,通过李代数方法进行非线性可观测性分析,推导出在相对位置向量和单位向量测量两种传感模式下局部可观测性的充分条件;进一步设计了一种直接建模相对运动动力学的对偶四元数李群无迹卡尔曼滤波器(Dual Quaternion Lie Group Unscented Kalman Filter),无需假设协同测量或缓慢变化的运动特性,从而显著提升姿态估计精度与遮挡鲁棒性。
链接: https://arxiv.org/abs/2605.02054
作者: Nicholas B. Andrews,Kristi A. Morgansen
机构: 未知
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 3 tables, 5 figures
Abstract:This paper presents a dual quaternion framework for 6-DOF visual target tracking that addresses key limitations of perspective-n-point (P n P) solvers: sensitivity to noise and outliers, and inability to propagate estimates through measurement dropouts. A nonlinear observability analysis is performed using a Lie algebraic approach, deriving sufficient conditions for local observability under two sensing modalities: relative position vector and unit vector measurements. For the unit vector case, the classical collinear feature point degeneracy of the perspective-three-point problem is recovered through rank analysis of the observability codistribution matrix, providing a control-theoretic interpretation of a previously geometric result. A dual quaternion Lie group unscented Kalman filter is then developed, directly modeling relative dynamics without assumptions about cooperative measurements or slowly-varying motion. Simulations demonstrate improved pose estimation accuracy and robustness to occlusions compared to an off-the-shelf P n P solver. Results are broadly applicable to visual-inertial navigation, simultaneous localization and mapping, and P n P solver development.
[CV-84] How Can One Choose the Best CAM-Based Explainability Method for a CNN Model?
【速读】:该论文旨在解决生成式 AI (Generative AI) 中可解释性方法评估缺乏有效量化指标的问题,特别是如何衡量可解释性生成的显著图(saliency maps)与人类感知的一致性。传统常用指标如交并比(Intersection over Union, IoU)因显著图无固定形状而存在局限性。解决方案的关键在于引入多种距离度量(如曼哈顿距离和相关性)来计算人类标注边界框与显著图之间的对齐程度,并通过人群众包获取人类偏好排序,利用排名偏倚重叠(Rank-Biased Overlap, RBO)进行对比分析,从而筛选出最贴近人类感知的可解释方法。实验表明,曼哈顿距离和相关性是最优度量,LayerCAM、Score-CAM 和 IS-CAM 是表现最佳的可解释方法。
链接: https://arxiv.org/abs/2605.02007
作者: Daniel da Silva Costa,Pedro Nuno de Souza Moura,Adriana C. F. Alvim
机构: Postgraduate Program in Informatics, Federal University of the State of Rio de Janeiro (UNIRIO)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 8 pages, 4 figures and 7 tables. Code is available at: this https URL
Abstract:In recent years, several advances have been observed in Deep Learning with surprising results. Models in this area have been increasingly used in numerous applications, including those sensitive to human life, which require clear explanations and justifications. Various explainability methods have been proposed, but not many metrics to evaluate these methods. The most commonly used metric is the Intersection over Union (IoU). However, due to the characteristics of the results of the explainability methods, called saliency maps, which do not have a known shape, we hypothesise that there must be a better metric that allows one to find an explainability method that produces results that best resemble the human perception. We propose using different metrics to assess the similarity between human perception and the explanation saliency maps to find a better metric. An investigation was conducted employing a subset of the Chihuahuas images from ImageNet dataset. Several CAM-based explainability methods were used to generate saliency maps for each chihuahua image. Alignment was measured by applying distance metrics between the bounding box of human annotations and the saliency maps produced by each explainability method. Rankings of the best saliency maps were created using the results of the distance metrics and compared to the ranking obtained using people’s choice, collected through crowdsourcing, of the best explanation saliency maps for each selected image. Comparison between rankings was performed using the Rank-Biased Overlap (RBO) metric. The results indicate the feasibility of our method to find the explainability method that best resembles human perception. In our experiments, the two metrics that best resemble human perception corresponded to Manhattan and Correlation. Besides, the best explainability methods regarding human perception were LayerCAM, Score-CAM, and IS-CAM.
[CV-85] From Concept to Capability: Evaluating 3D Gaussian Splatting for Synthetic Scene Editing in Autonomous Driving
【速读】:该论文旨在解决自动驾驶系统(Autonomous Driving System, ADS)在安全相关场景重建中因真实道路数据采集不完整而导致的验证难题,尤其针对罕见但高风险场景的覆盖不足问题。其解决方案的关键在于提出并实现了一个系统性的框架,用于评估基于3D高斯溅射(3D Gaussian Splatting, 3DGS)技术的场景重建质量,特别是在多视角(横向与纵向)下对车辆和行人这两类ADS最核心目标对象的重建保真度进行量化分析,从而为将此类生成式重建方法集成到工业级自动驾驶软件开发与测试流程提供可信赖的依据。
链接: https://arxiv.org/abs/2605.01995
作者: Ali Nouri,Yifei Zhang,Yifan Zhang,Tayssir Bouraffa,Zhennan Fei,Zijian Han,Håkan Sivencrona,Anders Heyden
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in the 45th International Conference on Computer Safety, Reliability and Security (SafeComp 2026)
Abstract:The perception of an Autonomous Driving System (ADS) critically depends on relevant, comprehensive, and diverse datasets to ensure its safety while operating in the environment. Field data collection lacks completeness with respect to the list of rare but still possible safety-related scenarios needed for the development, verification, and validation of the ADS. 3D Gaussian Splatting (3DGS) has shown promising capabilities for the reconstruction and editing of scenes based on data collected by cameras and LiDAR sensors. However, the industrial fidelity evaluation of reconstructions is underexplored, which is crucial when employing such methods in safety-related systems, especially for ADS. This becomes more challenging as ADS operates in a dynamic, uncontrolled environment with limited viewpoints and often partially occluded objects. This paper addresses this gap by proposing and implementing a framework (Fig. 1) to systematically analyze the capabilities and limitations of 3DGS for use in the reconstruction of safety-related scenes. It focuses on the quality of reconstruction for vehicles and pedestrians, which are the two most critical object classes for ADS. Our findings provide industry insights into the fidelity degradation of reconstructions from multiple novel viewpoints, both lateral and longitudinal, enabling the integration of these methods into real-world industrial AD software development and testing pipelines.
[CV-86] ProtoFair: Fair Self-Supervised Contrastive Learning via Pseudo-Counterfactual Pairs
【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)中表示学习所捕获的性别、种族等人口统计学偏见问题,这类偏见会削弱模型在公平性上的表现。现有方法通常通过重构自监督目标来提升公平性,但缺乏跨不同SSL框架的可移植性。论文提出ProtoFair,其核心创新在于设计一种公平感知的对比损失函数,无需修改原有SSL目标即可集成到主流框架(如SimCLR和SupCon)中;关键机制是利用无监督原型聚类识别伪反事实对——即属于同一聚类但来自不同敏感群体的样本对,通过拉近这些内容一致但群体不同的样本在嵌入空间中的距离,促使编码器学习对敏感属性不变的表征,从而实现公平性增强。该方法仅需敏感属性标注,不依赖目标任务标签,具有良好的实用性与通用性。
链接: https://arxiv.org/abs/2605.01971
作者: Marah Halawa,Olaf Hellwich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised learning methods learn high-quality visual representations, yet recent studies show that these representations often capture demographic biases present in the training data. Existing fairness-aware methods address this by redesigning the self-supervised objective itself, limiting portability across the rapidly evolving landscape of self-supervised learning (SSL) frameworks. We propose ProtoFair, a fairness-aware contrastive loss designed to work alongside existing SSL objectives without modifying them. ProtoFair leverages unsupervised prototype clustering to identify pseudo-counterfactual pairs: samples sharing the same cluster assignment but belonging to different sensitive groups. By pulling these content-matched, cross-group samples together in the embedding space, ProtoFair encourages the encoder to learn representations that are invariant to the sensitive attribute. The method requires only sensitive attribute annotations, no target labels, and integrates seamlessly with both SimCLR and SupCon. Experiments on CelebA and UTKFace demonstrate consistent fairness improvements while maintaining competitive accuracy.
[CV-87] MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization
【速读】:该论文旨在解决多模态域泛化(Multimodal Domain Generalization, MMDG)问题,即在真实场景中部署多模态模型时,如何使模型适应训练数据分布之外的新环境,尤其是当不同环境下的记录条件(如光照、音频噪声等)存在差异时。现有方法通常采用独立的模态编码器与融合模块,并通过端到端优化联合特征来训练模型,但研究发现这种联合优化会导致编码器过度依赖跨模态共现关系(cross-modal co-occurrences),即源域特定记录条件下产生的模态间统计关联,而非学习域不变特征,这一现象被称为“融合过拟合”(Fusion Overfitting)。为解决此问题,作者提出了一种架构无关的解决方案——模态熵正则化(Modality-Entropy Regularization for Domain Generalization, MER-DG),其核心思想是在训练过程中最大化每个模态编码器输出特征分布的熵,从而增强特征多样性并抑制对源域特定统计关系的依赖,该策略可作为附加损失项集成至现有多模态框架中。实验表明,MER-DG在EPIC-Kitchens和HAC基准上相较标准融合方法平均提升约5%,相较当前最优方法提升约2%。
链接: https://arxiv.org/abs/2605.01967
作者: Yavuz Yarici,Ghassan AlRegib
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deploying multimodal models in real-world scenarios requires generalization to new environments where recording conditions differ from training, a challenge known as multimodal domain generalization (MMDG). Standard architectures employ separate encoders for each modality and a fusion module, training the system end-to-end by optimizing on the fused features. In this paper, we identify that such joint optimization causes encoders to exploit cross-modal co-occurrences, statistical relationships between modalities that arise from source-specific recording conditions, rather than learning domain-invariant features. We term this failure mode Fusion Overfitting. To address this, we propose Modality-Entropy Regularization for Domain Generalization (MER-DG), which maximizes the entropy of each encoder’s feature distribution to preserve feature diversity. MER-DG is architecture-agnostic and integrates into existing multimodal frameworks as an additive loss term. Extensive experiments on EPIC-Kitchens and HAC benchmarks demonstrate average improvements of approximately 5% over standard fusion and approximately 2% over state-of-the-art methods.
[CV-88] Sonar-GPS Fusion for Seabed Mapping in Turbid Shallow Waters with an Autonomous Surface Vehicle ICRA2026
【速读】:该论文旨在解决浑浊浅海区域(如贝类养殖区)中传统光学遥感方法失效背景下,基于前视声呐(Forward-looking Sonar, FLS)的自主水面航行器(Autonomous Surface Vehicles, ASVs)在长轨迹下因定位精度低和累积漂移导致的高分辨率海底测绘难题。其解决方案的关键在于构建一种抗漂移的海底制图框架:首先利用傅里叶-梅林变换(Fourier-Mellin Transform, FMT)实现局部声呐图像帧间的精确对齐,再通过融合全球定位系统(GPS)、惯性测量单元(IMU)和罗盘数据的扩展卡尔曼滤波(Extended Kalman Filter, EKF)进行全局轨迹优化,从而有效抑制长期运动带来的误差积累;同时采用基于方差的图像融合策略降低重叠区域的视觉伪影,最终实现了亚米级重建精度并保留了用于牡蛎库存估算所需的高分辨率纹理信息。
链接: https://arxiv.org/abs/2605.01949
作者: Yisheng Zhang,Michael Xu,Alan Williams,Matthew Gray,Nare Karapetyan,Miao Yu
机构: University of Maryland (马里兰大学); University of Maryland Center for Environmental Science (马里兰大学环境科学中心); Woods Hole Oceanographic Institution (伍兹霍尔海洋研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Abstract:Accurate seabed mapping is essential for habitat monitoring and infrastructure inspection. In turbid, shallow coastal waters, such as shellfish aquaculture farms, the effectiveness of traditional optical methods is limited. Autonomous surface vehicles (ASVs) equipped with forward-looking sonar (FLS) offer a promising alternative. However, existing sonar-based systems face challenges in achieving fine resolution mapping over long trajectories due to low-resolution positioning measurements and accumulated drift over long trajectories. In this paper, we present a drift-resilient seabed mapping framework that integrates local FLS frame alignment using the Fourier-Mellin transform (FMT) with global trajectory optimization based on an extended Kalman filter (EKF) that fuses global positioning system (GPS), inertial measurement unit (IMU), and compass data. A variance-based image blending strategy is used to further reduce visual artifacts in overlapping regions. Field trials on a structured oyster farm site show that our framework helps reduce drift in RMSE by 9.5% relative to the FMT-only baseline. This framework also enables sub-meter reconstruction accuracy and preservation of high-resolution textures needed for oyster inventory estimation within the mapped areas.
[CV-89] ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA
【速读】:该论文旨在解决生成式 AI (Generative AI) 中 Vision Mamba (ViM) 模型在资源受限边缘设备上高效部署的挑战,具体包括:线性层因动态激活异常值导致静态量化失效、低比特位宽下均匀量化难以捕捉权重分布,以及关联扫描(associative scan)加速器在 GPU 上有效但与 FPGA 流式数据流不匹配的问题。解决方案的关键在于提出一种算法-硬件协同设计 ViM-Q,其核心包括:面向硬件的量化策略,结合动态 per-token 激活量化与 per-channel 平滑以缓解异常值影响,并采用 4-bit per-block Additive Power-of-Two (APoT) 权重量化;同时设计可运行时配置的 FPGA 加速器架构,包含基于查找表(LUT)的乘法替代单元和细粒度流水线化状态空间模型(SSM)引擎,支持不同 ViM 模型维度与输入分辨率的自适应调整,从而实现低批处理推理下的显著性能提升(平均 4.96x 加速比)和能效优化(59.8x 能效增益)。
链接: https://arxiv.org/abs/2605.01935
作者: Shengzhe Lyu,Yuhan She,Patrick S. Y. Hung,Ray C. C. Cheung,Weitao Xu
机构: 未知
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM 2026). Code: this https URL
Abstract:Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with dynamic activation outliers that render static quantization ineffective, while uniform quantization fails to capture the weight distribution at low bit-widths. Furthermore, while associative scan accelerates SSMs on GPUs, its memory access patterns are misaligned with the streaming dataflow required by FPGAs. To address these challenges, we present ViM-Q, a scalable algorithm-hardware co-design for end-to-end ViM inference on the edge. We introduce a hardware-aware quantization scheme combining dynamic per-token activation quantization and per-channel smoothing to mitigate outliers, alongside a custom 4-bit per-block Additive Power-of-Two (APoT) weight quantization. The models are deployed on a runtime-parameterizable FPGA accelerator featuring a linear engine employing a Lookup-Table (LUT) unit to replace multiplications with shift-add operations, and a fine-grained pipelined SSM engine that parallelizes the state dimension while preserving sequential recurrence. Crucially, the hardware supports runtime configuration, adapting to diverse dimensions and input resolutions across the ViM family. Implemented on an AMD ZCU102 FPGA, ViM-Q achieves an average 4.96x speedup and 59.8x energy efficiency gain over a quantized NVIDIA RTX 3090 GPU baseline for low-batch inference on ViM-tiny. This co-design shows a viable path for deploying ViM models on resource-constrained edge devices.
[CV-90] Exploring Data-Free LoRA Transferability for Video Diffusion Models ICML2026
【速读】:该论文旨在解决现有低秩适配(LoRA)在视频扩散模型中因权重空间不匹配而导致的风格退化与结构坍塌问题,其核心挑战在于不同蒸馏范式(如步长蒸馏或因果蒸馏)间存在谱干扰,导致共享功能簇内的路由路径冲突。解决方案的关键是提出一种无需数据的Cluster-Aware Spectral Arbitration (CASA) 框架,通过动态权衡目标流形保护与LoRA对齐恢复,基于谱密度进行仲裁,从而有效缓解伪影并恢复LoRA的功能性。
链接: https://arxiv.org/abs/2605.01929
作者: Yuchen Wang,Wenliang Zhong,Lichen Bai,Zikai Zhou,Shitong Shao,Bojun Cheng,Shuo Chen,Shuo Yang,Zeke Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:Video diffusion models leveraging step distillation or causal distillation have achieved remarkable performance. However, adapting existing LoRAs to these variants remains a critical challenge due to weight space mismatches. We observe that direct application leads to style degradation and structural collapse, yet the underlying mechanisms remain poorly understood. To fill this gap, we delve into the weight space and identify that the incompatibility stems from spectral interference within shared functional clusters defined over singular subspaces. Specifically, our analysis reveals that while both paradigms respect spectral rigidity, they establish conflicting routing pathways that clash through constructive overload or destructive cancellation. To address this issue, we propose Cluster-Aware Spectral Arbitration (CASA), a data-free framework that dynamically arbitrates between safeguarding the target’s manifold and restoring LoRA alignment based on spectral density. Extensive experiments demonstrate that CASA effectively mitigates artifacts and revives LoRA functionality. Our code is available at this https URL
[CV-91] CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models CVPR2026
【速读】:该论文旨在解决现有生成式计算机辅助设计(CAD)系统在复杂性与现实性上的局限性问题,即当前方法受限于简化的表示方式和有限的数据集,仅能支持草图-拉伸(sketch-extrude)类操作,难以生成具有多步骤、多样化特征的高质量CAD设计历史。其解决方案的关键在于提出一种基于FeatureScript的结构化表示方法,并构建包含450k个真实世界CAD模型、涵盖15种建模操作的数据集;同时开发了一套可执行的FeatureScript程序重建管道及多模态标注机制,使视觉语言模型(VLM)能够学习到更丰富、对齐的文本-设计映射关系。实验证明,该框架显著提升了文本条件下的CAD生成和图像驱动重构性能,在准确性、多样性和特征丰富度方面均优于先前方法。
链接: https://arxiv.org/abs/2605.01925
作者: Vladislav Pyatov,Gleb Bobrovskikh,Saveliy Galochkin,Nikita Boldyrev,Oleg Voynov,Alexander Filippov,Gonzalo Ferrer,Peter Wonka,Evgeny Burnaev
机构: Applied AI Institute; AI Foundation and Algorithm Lab; AXXX; KAUST
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to CVPR 2026
Abstract:We introduce CADFS, a data-centric framework that enables large vision-language models to generate complex CAD design histories. Existing generative CAD systems are restricted to sketch-extrude operations due to simplified representations and limited datasets. We address this by introducing a FeatureScript-based representation and constructing a dataset of 450k real-world CAD models spanning 15 modeling operations. We obtain the dataset via a new pipeline that reconstructs clean, executable FeatureScript programs and provides multimodal annotations. Fine-tuning a VLM on this representation yields state-of-the-art results in text-conditioned CAD generation and image-based reconstruction, producing more accurate, diverse, and feature-rich designs than prior frameworks. Ablations show that each individual component of our framework, i.e., the FeatureScript representation, the extended operation set, and representation-aligned textual descriptions, significantly improves performance. Our framework substantially broadens the complexity and realism achievable in generative CAD. The CADFS framework and the new dataset are available at this https URL.
[CV-92] SimPB: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras
【速读】:该论文旨在解决多摄像头自动驾驶场景中同时实现透视视角(perspective view)下的2D目标检测与鸟瞰图(Bird’s Eye View, BEV)下的3D目标检测的挑战问题。现有两阶段方法仅将2D结果作为一次性线索用于3D检测,难以实现高效协同。其解决方案的关键在于提出SimPB++,一个统一的端到端模型,采用混合解码器架构,通过动态查询分配(Dynamic Query Allocation)和自适应查询聚合(Adaptive Query Aggregation)两个模块实现多视图2D与3D解码器之间的深度交互,形成循环式3D-2D-3D精化机制;同时引入Query-group Attention提升多视角2D检测性能,并设计Crop-and-Scale策略增强远距离感知能力,从而在nuScenes数据集上实现2D和3D检测的SOTA性能,且支持混合监督策略以降低对昂贵3D标注的依赖。
链接: https://arxiv.org/abs/2605.01924
作者: Yingqi Tang,Zhaotie Meng,Erkang Cheng,Haibin Ling
机构: Nullmax; Westlake University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2403.10353
Abstract:Simultaneous perception of 2D objects in perspective view and 3D objects in Bird’s Eye View (BEV) is challenging for multi-camera autonomous driving. Existing two-stage pipelines use 2D results only as a one-time cue for 3D detection. We propose SimPB++, which simultaneously detects 2D objects in perspective and 3D objects in BEV from multiple cameras. It unifies both tasks into an end-to-end model with a hybrid decoder architecture, coupling multi-view 2D and 3D decoders interactively. Two novel modules enable deep interaction: Dynamic Query Allocation adaptively assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations using multi-view 2D features, forming a cyclic 3D-2D-3D refinement. For multi-view 2D detection, we use Query-group Attention for intra-group communication. We also design a Crop-and-Scale strategy for long-range perception and a Propagating Denoising strategy with an auxiliary RoI detector. SimPB++ supports mixed supervision with 2D-only and fully annotated data, reducing reliance on expensive 3D labels. Experiments show state-of-the-art performance on nuScenes for both tasks and strong long-range detection (up to 150m) on Argoverse2.
[CV-93] EAPFusion: Intrinsic Evolving Auxiliary Prior Guidance for Infrared and Visible Image Fusion
【速读】:该论文旨在解决红外-可见光图像融合中因静态权重无法适应场景特定内容、以及注入粗粒度辅助语义时导致的目标突出与细节保留难以兼顾的问题。解决方案的关键在于提出EAPFusion框架,其核心创新是引入自进化内在先验(self-evolving intrinsic priors),通过跨尺度动态更新一组紧凑的先验信息,并利用这些先验条件生成卷积核,从而实现从固定预训练滤波器到实例自适应参数的范式转变——即基于先验的动态卷积(prior-conditioned dynamic convolution)。此外,设计了通道级融合模块,通过对红外与可见光通道进行混洗和交错操作并应用局部通道混合,显著增强跨模态互补性,最终在多个数据集上实现了最先进的融合效果及下游任务性能提升。
链接: https://arxiv.org/abs/2605.01916
作者: Zhenyu Sun,Luobin Zhang,Axi Niu,Haishen Wang,Qingsen Yan
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 7 figures
Abstract:Infrared-visible image fusion aims to create an information-rich fused image by integrating the complementary thermal saliency from infrared sensing and fine textures from visible imaging. Such accurate fusion is essential for real-world perception applications in complex scenes, including nighttime autonomous driving, search and rescue, and surveillance, and can further benefit downstream tasks such as semantic segmentation. However, most existing fusion methods rely upon static trained weights that cannot adapt to scene-specific content at inference time, and often suffer from a granularity mismatch when coarse auxiliary semantics are injected, which makes it difficult to simultaneously highlight targets and preserve details. In this work, we propose EAPFusion to address these issues by using self-evolving intrinsic priors instead of relying on external auxiliary models. Concretely, EAPFusion maintains a compact set of intrinsic priors and progressively updates them across scales. These evolved priors are utilized to dynamically generate convolutional kernels, shifting the paradigm from fixed, pre-trained filters to instance-adaptive parameters via prior-conditioned dynamic convolution. Furthermore, we design a channel-level fusion module that shuffles and interleaves infrared and visible channels, applying local channel mixing to boost cross-modal complementarity. Experiments on different datasets, including cross-dataset evaluation and semantic segmentation, show that the proposed method achieves state-of-the-art quantitative and qualitative fusion results, and consistently boosts downstream performance. Code is coming soon.
[CV-94] SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?
【速读】:该论文旨在解决现有外科视觉问答(surgical visual question answering, VQA)数据集中存在的语言捷径(linguistic shortcuts)问题,即问题表述中的实体名称隐式限制了答案空间,导致模型性能提升可能源于对语言模式的依赖而非真正的视觉理解。解决方案的关键在于提出SurgCheck诊断基准,其核心是采用成对问题设计:每张手术图像对应一个包含实体名称的原始问题和一个去除这些名称但保留相同视觉内容与正确答案的“无偏”版本,通过对比两者性能差异量化模型对语言捷径的依赖程度;同时引入边界框、箭头、空间位置和迂回表达四种语境提示(grounding cues)确保无偏问题仍具可解性,从而实现对视觉理解能力的更可靠评估。
链接: https://arxiv.org/abs/2605.01911
作者: Jongmin Shin,Ka Young Kim,Eunki Cho,Seong Tae Kim,Namkee Oh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly constrains the answer space. It remains unclear whether reported performance reflects visual understanding or reliance on such linguistic shortcuts. Methods: We introduce SurgCheck, a diagnostic benchmark for quantifying linguistic shortcut reliance in surgical VQA. SurgCheck employs a paired-question design in which each surgical frame is associated with an original question containing entity names and a less-biased counterpart that removes these names while preserving identical visual content and ground-truth answers. The resulting performance gap provides a diagnostic signal of shortcut reliance. To ensure that the less-biased question remains well-defined even without entity names, four grounding cues are incorporated: bounding box, arrow, spatial position, and periphrasis. We evaluate both general-purpose and surgical-specific VLMs under zero-shot and fine-tuned settings on SurgCheck. To evaluate open-ended zero-shot responses, we introduce an LLM-as-a-judge evaluation protocol. Results: Using SurgCheck, we observe consistent performance degradation on less-biased questions across five VLMs, despite identical visual inputs. Text-only ablation reveals minimal performance drops for action and target prediction, indicating that action and target prediction is largely driven by linguistic shortcuts rather than visual reasoning. Conclusion: SurgCheck provides a controlled diagnostic framework that exposes failure modes masked by linguistic bias in existing surgical VQA benchmarks. Our findings demonstrate that strong benchmark performance does not necessarily imply faithful visual understanding, underscoring the need for bias-aware evaluation in surgical VQA.
[CV-95] Behavior-Grounded Lane Representation Learning for Multi-Task Traffic Digital Twins
【速读】:该论文旨在解决交通数字孪生系统中静态几何表示无法捕捉动态功能语义的问题,从而阻碍了行为感知推理(如车道在复杂交通条件下的运行机制)。其解决方案的关键在于提出GeoLaneRep框架,通过联合编码静态车道几何、观测车辆轨迹与操作描述符,构建跨摄像头的共享语义嵌入空间;该嵌入空间利用对比跨摄像头对齐、辅助角色监督和时间异常检测的联合目标进行训练,实现了零样本跨摄像头匹配的高精度(横向排名误差0.004,边缘角色F1为1.000)以及窗口级异常检测的优异性能(AUROC为0.991),并进一步支持基于扩散模型的目标导向车道合成(整体规范准确率达87.9%)。
链接: https://arxiv.org/abs/2605.01901
作者: Rei Tamaru,Pei Li,Bin Ran
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); University of Wyoming (怀俄明大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Traffic digital twins are powerful tools for advanced traffic management, and most systems are built on static geometric representations. However, these representations fail to capture the dynamic functional semantics required for behavior-aware reasoning, such as how a lane operates under complex traffic conditions. To address this gap, we introduce GeoLaneRep, a behavior-grounded lane representation learning framework for traffic digital twins. GeoLaneRep jointly encodes static lane geometry, observed vehicle trajectories, and operational descriptors into a shared, cross-camera semantic embedding. The encoder is trained with a joint objective combining contrastive cross-camera alignment, auxiliary role supervision, and temporal anomaly detection. Across 16 roadside cameras and 132 lanes, the learned embeddings achieve a 0.004 lateral-rank error and an edge-role F1 of 1.000 in zero-shot cross-camera matching, and an AUROC of 0.991 for window-level anomaly detection. We further show that the same behavioral embeddings can condition a diffusion-based generator to synthesize lane geometries that satisfy targeted operational specifications, with 87.9% overall specification accuracy across 38 lane groups. GeoLaneRep thus provides a semantic interface between roadside observations and downstream digital twin tasks, supporting cross-camera transfer, behavior-aware monitoring, and goal-directed lane synthesis. The framework is openly available at this https URL.
[CV-96] Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
【速读】:该论文旨在解决多模态视频生成中未能充分利用现有基础模型(foundation models)丰富先验知识的问题。当前方法虽尝试联合生成多种模态(如RGB、深度图和掩码)的视频,但未有效挖掘不同模态空间下基础模型所蕴含的领域特异性先验。解决方案的关键在于提出M²-REPA,一种专为多模态视频生成设计的表示对齐方法:首先从扩散模型的中间表示中解耦出模态特异性特征,再将其分别与对应模态的基础模型专家进行对齐;同时引入两个协同优化的目标——多模态表示对齐损失(enforces feature-to-expert matching)和模态特异性解耦正则化项(encourages complementarity across modalities),从而实现跨模态先验的联合优化与充分利用。
链接: https://arxiv.org/abs/2605.01896
作者: Junyuan Xiao,Dingkang Liang,Xin Zhou,Yixuan Ye,Tongtong Su,Guangmo Yi,Bin Xia,Qiang Lyu,Shurui Shi,Jun Huang,Jianlou Si,Wenming Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. 26 pages, 7 figures, with supplementary material
Abstract:Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose M^2 -REPA, the first representation alignment method tailored for multi-modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain-specific priors, acting as complementary “experts.” Specifically, we first decouple modality-specific features from the diffusion model’s intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi-modal representation alignment loss that enforces feature-to-expert matching, and a modality-specific decoupling regularization that encourages complementarity across different modalities. This design enables joint optimization, fully exploiting priors from multiple foundation models. Extensive experiments demonstrate that our method significantly outperforms baselines in visual quality and long-term consistency.
[CV-97] AFFormer: Adaptive Feature Fusion Transformer for V2X Cooperative Perception under Channel Impairments
【速读】:该论文旨在解决车联网协同感知(V2X cooperative perception)在存在信道干扰(如噪声、衰落和干扰)时导致的特征退化问题,从而提升智能交通系统的鲁棒性。其核心解决方案是提出一种基于Transformer架构的自适应特征融合网络AFFormer,关键创新在于引入三个模块:多智能体与时间聚合模块用于跨代理和时间维度的上下文感知融合,双空间注意力机制高效建模空间依赖关系,以及不确定性引导融合模块通过熵驱动方式优化融合特征;此外,采用教师-学生知识蒸馏策略,利用早期协作监督信号对齐特征表示,进一步增强系统在通信质量下降条件下的检测稳定性与准确性。
链接: https://arxiv.org/abs/2605.01888
作者: Xi Zhou,Tao Huang,Qing-Long Han,Rana Abbas,Mostafa Rahimi Azghadi
机构: James Cook University (詹姆斯库克大学); Swinburne University of Technology (斯威本科技大学); Transport for NSW (新南威尔士州交通局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate 3D object detection is essential for ensuring the safety of autonomous vehicles. Cooperative perception, which leverages vehicle-to-everything (V2X) communication to share perceptual data, enhances detection but is vulnerable to channel impairments, such as noise, fading, and interference. To strengthen the reliability of intelligent transportation systems, this work improves the robustness of V2X cooperative perception under communication conditions that reflect common channel impairments. This paper proposes an Adaptive Feature Fusion Transformer (AFFormer), a Transformer-based framework that mitigates the adverse effects of corrupted features by modeling temporal, inter-agent, and spatial correlations. AFFormer introduces three key modules: Multi-Agent and Temporal Aggregation for context-aware fusion across agents and over time, Dual Spatial Attention for efficient modeling of spatial dependencies, and Uncertainty-Guided Fusion for entropy-driven refinement of fused features. A teacher-student knowledge distillation strategy further enhances robustness by aligning fused features with reliable early-collaboration supervision. AFFormer is validated on the V2XSet and DAIR-V2X datasets, where it consistently outperforms existing methods under both ideal and impaired communication conditions, demonstrating improved robustness to communication-induced feature degradation while maintaining a competitive efficiency-accuracy trade-off.
[CV-98] Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在高信息密度(High Information Density, HID)图表理解与推理任务中表现不佳的问题,其核心挑战包括:(1)细粒度感知能力有限导致关键视觉线索被忽略;(2)冗余或噪声视觉信息干扰多模态推理性能;(3)缺乏根据视觉信息量自适应调整深度推理的能力。解决方案的关键在于提出一种聚焦驱动的细粒度图表推理模型 Chart-FR1,其核心技术包括:(1)Focus-CoT——一种视觉聚焦链式思维机制,通过显式关联推理步骤与关键视觉线索(如局部图像区域和OCR信号)提升细粒度感知;(2)Focus-GRPO——一种聚焦驱动的强化学习算法,引入信息效率奖励以压缩冗余视觉信息,并设计自适应KL惩罚机制实现推理深度随发现视觉线索数量灵活调整。
链接: https://arxiv.org/abs/2605.01882
作者: Hongkun Pan,Yuwei Wu,Wanyi Hong,Shenghui Hu,Qitong Yan,Yi Yang,Rufei Han,Changju Zhou,Minfeng Zhu,Dongming Han,Wei Chen
机构: Zhejiang University (浙江大学); State Key Lab of CADCG, Zhejiang University (浙江大学CADCG国家重点实验室); HiThink Research (HiThink研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, Chart-FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus-CoT, a visual focusing chain-of-thought that enhances fine-grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus-GRPO, a focus-driven reinforcement learning algorithm with an information-efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth as more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we build HID-Chart, a challenging benchmark with an information-density metric designed to evaluate fine-grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that Chart-FR1 outperforms state-of-the-art MLLMs in chart understanding and reasoning. Code is available at this https URL.
[CV-99] BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton
【速读】:该论文旨在解决非周期性竞技体育(如羽毛球)中缺乏高质量多模态数据集的问题,特别是缺乏同步的地面反作用力(Ground Reaction Force, GRF)与高帧率多视角视频的公开数据集,这限制了在真实训练场景下无标记负荷估计(markerless load estimation)的研究进展。其解决方案的关键在于构建了一个名为BadmintonGRF的数据集,该数据集包含8路同步RGB视频(约120 FPS)、4个Kistler力板和Vicon运动捕捉系统(C3D格式),通过人工验证事件、自动化质量控制及每摄像头时间偏移校正结合不确定性元数据实现跨模态对齐;同时提供两个层级的数据分发策略:Tier 1聚焦于姿态估计与时间对齐的GRF映射任务,无需原始RGB或C3D数据即可进行基准测试;Tier 2则在受控访问下提供原始多视角RGB与C3D数据,支持更深入的外观建模或完整运动学分析。该数据集涵盖17,425个击球片段(10名受试者,156次仪器化试验),并配套预处理代码、留一被试者交叉验证划分、10个基线模型及可选的晚期融合机制,显著推动了基于多模态感知的羽毛球击打负荷估计研究。
链接: https://arxiv.org/abs/2605.01876
作者: Kuoye Niu,Jianwei Li,Shengze Cai,Yong Ma,Mengyao Jia,Lishun Shen,Zhenheng Zhang,Yuxin Peng,Xian Song
机构: Beijing Sport University (北京体育大学); Zhejiang University (浙江大学); Wuhan Sports University (武汉体育学院); Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal resources for non-periodic court sports with laboratory-grade sensing remain scarce: few publicly pair instrumented ground reaction force (GRF) with high-frame-rate multi-view video, limiting markerless load estimation in realistic training settings. BadmintonGRF records eight synchronized RGB views at ~120 FPS, four Kistler force plates, and Vicon motion capture (C3D) without hardware genlock across modalities; alignment combines human-verified events, automated quality assurance, and per-camera time offsets with uncertainty metadata. Tier 1 distributes pose, time-aligned GRF, metadata, and splits under CC BY-NC 4.0, enabling the primary benchmark without raw RGB or C3D; we report a Tier 1 task that maps 2D pose to GRF. Tier 2 provides raw RGB and C3D under controlled access for studies that require appearance or full kinematics. The public release contains 17,425 impact-segment archives in the 10-subject benchmark tree (156 instrumented trials; raw multi-view RGB alone exceeds 1 TB); benchmark loader gates retain 12,867 view-specific instances and 1,732 unique impacts after multi-view deduplication. We are not aware of prior public badminton corpora that combine this sensing layout with audited video–GRF alignment for impact-centric GRF estimation. We distribute preprocessing code, leave-one-subject-out splits, ten reference baselines, and optional late fusion (one deterministic test-time pass per instance; no test-time augmentation), with a within-trial diagnostic in the supplementary material.
[CV-100] Evolving Token Communication with Parametric Memory Network
【速读】:该论文旨在解决语义令牌(semantic token)在无线传输中因完整发送而导致通信开销过大的问题。解决方案的关键在于提出一种基于参数化记忆网络的演化语义令牌通信系统,其核心创新是仅传输每个语义令牌的等长前缀,从而降低传输成本,同时在接收端利用嵌入于网络参数中的语义记忆,通过kNN引导的教师分布对预训练GPT-2恢复模块进行微调,以推断并重构缺失的后缀信息,实现完整的语义令牌恢复。此外,引入在线演化策略定期利用新观测样本更新记忆网络和整个系统,提升在信道分布偏移下的适应能力。
链接: https://arxiv.org/abs/2605.01869
作者: Weixuan Chen,Qianqian Yang
机构: Zhejiang University (浙江大学)
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Token communication has emerged as a promising framework for efficient wireless transmission by representing source data as compact semantic tokens. However, transmitting full semantic tokens still incurs considerable communication overhead. In this paper, we propose an evolving semantic token communication system with a parametric memory network over MIMO fading channels. Specifically, only an equal-length prefix of each semantic token is transmitted, which reduces transmission cost while preserving a consistent token structure for receiver-side recovery. At the receiver, a parametric memory network is introduced to reconstruct the missing suffix information from the received token prefixes, where semantic memory is stored implicitly in the network parameters. To realize this design, full semantic tokens are first organized into a codebook, and truncated tokens are paired with the codeword labels of their corresponding full tokens. Based on these token-label pairs, kNN-based teacher distributions are constructed to fine-tune a pretrained GPT-2-based recovery module, which learns to infer the codeword distribution of each incomplete token and recover the corresponding complete semantic token. In addition, an online evolution strategy is developed to periodically update the parametric memory network and the entire system using newly observed test samples, thereby improving adaptability under distribution shifts. Experimental results demonstrate that the proposed method consistently outperforms the existing evolving memory benchmark under different channel conditions and channel bandwidth ratios, with up to 1.09 dB PSNR improvement.
[CV-101] Decouple and Cache: KV Cache Construction for Streaming Video Understanding
【速读】:该论文旨在解决流式视频理解中两个核心挑战:一是如何在有限内存和计算资源下持续构建新键值(Key-Value, KV)缓存并淘汰旧缓存以应对无界视频流;二是如何让模型从短序列训练中学习,并有效泛化到长视频流场景。现有方法要么无法扩展至无界流式视频,要么仅关注缓存复用策略,忽视了缓存构造机制的影响。为此,论文提出了一种无需训练的缓存构造机制——解耦流式缓存(Decoupled Streaming Cache, DSCache),其关键在于将历史KV缓存与按需构建的瞬时缓存分离,从而保留近期输入的信息丰富性;同时引入位置无关编码策略,支持超出训练长度的位置外推,避免位置溢出问题,显著提升模型在流式视频问答任务上的性能表现。
链接: https://arxiv.org/abs/2605.01858
作者: Zhanzhong Pang,Dibyadip Chatterjee,Fadime Sener,Angela Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 6 figures, 10 tables
Abstract:Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, continuously constructing new and evicting old key-value(KV) caches is required for unbounded streams. Secondly, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache(DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on-demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring KV caches to support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache’s state-of-the-art performance, with an average 2.5% accuracy gains over prior methods.
[CV-102] High-Fidelity Mobile Avatars with Pruned Local Blendshapes CVPR2026
【速读】:该论文旨在解决从多视角视频中重建高保真度人体虚拟形象(human avatar)并在移动设备上高效运行的问题。现有方法虽能生成高质量基于高斯分布的全身虚拟形象,但其依赖复杂的姿态相关外观建模,计算量大,难以部署于移动端;而近期轻量化方法虽可适配移动设备,却因线性组合全局姿态特征与混合形状(blendshapes)导致细节损失。论文的关键解决方案在于:首先观察到局部区域内高斯点属性具有高度相关性,因此采用小肢体区域的局部线性混合形状(local linear blendshapes)来更精确地捕捉全局非线性变化;其次提出移除属性变化较小的高斯点对应的混合形状,从而构建最小化混合形状表示以进一步压缩计算量和模型尺寸;最终实现无需预训练模型的端到端训练,并通过WebGPU框架支持多设备部署,实现在移动设备上以2K分辨率稳定达到120 FPS的高质量渲染性能。
链接: https://arxiv.org/abs/2605.01854
作者: Youyi Zhan,He Wang,Tianjia Shao,Kun Zhou
机构: Zhejiang University (浙江大学); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: CVPR 2026. Project page this https URL
Abstract:We propose a method to reconstruct high-fidelity human avatars from multi-view video that can run on mobile devices. Many works can model high-quality Gaussian-based full-body avatars from multi-view video. However, these methods require heavy computation to obtain pose-dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose-dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propose to remove blendshapes for Gaussians whose attributes change little, yielding a minimal blendshape representation. Our method is an end-to-end training method without a pretrained model. To make it run on multiple devices, we implement our method using WebGPU. Experiments show that our method can render high-quality human avatars with better details, and can reach 120 FPS at 2K resolution on mobile devices.
[CV-103] DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity
【速读】:该论文旨在解决多视角三维重建(multi-view 3D reconstruction)中的尺度模糊问题(scale ambiguity),即在缺乏已知尺寸参考物或预先标定的情况下,传统方法无法恢复场景的绝对尺度。解决方案的关键在于利用双像素(dual-pixel, DP)传感器捕获的图像中包含的散焦模糊信息,结合从多视角立体匹配中恢复的相对深度图(depth maps up to scale),通过一个线性估计方法自动推导出绝对尺度。进一步地,作者设计了一个基于强度的优化阶段,利用跨视图的模糊核将左右DP图像相互校正,从而实现高精度的尺度恢复与图像对齐。
链接: https://arxiv.org/abs/2605.01852
作者: Lilika Makabe,Kohei Ashida,Hiroaki Santo,Fumio Okura,Yasuyuki Matsushita
机构: Graduate School of Information Science and Technology, The University of Osaka, Japan(大阪大学信息科学与技术研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-view 3D reconstruction, namely, structure-from-motion followed by multi-view stereo, is a fundamental component of 3D computer vision. In general, multi-view 3D reconstruction suffers from an unknown scale ambiguity unless a reference object of known size is present in the scene. In this article, we show that multi-view images captured using a dual-pixel (DP) sensor can automatically resolve the scale ambiguity, without requiring a reference object or prior calibration. Specifically, the defocus blur observed in DP images provides sufficient information to determine the absolute scale when paired with depth maps (up to scale) recovered from multi-view 3D reconstruction. Based on this observation, we develop a simple yet effective linear method to estimate the absolute scale, followed by the intensity-based optimization stage that aligns the left and right DP images by shifting them back toward each other using cross-view blur kernels. Experiments demonstrate the effectiveness of the proposed approach across diverse scenes captured with different cameras and lenses. Code and data are available at this https URL
[CV-104] Disentangled Anatomy-Disease Diffusion (DADD) for Controllable Ulcerative Colitis Progression Synthesis CVPR
【速读】:该论文旨在解决在保持患者特异性解剖结构的前提下,可控地合成不同疾病阶段的纵向医学图像的问题,尤其针对溃疡性结肠炎(Ulcerative Colitis, UC)内镜图像中病理纹理与结构特征纠缠导致的建模困难。其解决方案的关键在于提出了一种解耦式 Anatomy-Disease Diffusion (DADD) 框架,通过两个互补的嵌入条件:预训练图像编码器用于捕获患者特异性解剖信息,以及独立训练的序数嵌入器用于表征累积疾病严重程度;进一步引入基于交叉注意力机制的 Feature Purifier 来识别并抑制与疾病相关的通道,从而获得纯净的解剖表示;并通过 Triple-Pathway Cross-Attention 机制结合分辨率依赖的路由门,将净化后的解剖 token 与目标疾病 token 注入去噪网络,利用 U-Net 层次结构区分全局结构与细粒度病理纹理;此外,创新性地提出 Delta Steering 信号,在推理阶段无需额外前向传播即可实现对疾病进展方向的显式单次控制。
链接: https://arxiv.org/abs/2605.01848
作者: Umut Dundar,Alptekin Temizel
机构: Middle East Technical University (中东技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures. Code and dataset are publicly available. Accepted for presentation at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, Synthetic Data for Computer Vision Workshop (SynData4CV)
Abstract:Synthesizing longitudinal medical images at controllable disease stages while preserving patient-specific anatomy is hindered by the entanglement of pathological textures and structural features. We address this challenge for ulcerative colitis (UC) endoscopy, where severity follows a continuous ordinal progression along the Mayo Endoscopic Score (MES). Our framework, Disentangled Anatomy-Disease Diffusion (DADD), conditions a latent diffusion model on two complementary embeddings: a pretrained image encoder for patient anatomy and a separately trained ordinal embedder for cumulative disease severity. Since image embeddings inevitably capture disease information, we introduce a Feature Purifier, a cross-attention-based erasure mechanism that identifies and suppresses disease-correlated channels, yielding purified anatomical representations. These cleaned anatomy tokens and target disease tokens are injected into the denoising network via a Triple-Pathway Cross-Attention mechanism with resolution-dependent routing gates. This architecture leverages the U-Net hierarchy, in which different network depths encode global structure versus fine-grained pathological texture. Furthermore, we introduce Delta Steering, a training-free directional signal derived from the ordinal embeddings that enables explicit, single-pass control over disease transitions at inference without requiring additional forward passes. Validated on the LIMUC dataset, our approach produces high-fidelity images across all severity levels and effectively rebalances skewed class distributions, enhancing performance for downstream classification tasks. The dataset is available at this http URL and the code base at this http URL
[CV-105] GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models CVPR
【速读】:该论文旨在解决脑部磁共振成像(MRI)基础模型中临床信息编码难以解释的问题,尤其是传统稀疏自编码器(Sparse Autoencoders, SAEs)在深度Transformer层中易发生特征坍缩(feature collapse),以及阿尔茨海默病(Alzheimer’s Disease, AD)研究中年龄混杂因素对临床变量标注可靠性的影响。其解决方案的关键在于提出GeoSAE框架,该框架利用基础模型学习到的流形结构(manifold structure)来抑制特征坍缩,并通过去年龄混杂的部分相关性分析(age-deconfounded partial correlations)对保留特征进行可解释标注。此方法在约1.4万例T1加权MRI数据上验证了其有效性,仅用2%的嵌入维度即可预测轻度认知障碍(MCI)向AD转化(AUC=0.746),且特征具有跨队列稳定性与神经解剖学一致性,证明几何引导的稀疏自编码器能够从冻结的脑部MRI基础模型中提取出可解释的生物标志物。
链接: https://arxiv.org/abs/2605.01829
作者: Favour Nerrise(1),Lucy Yin(1),Mohammad H. Abbasi(1),Kilian M. Pohl(1),Ehsan Adeli(1) ((1) Stanford University)
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR Workshop on Computer Vision for Clinical Applications (CV4Clinical) 2026, 9 pages, 5 figures, 2 tables, for associated code, see this https URL
Abstract:Brain MRI foundation models learn rich representations of anatomy, but interpreting what clinical information they encode remains an open problem. Standard sparse autoencoders (SAEs) suffer from severe feature collapse in deep transformer layers, and in Alzheimer’s disease (AD) research, aging confounds nearly every clinical variable, making naive annotation unreliable. We propose GeoSAE, a geometry-guided SAE framework that uses the foundation model’s learned manifold structure to prevent feature collapse and annotates each surviving feature via age-deconfounded partial correlations. Applied to ~14k T1-weighted MRI scans from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Australian Imaging biomarkers and Lifestyle (AIBL) datasets, GeoSAE identifies a compact, fully interpretable feature set that predicts mild cognitive impairment (MCI)-to-AD conversion (AUC 0.746) using only 2% of the embedding dimensions, while comorbidity-annotated features achieve only chance-level performance. The identified features replicate across cohorts without retraining (r=0.97) and localize to neuroanatomically distinct regions consistent with Braak staging. This shows that geometry-guided SAEs can extract interpretable, biomarkers from frozen brain MRI foundation models.
[CV-106] Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering ICML2026
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在区域级视觉指代理解上的局限性,尤其是当多个区域被同时引用或需要全局上下文信息以实现精准指代时的表现不佳问题。其解决方案的关键在于提出一种无需训练的“上下文潜在引导”(Contextual Latent Steering, CSteer)方法:通过预先计算隐式表征视觉指代行为的上下文向量(如区域间的区分度和对全局场景的关注),并在推理阶段进行表示编辑,从而引导通用LMM实现多区域的语境化指代,且无需昂贵的微调或架构修改。实验表明,该方法显著优于专门设计的指代模型,并在该领域达到新的最先进水平。
链接: https://arxiv.org/abs/2605.01827
作者: Yun Xing,Hanyuan Liu,Jiahao Nie,Shijian Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Large Multimodal Models (LMMs) have recently demonstrated their proficiency in holistic visual comprehension. However, most of them struggle to tackle region-level perception guided by visual prompts, especially for cases where multiple regions are referred simultaneously, or scenarios where global contexts are necessary for precise visual referring. We introduce Contextual Latent Steering (CSteer), a training-free approach for guiding general LMMs to refer multiple regions contextually, without expensive fine-tuning or architectural modifications. CSteer starts with pre-computing contextual vectors that implicitly represent visual referring behaviors, such as differentiation among regions and attention to global contexts, followed by representation editing during inference time. Experimental results on multiple datasets indicate that general LMMs with CSteer outperform tailored referring LMMs in most cases, suggesting a promising solution in training-free, and setting new state-of-the-art for this field. Code is available at this https URL.
[CV-107] Hybrid Visual Telemetry for Bandwidth-Constrained Robotic Vision: A Pilot Study with HEVC Base Video and JPEG ROI Stills
【速读】:该论文旨在解决带宽受限的机器人与监控系统中视频流传输与下游机器感知之间的不匹配问题:低比特率视频虽能保留运动信息和粗粒度场景上下文,但会丢失细粒度局部细节,从而影响目标识别与决策的可靠性。其解决方案的关键在于提出一种双通道视觉遥测(two-channel visual telemetry)架构,即通过连续低比特率视频流提供动态场景理解,同时结合选择性传输高细节静态感兴趣区域(Region of Interest, ROI)图像来支持近距离识别与分析,从而在总通信预算受限的情况下提升感知性能。该方案采用x265/HEVC编码器实现基础视频流,JPEG格式传输ROI静止图像,形成一个可复现且实用的混合传输范式。
链接: https://arxiv.org/abs/2605.01826
作者: Natalia Trukhina,Vadim Vashkelis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 7 pages, 2 figures, 4 tables
Abstract:Bandwidth-constrained robotic and surveillance systems often rely on a single compressed video stream to support both continuous scene awareness and downstream machine perception. In practice, this creates a mismatch: low-bitrate video can preserve motion and coarse context, but often loses the fine local detail needed for reliable object recognition and decision-making. Motivated by a hybrid architecture in which low-resolution video supports dynamic scene understanding while eventdriven high-detail regions of interest (ROIs) support close-up identification and analytics, this paper formalizes a two-channel visual telemetry scheme in which a continuous low-bitrate video stream is augmented by selectively transmitted high-detail still ROIs. This first paper does not attempt to prove the superiority of a new still-image codec. Instead, it establishes the hybrid transmission paradigm itself using a practical and reproducible codec stack: x265/HEVC for the base video stream and JPEG stills for ROI refinement. We formulate the problem as bitrate-constrained information selection for robotic vision and define an experimental protocol in which video-only and hybrid schemes are compared under matched total communication budgets. The study is designed around UAV-oriented datasets, two practical bitrate regimes, several ROI triggering policies, and object-level classification refinement on selectively transmitted ROI stills. The resulting paper lays the methodological foundation for a second-stage investigation of JPEG AI as the semantic still-image channel within the same hybrid architecture.
[CV-108] Cross-Domain Adversarial Augmentation: Stabilizing GANs for Medical and Handwriting Data Scarcity
【速读】:该论文旨在解决视觉任务中因数据稀缺导致的模型性能瓶颈问题,特别是在低资源场景下,如孟加拉语手写字符识别和胸部X光图像分析。其解决方案的关键在于利用生成对抗网络(Generative Adversarial Networks, GANs)进行生成式增强(generative augmentation),通过训练DCGAN风格的模型在64×64分辨率下生成合成数据,并将其与真实数据混合用于下游分类器训练。实验表明,该方法显著提升了样本多样性并带来了在小样本条件下的稳定性能增益,同时通过梯度惩罚目标和谱归一化等技术增强了训练稳定性,并系统评估了合成数据比例、样本过滤策略及医学图像评估中的潜在风险与局限性。
链接: https://arxiv.org/abs/2605.01815
作者: Md. Sohanuzzaman Soad,Mahady Al Hady,S M Rafiuddin Rifat,Sudip Ghose
机构: University of Asia Pacific (亚洲太平洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 Pages, 15 figures, 3 tables
Abstract:Generative Adversarial Networks (GANs) offer a pragmatic route to mitigate data scarcity in vision tasks. We study generative augmentation across two low-resource domains: Bangla handwritten characters and chest X-ray imaging using DCGAN-style models trained at 64x64 resolution. We evaluate fidelity and diversity via Inception Score (IS), Fr’echet Inception Distance (FID), and embedding visualizations (t-SNE/UMAP), and assess downstream utility by training classifiers on real versus real synthetic data. Our experiments show that generative augmentation improves sample diversity and yields consistent gains in classifier performance under limited-data regimes. We analyze stability enhancements (e.g., gradient-penalized objectives and spectral normalization) and report ablations on synthetic-to-real ratios and sample filtering. We discuss evaluation caveats for medical images, dataset licensing, and privacy risks associated with synthetic data. The resulting protocol is simple to reproduce and provides a strong baseline for applying generative augmentation to resource-constrained imaging tasks.
[CV-109] Embody4D: A Generalist 4D World Model for Embodied AI
【速读】:该论文旨在解决当前具身世界模型(embodied world models)普遍局限于二维表示、缺乏多视角信息以支持具身空间推理的问题。核心挑战包括多视角数据稀缺、生成三维几何结构的时空一致性难以维持,以及对操作细节易产生幻觉。解决方案的关键在于提出Embody4D,一个专用于具身场景的视频到视频世界模型:首先通过引入3D感知的组合合成流程构建异构数据集,缓解数据稀缺问题;其次设计自适应噪声注入策略,利用图像区域置信度差异选择性正则化扩散过程,保障时空一致性;最后引入交互感知注意力机制,显式关注机器人交互区域,提升操作细节的真实性。
链接: https://arxiv.org/abs/2605.01799
作者: Peiyan Tu,Hanxin Zhu,Jingwen Sun,Shaojie Ren,Cong Wang,Jiayi Luo,Xiaoqian Cheng,Zhibo Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World models have made significant progress in modeling dynamic environments; however, most embodied world models are still restricted to 2D representations, lacking the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily due to challenges from severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios, capable of synthesizing arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, guaranteeing broad generalization. Second, to enforce geometric stability, we devise an adaptive noise injection strategy; by leveraging confidence disparities across image regions, this method selectively regularizes the diffusion process to ensure strict spatiotemporal consistency. Finally, to guarantee manipulation fidelity, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments demonstrate that Embody4D achieves state-of-the-art performance, serving as a robust world model that synthesizes high-fidelity, view-consistent videos to empower downstream robotic planning and learning.
[CV-110] MedScribe: Clinically Grounded CT Reporting through Agent ic Workflows
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在自动化胸部CT影像报告生成中因依赖全局嵌入压缩而导致的幻觉性描述和解剖定位不准确的问题。其解决方案的关键在于提出MedScribe框架,将报告生成重构为一种假设驱动的迭代证据获取过程,而非单次编码任务;该框架通过大语言模型动态调用特定病理诊断工具提取局部体积特征,并利用结构化特征查询与病理特异性文本证据对齐的多维检索空间,在合成前显式累积定量证据,从而实现细粒度的解剖学锚定并减少未经支持的陈述,无需任务特定微调即可提升临床准确性、事实一致性和可解释性。
链接: https://arxiv.org/abs/2605.01779
作者: Giuseppe A. Orlando,Paolo Papotti,Maria A. Zuluaga,Olivier Humbert,Marco Lorenzi
机构: Inria Centre at Université Côte d’Azur (法国国家信息与自动化研究院地中海中心); EURECOM (欧洲电信学院); Dep. of Nuclear Medicine, Centre Antoine Lacassagne (安托万·拉卡桑中心核医学系); Université Côte d’Azur, CNRS, Inserm, iBV (蔚蓝海岸大学、法国国家科学研究中心、法国国家健康与医学研究院、生物医学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) have shown potential for automated radiology report generation, yet existing approaches rely on global embedding compression of volumetric data, often leading to hallucinated findings and limited anatomical grounding in 3D CT imaging. We introduce MedScribe, a hypothesis-driven framework that reformulates report generation as an iterative evidence acquisition process rather than a single-pass encoding task. MedScribe models reporting as a sequential decision process in which a large language model dynamically invokes pathology-specific diagnostic tools to extract localized volumetric features. These structured features are used to query a multidimensional retrieval space aligned with pathology-specific textual evidence. By explicitly accumulating quantitative evidence prior to synthesis, the framework enforces fine-grained grounding and reduces unsupported claims. Without task-specific fine-tuning, MedScribe improves clinical accuracy, factual consistency, and interpretability on CT-RATE and RadChestCT compared to state-of-the-art 2D and 3D VLMs, demonstrating the value of hypothesis-driven reasoning for reliable medical image reporting.
[CV-111] Mitigating Multimodal LLM s Hallucinations via Relevance Propagation at Inference Time
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中因模态利用失衡而导致的幻觉问题,即模型过度依赖文本语言先验而忽视感知输入(如视觉或音频)的 grounded 证据。解决方案的关键在于提出一种无需训练的框架 LIME(Learning Inference-time Modality Enhancement),其核心机制是通过 Layer-wise Relevance Propagation (LRP) 量化 token 级别的贡献,并设计基于相关性的目标函数,在推理阶段通过对 key-value 表示的更新来增强对感知输入的依赖,从而提升多模态 grounding 能力,且不改变模型参数或需要额外训练数据。
链接: https://arxiv.org/abs/2605.01766
作者: Itai Allouche,Joseph Keshet
机构: Technion (以色列理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:
Abstract:Multimodal large language models (MLLMs) have revolutionized the landscape of AI, demonstrating impressive capabilities in tackling complex vision and audio-language tasks. However, a critical challenge remains: these models often suffer from hallucinations, generating outputs that diverge from the provided perceptual inputs. This tendency stems from an inherent imbalance in modality utilization during inference, where the dominance of textual tokens undermines the potential of perceptual inputs. As a result, the model frequently resorts to textual language priors at the expense of grounded evidence. To tackle this issue, we propose Learning Inference-time Modality Enhancement (LIME), a training-free framework designed to bolster multimodal grounding by explicitly enhancing modality usage during decoding. LIME leverages Layer-wise Relevance Propagation (LRP) to quantify token-level contributions and defines a relevance-based objective that promotes increased reliance on perceptual inputs. This objective is enforced through inference-time updates to the model’s key-value representations, without modifying model parameters or requiring additional training data. We evaluate LIME across multiple multimodal benchmarks in both vision and audio domains, demonstrating consistent reductions in hallucinations and enhanced grounding while preserving generation quality. Further analysis shows that LIME increases modality contribution and produces more localized and semantically aligned relevance patterns.
[CV-112] rajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks
【速读】:该论文旨在解决文本到视频(Text-to-Video, T2V)生成模型在安全防护方面的三大挑战:显式不安全提示、对抗性改写(jailbreak)攻击以及时间上涌现的风险(temporally emergent risk),后者指原本看似无害的提示因模型在时间维度上的语义扩展而生成有害内容。解决方案的关键在于提出一种无需训练、仅在推理阶段生效的防御框架TrajShield,其核心思想是将T2V安全问题建模为在时序结构语义空间中的因果干预:通过模拟提示的潜在演化轨迹,定位风险的因果源头,并施加最小侵入性的语义重写,从而在保留无关安全语义的前提下有效消除风险。
链接: https://arxiv.org/abs/2605.01761
作者: Quanchen Zou,Nizhang Li,Wenxin Zhang,Jiaye Lin,Yangchen Zeng,Xiangzheng Zhang,Zonghao Ying
机构: 360 AI Security Lab(360人工智能安全实验室); Macau University of Science and Technology(澳门科技大学); University of Chinese Academy of Sciences(中国科学院大学); Tsinghua University(清华大学); Southeast University(东南大学); Beihang University(北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-Video (T2V) models have demonstrated remarkable capability in generating temporally coherent videos from natural language prompts, yet they also risk producing unsafe content such as violence or explicit material. Existing prompt-level defenses are largely inherited from text-to-image safety and operate on the lexical surface of the input, making them vulnerable to jailbreak attacks that disguise harmful intent through rephrasing or adversarial prompting. Moreover, T2V generation introduces a distinctive challenge overlooked by prior work: temporally emergent risk, where a seemingly benign prompt leads to unsafe content through the generator’s temporal extrapolation toward narrative coherence. We propose \method, a training-free, inference-time defense framework that reformulates T2V safety as a causal intervention in a temporally structured semantic space. TrajShield handles explicit unsafe prompts, jailbreak attacks, and temporally emergent risks in a unified manner by simulating the implied trajectory of a prompt, localizing the causal origin of potential risk, and applying a minimally invasive rewrite that neutralizes the risk while preserving safety-irrelevant semantics. Experiments on T2VSafetyBench across 14 safety categories and multiple T2V backends demonstrate that TrajShield achieves state-of-the-art defenseive performance while maintaining high semantic fidelity, substantially outperforming existing defenses, with an average ASR reduction of 52.44%.
[CV-113] PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
【速读】:该论文旨在解决场景级点云自监督学习(Scene-level Point Cloud Self-Supervised Learning, PC-SSL)中因样本独立建模范式导致的跨场景语义表示不一致问题,该问题阻碍了统一且可迁移的语义空间构建。解决方案的关键在于提出一种基于跨样本语义传播(Cross-sample Semantic Propagation, CSP)的PC-SSL框架,通过将批次内样本序列化并输入状态空间模型(State-space Model),显式建模状态空间中样本间的动态依赖关系,从而在潜在空间中建立跨样本语义一致性并实现全局语义对齐;此外,为缓解序列化预训练带来的批处理依赖性问题,进一步引入非对称语义保持蒸馏(Asymmetric Semantic Preservation Distillation, SPD),利用异构输入机制与语义特征对齐约束,在微调阶段保障语义迁移的结构一致性,使模型在单场景测试条件下仍具备鲁棒的语义一致性表现。
链接: https://arxiv.org/abs/2605.01759
作者: Xinxing Yu,Ajian Liu,Sunyuan Qiang,Hui Ma,Liying Yang,Yuzhong Wang,Zhi Rao,Yanyan Liang
机构: Macau University of Science and Technology (澳门科技大学); Southwest Institute of Technical Physics (西南技术物理研究所); The Institute of Automation of the Chinese Academy of Sciences (中国科学院自动化研究所); Great Bay University (大湾区大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: conference
Abstract:Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances in the field through existing methods, the sample-independent modeling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space and achieve global semantic alignment. Since serialization-based pretraining requires batch-level input organization, we further introduce an asymmetric semantic preservation distillation (SPD) during finetuning to achieve structural alignment of semantic transfer and eliminate inconsistencies caused by batch dependency. The proposed SPD ensures stable transfer of pretrained semantics through a heterogeneous input mechanism and a semantic feature alignment constraint. This enables the model to maintain structured semantic consistency and robustness under single-scene testing conditions. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods in both performance and semantic consistency.
[CV-114] Profile-Specific 3DMM Regression from a Single Lateral Face Image CVPR2026
【速读】:该论文旨在解决从单张侧脸RGB图像中实现高精度3D人脸重建的问题,尤其针对正畸临床中基于侧位X光片的头影测量标志点分析所面临的辐射暴露风险。传统方法依赖2D特征(如眼睛、嘴部、耳廓和轮廓线)进行标志点检测,未能充分利用面部轮廓与下颌线等关键3D几何信息;同时,现有基于学习的3D可变形模型(3DMM)回归器多在近正面视角(yaw ≈ 0°)训练和评估,难以应对极端侧脸视角(yaw ≈ 90°)下因遮挡严重导致的边界线索主导问题。解决方案的关键在于:首先构建了ProfileSynth数据集,通过在极端偏航角范围内采样FLAME形状与姿态参数,并利用条件扩散模型生成逼真的侧脸图像(以深度图和法向量图为引导);其次提出一个面向侧脸特性的FLAME回归基线,并引入可见性感知的下颌线正则化策略,从而有效提升极端侧脸视角下的3D重建精度,为非侵入式头影测量分析提供了一个实用且可扩展的基准框架。
链接: https://arxiv.org/abs/2605.01746
作者: Taiki Kanaya,Hideo Saito
机构: Keio University (庆应义塾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CV4Clinic Workshop at CVPR 2026. Project page: this https URL
Abstract:Single-image 3D face reconstruction is a core problem in computer vision, with important clinical applications such as cephalometric landmark analysis in orthodontics. Traditionally, this analysis relies on lateral X-ray imaging; however, frequent X-ray exposure is impractical due to radiation concerns. While recent research has explored detecting landmarks from lateral RGB images as an alternative, existing methods typically rely on 2D features such as the eyes, mouth, ears, and boundary silhouettes, failing to fully exploit the underlying 3D facial geometry spanning the facial profile and jawline, which is essential for accurate diagnosis. Meanwhile, although 3D face reconstruction from frontal views has seen significant progress, most learning-based 3D morphable model (3DMM) regressors are developed and benchmarked on near-frontal images, where appearance cues are abundant. In extreme profile views (yaw \approx 90^\circ ), much of the face is occluded, and the available signal is dominated by boundary cues, making accurate 3D reconstruction challenging. In this paper, we bridge this gap with geometry-conditioned synthetic data and a simple profile-specific FLAME regression baseline for single lateral images. We introduce ProfileSynth, a dataset created by sampling FLAME shape and pose parameters in extreme yaw ranges and generating photorealistic profile images using a diffusion model conditioned on depth and normal maps. We further study a profile-specific baseline with visibility-aware jawline regularization. Our framework provides a practical baseline for “profile \times 3DMM” reconstruction and a promising foundation for more accurate, non-invasive cephalometric analysis from lateral RGB images.
[CV-115] MOC-3D: Manifold-Order Consistency for Text-to-3D Generation
【速读】:该论文旨在解决当前基于Score Distillation Sampling(SDS)的文本到3D生成方法中存在的两大核心问题:一是宏观拓扑不一致性(如Janus问题),源于2D扩散先验的视角偏差;二是微观几何不连续性,由高Classifier-Free Guidance(CFG)带来的梯度噪声所引发。解决方案的关键在于提出MOC-3D框架,通过两个模块协同优化:其一为语义视图顺序约束模块(Semantic View-Order Constraint Module),利用CLIP先验在不同视角间施加单调性排序约束,以引导3D对象全局拓扑结构的一致性;其二为基于流形的特征连续性模块(Manifold-based Feature Continuity Module),借助对称正定(SPD)流形上的黎曼度量,从统计意义上衡量多视角特征分布的距离,从而促进微纹理在多视角间的平滑演化与连续性。此宏-微协同优化机制显著提升了生成3D模型的结构一致性和细节连续性。
链接: https://arxiv.org/abs/2605.01743
作者: Chenyang Fan,Junshi Cheng,Wen Yang,Zihong Li,Wenfeng Zhang,Wei Hu,Yi Zhang,Pan Zeng
机构: Chongqing Normal University (重庆师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:With the burgeoning development of fields such as the Metaverse, Virtual Reality (VR), and Digital Twins, text-to-3D generation has emerged as a research hotspot in both academia and industry. Currently, optimization methods based on Score Distillation Sampling (SDS) utilizing 2D diffusion priors have become the mainstream technological paradigm in this field. However, due to the view bias of 2D priors and the mode-seeking ambiguity combined with gradient noise induced by high Classifier-Free Guidance (CFG), these methods still suffer from macro-topological inconsistency (e.g., the Janus problem) and micro-geometric discontinuity. To address these challenges, we propose MOC-3D, a text-to-3D generation method based on geometric manifold and semantic view-order consistency. Built upon the ScaleDreamer framework, our method incorporates a Semantic View-Order Constraint Module and a Manifold-based Feature Continuity Module. The former aims to rectify macro-topological inconsistency, while the latter focuses on eliminating micro-geometric discontinuity. Specifically, the Semantic View-Order Constraint Module leverages the prior knowledge of CLIP to impose a Monotonicity Rank Constraint on semantic score representations across different views, thereby providing effective guidance for the global topological structure of 3D objects. Meanwhile, the Manifold-based Feature Continuity Module employs the Riemannian Metric on the Symmetric Positive Definite (SPD) manifold. By measuring the distance of feature statistical distributions in the Riemannian space, it promotes the smooth evolution and continuity of micro-textures across multi-views in a statistical sense. Under the macro-micro synergistic optimization of these two modules, our model can simultaneously improve macro-structural consistency and micro-detail continuity. Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM) Cite as: arXiv:2605.01743 [cs.CV] (or arXiv:2605.01743v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.01743 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3805622.3810761 Focus to learn more DOI(s) linking to related resources
[CV-116] Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在资源受限的工业环境中部署困难的问题,其核心挑战包括高计算成本、内存占用大和能耗高。为实现高效部署同时保持准确率,作者提出了一种多维协同优化框架,关键在于联合优化三个互补维度:模型架构(通过神经架构搜索 AutoFormer 识别紧凑骨干网络)、token 处理(利用 ToMe 方法进行 token 合并以减少信息处理量)以及位宽精度(采用 fp16 混合精度推理加速单操作执行)。该方案在 ImageNet-1K 上验证了压缩效率与精度的权衡,并进一步应用于半导体封装缺陷检测的实际工业任务,实现了超过 10 倍的吞吐量提升及参数量、浮点运算次数(FLOPs)和能耗的显著降低,同时维持下游任务所需的准确率。
链接: https://arxiv.org/abs/2605.01742
作者: Phat Nguyen,Xue Geng,Kaixin Xu,Wang Zhe,Xulei Yang,Ngai-Man Cheung
机构: Singapore University of Technology and Design (SUTD); Agency for Science, Technology, and Research (A*STAR)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) have achieved strong performance in visual recognition, yet their deployment in resource-constrained industrial environments remains limited. Some main challenges are their high computational cost, memory requirement, and energy consumption. While individual efficiency techniques such as neural architecture search (NAS), token compression, and low-precision inference have been extensively studied, most prior work targets only a single optimization axis, limiting overall deployment gains while preserving accuracy. In this paper, we present one of the first holistic frameworks that jointly optimizes three complementary axes: architecture, token, and bit-width. Specifically, the framework identifies compact backbones via Neural Architecture Search (AutoFormer), reduces information processing via token merging (ToMe), and accelerates per-operation execution via fp16 mixed-precision inference. Starting from a DeiT-B/16 baseline, we first analyze accuracy-efficiency trade-offs on ImageNet-1K under aggressive compression. Then, we apply the selected configurations to a real-world in-house 3D X-ray semiconductor defect classification dataset for IC chip packaging inspection. Results show that the proposed multi-axis framework achieves more than 10 times improvement in throughput along with over 10 times reductions in parameter count, FLOPs, and energy consumption, while maintaining the required accuracy on the downstream industrial task. To the best of our knowledge, this is among the earliest works to jointly optimize architecture, token, and bit-width dimensions in ViTs and the first such resource-efficient, deployment-focused study tailored to semiconductor manufacturing.
[CV-117] Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis
【速读】:该论文旨在解决生成式 AI (Generative AI) 在牙科锥形束计算机断层扫描(Cone Beam Computed Tomography, CBCT)三维分析中因标注数据稀缺而导致模型泛化能力受限的问题。其核心挑战在于标准自监督学习(Self-supervised Learning, SSL)方法如掩码图像建模(Masked Image Modeling, MIM)采用随机掩码策略,无法聚焦于具有诊断价值的关键区域(如细微病灶或复杂解剖边界),从而限制了模型对三维结构变化的表征能力。解决方案的关键是提出一种新颖的自适应掩码策略 ATMask:通过计算跨切片纹理变化图识别高结构复杂度区域,并在预训练阶段有选择性地掩码这些区域,迫使模型学习更丰富的上下文信息以推断复杂的三维形态演变,从而实现更高效、强大的表示学习。
链接: https://arxiv.org/abs/2605.01741
作者: Xinquan Yang,Jianfeng Ren,Xuguang Li,Kian Ming Lim,He Meng,Linlin Shen,Yongqiang Deng
机构: Shenzhen University (深圳大学); University of Nottingham Ningbo China (诺丁汉大学宁波分校); Shenzhen University General Hospital (深圳大学总医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cone Beam Computed Tomography (CBCT) is pivotal for 3D diagnostic imaging in dentistry. However, the development of robust AI models for volumetric analysis is often constrained by the scarcity of large, annotated datasets. Self-supervised learning (SSL), particularly Masked Image Modeling (MIM), offers a promising pathway to leverage unlabeled data. A limitation of standard MIM is its reliance on random masking, which fails to prioritize diagnostically critical regions in dental CBCT volumes, such as subtle pathological changes and intricate anatomical boundaries. To address this, we propose ATMask, a novel adaptive masking strategy. Instead of applying random masks or employing computationally intensive attention modules, ATMask computes an inter-slice texture variation map to identify regions with high structural or textural complexity. These high-variation areas are then selectively masked during pre-training, compelling the model to learn richer contextual representations essential for inferring complex 3D morphological transitions. Furthermore, we contribute the first large-scale CBCT dataset, curated from both public and private sources, comprising 6,314 scans, for the dental AI model pretraining. Extensive experiments on three downstream dental CBCT tasks demonstrate that our ATMask enables more data-efficient and powerful representation learning than standard random masking and other advanced SSL baselines. The dataset and code will be released.
[CV-118] Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning CVPR2026
【速读】:该论文旨在解决现有语义地图构建方法在显式几何表达与多尺度语义信息之间存在权衡、且缺乏与大模型原生接口的问题,导致需额外训练特征投影以实现语义对齐。其解决方案的核心是提出多尺度高斯-语言地图(GLMap),关键设计包括:(1) 显式几何表示,(2) 覆盖实例和区域概念的多尺度语义建模,以及 (3) 双模态接口——每个语义单元同时存储自然语言描述和三维高斯表示(3D Gaussian representation)。该结构支持紧凑存储与基于高斯泼溅(Gaussian splatting)的快速任务相关图像渲染,并通过高斯估计器实现无需梯度优化的增量式高效构建,从而在ObjectNav、InstNav和SQA任务中显著提升目标导航与情境推理能力,且可零样本兼容大模型方法。
链接: https://arxiv.org/abs/2605.01736
作者: Sixian Zhang,Yiyao Wang,Xinhang Song,Keming Zhang,Zijian Xu,Shuqiang Jiang
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing(中国科学院计算技术研究所人工智能安全重点实验室,北京); University of Chinese Academy of Sciences, Beijing(中国科学院大学,北京); Institute of Computing Technology, Chinese Academy of Sciences, Beijing(中国科学院计算技术研究所,北京)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Understanding the geometric and semantic structure of environments is essential for embodied navigation and reasoning. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target navigation and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner. The code is available at this https URL.
[CV-119] GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在进行 grounded reasoning 时存在的对象幻觉(object hallucination)问题,特别是针对现有方法中将自动生成的图像描述(caption)视为统一正样本资源的做法所导致的性能下降现象。研究发现,直接嵌入这些 caption 可能会显著降低模型准确性(如 Qwen2.5-VL-3B 在 HallusionBench 上下降近 10 分),其根本原因在于 caption 的结构特性:一方面它不仅影响最终答案,还塑造推理路径与词汇选择;另一方面,错误类型具有不对称性——遗漏远多于虚构,但每条虚构错误对单个实例的影响更大。解决方案的关键是提出 GEASS(Gated Evidence-Aware Selective Steering),一个无需训练的模块,通过三个机制动态决定每个查询应吸收多少 caption 信息:基于干净路径置信度进行门控、依据熵减少量加权 caption、并在两条推理路径分歧时提高证据阈值,从而实现更鲁棒的推理控制。
链接: https://arxiv.org/abs/2605.01733
作者: Zeshang Li,Shuoyang Zhang,Jiashen Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures
Abstract:Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help–dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions anchor not only the model’s final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions vastly outnumber fabrications, yet each fabrication carries a much larger per-instance impact. A caption’s usefulness is therefore a per-query property, not a per-corpus one. We propose GEASS (Gated Evidence-Aware Selective Steering), a training-free module that decides on each query how much of the caption the model consumes: it gates the caption by the clean path’s confidence, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments on POPE and HallusionBench across four VLMs show that GEASS consistently improves over vanilla inference and contrastive decoding, with only two extra forward passes per query.
[CV-120] Motion-Aware Caching for Efficient Autoregressive Video Generation
【速读】:该论文旨在解决自回归视频生成(autoregressive video generation)中因逐帧迭代去噪带来的高计算开销问题。现有缓存复用策略依赖粗粒度的块级跳过机制,无法捕捉像素级别的运动动态,导致静态区域过度跳过或运动区域跳过不足,进而引发误差累积。其解决方案的关键在于提出MotionCache框架,通过引入运动感知的缓存复用机制,利用帧间差异作为轻量级代理来估计像素级运动特性,并采用“粗到精”的策略:先通过初始预热阶段确保语义一致性,再基于运动权重动态调整每个token的缓存更新频率,从而实现精准的细粒度缓存跳过。实验表明,该方法在SkyReels-V2和MAGI-1模型上分别实现6.28×和1.64×的加速,同时保持生成质量几乎不变(VBench得分下降分别仅1%和0.01%)。
链接: https://arxiv.org/abs/2605.01725
作者: Jing Xu,Yuexiao Ma,Songwei Liu,Xuzhe Zheng,Shiwei Liu,Chenqian Yan,Xiawu Zheng,Rongrong Ji,Fei Chao,Xing Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of \textbf6.28\times and \textbf1.64\times respectively, while effectively preserving generation quality (VBench: 1%\downarrow and 0.01%\downarrow respectively). The code is available at this https URL.
[CV-121] Dual-branch Robust Unlearnable Examples ICML2026
【速读】:该论文旨在解决现有不可学习样本(Unlearnable Examples, UEs)在面对先进防御机制时鲁棒性不足的问题,其根源在于现有方案依赖启发式设计或局限于特定域的扰动。解决方案的关键在于提出一种双分支不可学习集成(DUNE)方法,通过在空间域和颜色域分别优化扰动,建立扰动与迁移导致标签之间的映射关系;该设计扩展了扰动域以提升噪声强度,并促使模型学习扰动导向特征从而降低泛化能力,实现更强的不可学习性。此外,引入基于预训练模型的集成策略进一步增强扰动优化效果,在CIFAR-10和ImageNet上验证了其对7种主流防御机制的优越鲁棒性,平均测试准确率降至14.95%至50.82%。
链接: https://arxiv.org/abs/2605.01718
作者: Xianlong Wang,Hangtao Zhang,Wenbo Pan,Ziqi Zhou,Changsong Jiang,Li Zeng,Xiaohua Jia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Unlearnable examples (UEs) aim to compromise model training by injecting imperceptible perturbations to clean samples. However, existing UE schemes exhibit limited robustness against advanced defenses due to their heuristic design or narrowly scoped domain perturbations. To address this, we propose \textttDUNE, a \underline\textbfDual-branch \underline\textbfUNlearnable \underline\textbfEnsemble perturbation optimization approach. Specifically, \textttDUNE separately optimizes perturbations in the spatial and color domains to establish the mapping between perturbations and shift-induced labels. This design extends the perturbation domain to increase noise intensity for improving robustness and drives the models to learn perturbation-oriented features with degraded generalization, thereby achieving unlearnability. To strengthen \textttDUNE’s performance, we further propose an unlearnability-enhancing ensemble strategy that aggregates diverse pre-trained models during the dual-branch optimization. Extensive experiments on benchmark datasets CIFAR-10 and ImageNet verify that \textttDUNE’s robustness outperforms 12 SOTA UE schemes under 7 mainstream defenses, yielding a lower average test accuracy of 14.95% to 50.82%.
[CV-122] Linear-Time Global Visual Modeling without Explicit Attention
【速读】:该论文旨在解决Transformer模型中注意力机制因显式计算注意力权重而导致的二次计算复杂度问题,这一瓶颈限制了其在长序列任务中的效率。解决方案的关键在于将注意力机制重新数学建模为一种带有动态预测参数的多层感知机(Multi-Layer Perceptron, MLP),从而将原本显式的token间聚合过程转化为隐式的全局上下文压缩表示——即通过动态生成的参数实现对全局信息的高效编码。实验表明,仅依赖动态参数化即可在保持线性复杂度的前提下实现与Transformer相当的序列建模能力,为高效序列建模提供了新范式。
链接: https://arxiv.org/abs/2605.01711
作者: Ruize He,Dongchen Han,Gao Huang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention’s global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level sequence global modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at this https URL.
[CV-123] Exploring Entropy-based Active Learning for Fair Brain Segmentation
【速读】:该论文旨在解决医学图像分割中主动学习(Active Learning, AL)方法在提升整体性能的同时忽视群体间公平性的问题,特别是针对具有敏感属性(如性别、种族等)的子群体可能出现的性能差异。其核心挑战在于如何在有限标注预算下,避免因初始数据分布不平衡导致模型对某些亚群体的分割性能显著劣于其他群体。解决方案的关键在于提出一种公平感知的主动学习框架,其中引入加权熵(Weighted Entropy)选择策略,通过动态调整不确定性度量以反映各子群体当前在已标注集上的表现估计;同时,采用受限于感兴趣区域(Region of Interest, ROI)的掩码缩放熵(masked, scaled entropy),将真实认知不确定性(epistemic uncertainty)与解剖结构体积差异分离,从而更精准地识别需优先标注的低性能子群体。实验表明,该方法能显著降低群体间性能差距,在强偏倚和弱偏倚场景下分别减少75%和86%的不公平指标,实现了更高的公平性-性能平衡。
链接: https://arxiv.org/abs/2605.01706
作者: Ghazal Danaee,Mélanie Gaillochet,Christian Desrosiers,Herve Lombaert,Sylvain Bouix
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures. Accepted as a poster at Medical Imaging with Deep Learning (MIDL) 2026. OpenReview submission: 221
Abstract:Active learning (AL) has emerged as a crucial strategy for reducing the prohibitive costs associated with medical image segmentation. However, standard uncertainty-based AL methods typically focus on maximizing performance metrics, ignoring performance disparities or fairness across groups with sensitive attributes. While fair active learning has been explored in classification tasks, its intersection with medical image segmentation remains unaddressed. In this work, we introduced a fairness-aware active learning framework with a Weighted Entropy selection strategy that modulates uncertainty based on current group-specific performance estimates on the labeled set. To decouple true epistemic uncertainty from anatomical volume variances, we further utilized a masked, scaled entropy restricted to the region of interest. The framework was evaluated on synthetic T1-weighted brain MRIs with controlled left caudate bias in both strong and weak bias settings. A 3D U-Net was trained to segment the left caudate under several AL strategies, starting from both demographically balanced and strongly imbalanced initial labeled sets. Experiments demonstrated that our method markedly reduces performance disparities between groups compared to random sampling and standard uncertainty sampling. By prioritizing poorly segmented subgroups during the AL cycles, our method consistently achieved the highest equity-scaled performance and reduced the disparity metric by 75% (strong bias) and 86% (weak bias) relative to standard entropy at the final budget. Overall, this work is among the first studies on fair AL for medical image segmentation, offering an efficient strategy to train more equitable models in resource-constrained environments.
[CV-124] rajRAG : Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation CVPR2026
【速读】:该论文旨在解决零样本目标导航(Zero-shot Object Goal Navigation, ObjectNav)中因依赖互联网规模文本获取常识知识而导致的环境感知局限性问题,以及导航过程中产生的经验数据未被有效积累和复用的问题。其核心解决方案是提出Trajectory RAG(TrajRAG),一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的框架,通过构建拓扑-极坐标(topo-polar)轨迹表示结构来紧凑编码空间布局与语义上下文,并采用分层分块机制组织相似轨迹以实现粗粒度到细粒度的高效检索。在导航过程中,候选前沿点生成多个轨迹假设并查询TrajRAG获取历史相似轨迹,从而引导大模型进行路径点选择;同时新获得的经验持续整合进TrajRAG,实现了终身导航经验的累积与迭代优化。
链接: https://arxiv.org/abs/2605.01700
作者: Yiyao Wang,Sixian Zhang,Keming Zhang,Xinhang Song,Songjie Du,Shuqiang Jiang
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences, Beijing; Institute of Computing Technology, Chinese Academy of Sciences, Beijing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience. To this end, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric-semantic experiences. TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we propose a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, effectively removing redundancies in raw episodic observations. A hierarchical chunking structure further organizes similar topo-polar trajectories into unified summaries, enabling coarse-to-fine retrieval. During navigation, candidate frontiers generate multiple trajectory hypotheses that query TrajRAG for similar past trajectories, guiding large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience. Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric-semantic experiences and improves zero-shot ObjectNav performance.
[CV-125] IMPACT-Scribe: Interactive Temporal Action Segmentation with Boundary Scribbles and Query Planning
【速读】:该论文旨在解决过程性活动视频中密集时间标注(dense temporal annotation)的劳动密集型问题,传统方法依赖反应式工具,每次修正被视为孤立操作,导致无法有效利用标注者不确定性与模型可靠性信息。其解决方案的关键在于提出一种以修正驱动的闭环框架IMPACT-Scribe,通过不确定性感知的边界涂鸦监督(uncertainty-aware boundary scribble supervision)、局部提议建模、成本感知查询规划、结构化传播以及修正驱动适应等模块,使每一次修正都能提升后续人机协作效率与标注质量,从而实现更高质量的边界精度和持续优化的人机协同关系。
链接: https://arxiv.org/abs/2605.01668
作者: Qian Yin,Di Wen,Kunyu Peng,David Schneider,Zeyun Zhong,Alexander Jaus,Zdravko Marinov,Jiale Wei,Ruiping Liu,Junwei Zheng,Yufan Chen,Chen Zhang,Lei Qi,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); INSAIT (INSAIT); Sofia University “St. Kliment Ohridski” (索非亚大学“克莱门特·奥霍里斯基”); ETH Zurich (苏黎世联邦理工学院); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures. Code is available at this https URL
Abstract:Dense temporal annotation of procedural activity videos is vital for action understanding and embodied intelligence but remains labor-intensive due to reactive tools. Each correction is treated as an isolated edit, limiting reuse of information on annotator uncertainty and model reliability. We introduce IMPACT-Scribe, a correction-driven framework for dense labeling that uses each correction to improve future human-machine collaboration. IMPACT-Scribe combines uncertainty-aware boundary scribble supervision, local proposal modeling, cost-aware query planning, structured propagation, and correction-driven adaptation. Experiments and a human study show that this closed-loop design improves labeling quality per effort, enhances boundary accuracy, and fosters better human-machine interaction over time. The code will be made publicly available at this https URL.
[CV-126] Deep neural networks with Fisher vector encoding for medical image classification
【速读】:该论文旨在解决卷积神经网络(CNN)在小样本数据下性能受限,以及纯CNN模型存在局部性偏差(locality bias)的问题,同时探索如何将更丰富的特征表示方法与混合CNN+视觉Transformer(Vision Transformer, ViT)架构相结合,以提升模型在不同规模数据集上的泛化能力。其解决方案的关键在于引入Fisher向量(Fisher Vectors)作为无序编码方法,用于增强混合CNN+ViT架构的特征表达能力,并提出一种限制高斯混合模型(Gaussian Mixture Model, GMM)估计计算成本随数据规模增长的方法,从而使得该编码策略在大规模数据场景下仍具可行性。
链接: https://arxiv.org/abs/2605.01667
作者: Lucas O. Lyra,Antonio E. Fabris,Joao B. Florindo
机构: University of São Paulo (圣保罗大学); Institute of Mathematics and Statistics, University of São Paulo (圣保罗大学数学与统计研究所); State University of Campinas (坎皮纳斯州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Orderless encoding methods have shown to improve Convolutional Neural Networks (CNNs) for image classification in the context of limited availability of data. Additionally, hybrid CNN + Vision Transformers (ViT) models have been recently proposed to address CNN locality bias issues. These models outperformed CNN-only approaches. Despite that, the integration of such hybrid models with more elaborated feature representation can be highly beneficial and remains large unexplored in the literature. In this context, we propose the introduction of an orderless encoding method, Fisher Vectors, to hybrid CNN + ViT architectures, aiming at achieving a model suitable for both small and large datasets. Such enconding method relies on estimating a Gaussian Mixture Model (GMM) on image features. In large datasets, computational costs of the GMM estimation is a limiting factor for the application of Fisher Vectors. Thus, we propose a method to limit the growth of GMM estimation costs as we increase the size of the dataset. We explore the feasibility of our method in the context of medical image classification by appling it to MedMNIST (v2), Clean-CC-CCII and ISIC2018. This collection of datasets contains a wide variety of data scales and modalities. We outperform benchmark results in all MedMNIST (v2) datasets and obtain literature-competitive results in Clean-CC-CCII and ISIC2018.
[CV-127] IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction
【速读】:该论文旨在解决从第一人称程序视频中高质量标注人类-物体交互(Human-Object Interaction, HOI)结构化事件图的问题,以支持机器人通过人类示范学习操作技能。其核心挑战在于如何在保证标注准确性的前提下提升标注效率,并避免自动化决策与人工确认之间的冲突。解决方案的关键在于提出IMPACT-HOI这一混合主动性框架,它通过一个基于信任校准的控制器动态选择直接查询、人工确认建议或保守完成策略,结合原子级回滚的风险约束执行协议,确保人工确认的决策不被后续自动化更新覆盖,从而实现高效且可靠的增量式事件状态构建。
链接: https://arxiv.org/abs/2605.01666
作者: Haoshen Zhang,Di Wen,Kunyu Peng,David Schneider,Zeyun Zhong,Alexander Jaus,Zdravko Marinov,Jiale Wei,Ruiping Liu,Junwei Zheng,Yufan Chen,Yufeng Zhang,Yuanhao Luo,Lei Qi,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); INSAIT (INSAIT); Sofia University “St. Kliment Ohridski” (索非亚大学“圣克莱门特·奥赫里德斯基”); ETH Zurich (苏黎世联邦理工学院); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 2 figures. Code is available at this https URL
Abstract:We present IMPACT-HOI, a mixed-initiative framework for annotating egocentric procedural video by constructing structured event graphs for Human-Object Interactions (HOI), motivated by the need for high-quality structured supervision for learning robot manipulation from human demonstration. IMPACT-HOI frames this task as the incremental resolution of a partially specified, onset-anchored event state. A trust-calibrated controller selects among direct queries, human-confirmed suggestions, and conservative completions based on empirical annotator behavior and evidence quality. A risk-bounded execution protocol, utilizing atomic rollback, ensures that human-confirmed decisions are preserved against conflicting automated updates. A user study with 9 participants shows a 13.5% reduction in manual annotation actions, a 46.67% event match rate, and zero confirmed-field violations under the studied protocol. The code will be made publicly available at this https URL.
[CV-128] Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models ICCV2025
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, VLMs)在长视频问答(long-form video question answering, video QA)任务中因采用标准均匀采样(uniform sampling)导致的帧选择效率低、计算成本高且性能易饱和的问题。其解决方案的关键在于引入视频主动感知(Video Active Perception, VAP),这是一种无需训练的方法,将关键帧选择建模为一种主动数据获取过程,利用轻量级文本条件视频生成模型来表征先验世界知识,从而引导模型聚焦于与问题语义相关的高信息量帧。实验表明,VAP在多个长视频推理基准上实现了最先进的零样本性能,并在帧效率上相比GPT-4o、Gemini 1.5 Pro和LLaVA-OV提升高达5.6倍,同时增强了模型的推理能力并提升了关键帧的相关性。
链接: https://arxiv.org/abs/2605.01662
作者: Martin Q. Ma,Willis Guo,Aditya Agrawal,Ankit Gupta,Paul Pu Liang,Ruslan Salakhutdinov,Louis-Philippe Morency
机构: Carnegie Mellon University (卡内基梅隆大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 workshop
Abstract:Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, VLMs face the challenge of selecting frames effectively and efficiently, as standard uniform sampling is expensive and performance may plateau. Inspired by active perception theory, which posits that models gain information by acquiring data that differs from their expectations, we introduce Video Active Perception (VAP), a training-free method to enhance long-form video QA using VLMs. Our approach treats keyframe selection as data acquisition in active perception and leverages a lightweight text-conditioned video generation model to represent prior world knowledge. Empirically, VAP achieves state-of-the-art zero-shot results on long-form or reasoning video QA datasets such as EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, and CLEVRER, achieving an increase of up to 5.6 x frame efficiency by frames per question over standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. Moreover, VAP shows stronger reasoning abilities than previous methods and effectively selects keyframes relevant to questions. These findings highlight the potential of leveraging active perception to improve the frame effectiveness and efficiency of long-form video QA.
[CV-129] RIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning
【速读】:该论文旨在解决视频摘要(video summarization)任务中现有方法依赖昂贵人工标注、跨领域泛化能力弱以及计算成本高的问题,尤其针对无监督和弱监督方法在捕捉长时程时间依赖性和语义结构方面性能不足的瓶颈。其解决方案的关键在于提出一种名为TRIMMER(Temporal Relative Information Maximization for Multi-objectensive Efficient Reinforcement)的自监督强化学习框架,该框架分为两个阶段:首先通过自监督学习获取鲁棒的视频表示,随后利用基于信息论奖励函数的强化学习进行时空决策;与以往基于相似性的目标不同,TRIMMER引入熵基度量以捕获高阶时间动态和语义多样性,并直接在选中帧索引上计算奖励,从而提升计算效率。实验表明,该方法在无监督和自监督类别中达到最先进性能,同时在可扩展性和泛化性上优于主流监督方法。
链接: https://arxiv.org/abs/2605.01659
作者: Pritam Mishra,Coloma Ballester,Dimosthenis Karatzas
机构: Universitat Pompeu Fabra (庞培法布拉大学); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.
[CV-130] Act2See: Emergent Active Visual Perception for Video Reasoning CVPR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在视频推理中依赖静态初始帧、难以动态融入视觉信息的问题,尤其针对现有链式思维(Chain-of-Thought, CoT)方法在整合额外帧信息时质量不佳且无法合成假设或反事实场景视觉证据的局限。解决方案的关键在于提出Act-to-See框架,通过监督微调(Supervised Fine-Tuning, SFT)训练一个高质量推理轨迹数据集,其中包含VLM主动调用现有帧或生成新帧的决策过程,并结合人工标注的CoT进行严格验证,从而在推理阶段使模型具备自主判断何时检索或合成必要视觉证据的能力,实现主动视觉感知(active visual perception),显著提升视频理解性能。
链接: https://arxiv.org/abs/2605.01657
作者: Martin Q. Ma,Yuxiao Qu,Aditya Agrawal,Willis Guo,Paul Pu Liang,Ruslan Salakhutdinov,Louis-Philippe Morency
机构: Carnegie Mellon University (卡内基梅隆大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.
[CV-131] SteeringDiffusion: A Bottlenecked Activation Control Interface for Diffusion Models
【速读】:该论文旨在解决扩散模型(Diffusion Models)中内容与风格控制难以精确、平滑调节的问题,尤其是在保持模型主干(如U-Net)冻结的前提下实现细粒度的激活层级控制。其解决方案的关键在于提出SteeringDiffusion方法:通过学习一个小型、提示条件化的潜在代码,并将其投影为FiLM或AdaGN风格的调制参数,从而在不重新训练模型的情况下,以单个标量在推理阶段连续、平滑地调整内容与风格之间的权衡。该设计采用零初始化确保在无干预时等价于原始模型,同时引入时间步感知门控机制限制调制仅作用于去噪后期,从而实现稳定且可解释的控制路径。
链接: https://arxiv.org/abs/2605.01653
作者: Fangzheng Wu,Brian Summa
机构: Tulane University (杜兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce SteeringDiffusion, a bottlenecked activation-level control interface for diffusion models that exposes a smooth, monotonic, and runtime-adjustable control surface over the content–style trade-off. Our method keeps the U-Net backbone frozen and learns a small, prompt-conditioned latent code projected to FiLM/AdaGN-style modulation parameters. A zero-initialized design guarantees exact equivalence to the base model at zero scale, while timestep-aware gating restricts modulation to later denoising stages. A single scalar at inference continuously traverses the control surface without retraining. Across experiments on Stable Diffusion~1.5 and SDXL covering multiple artistic styles, we show that SteeringDiffusion produces smooth and monotonic content–style trade-offs. Under matched parameter budgets, it outperforms LoRA in controllability and stability, while ControlNet and rank-1 adapters do not expose a comparable control surface. We further introduce an inversion-stability diagnostic based on DDIM inversion, used as a post-hoc trajectory probe, which reveals strong correlations with intervention magnitude. These results position \emphSteering Bottlenecked Explicit Control (S-BEC) as a practical, general-purpose control interface for frozen diffusion backbones.
[CV-132] Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection CVPR2026
【速读】:该论文旨在解决当前多模态深度伪造(Multimodal Deepfakes)在社交媒体环境中日益泛滥所带来的真实性与信息完整性威胁,以及现有基准测试在单模态局限性、简化篡改方式和非现实分布上的不足,导致无法有效评估检测模型在真实场景下的鲁棒性问题。解决方案的关键在于提出Omni-Fake——一个统一的多模态深度伪造检测数据集,包含大规模高质量样本(Omni-Fake-Set)和专门设计的分布外(Out-of-Distribution, OOD)测试集(Omni-Fake-OOD),覆盖图像、音频、视频及音视频说话头像四种模态,并支持联合检测、定位与解释协议;在此基础上进一步开发Omni-Fake-R1,一种基于强化学习驱动的多模态检测器,能够自适应融合视觉与听觉线索,输出结构化决策、定位结果及自然语言解释,从而显著提升检测准确率、跨模态泛化能力与可解释性。
链接: https://arxiv.org/abs/2605.01638
作者: Tianxiao Li,Zhenglin Huang,Haiquan Wen,Yiwei He,Xinze Li,Bingyu Zhu,Wuhui Duan,Congang Chen,Zeyu Fu,Yi Dong,Baoyuan Wu,Jason Li,Guangliang Cheng
机构: University of Liverpool (利物浦大学); University of Exeter (埃克塞特大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Multimodal deepfakes are proliferating on social media and threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 200k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection-localization-explanation protocol. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Project page: this https URL
[CV-133] Unifying Deep Stochastic Processes for Image Enhancement
【速读】:该论文旨在解决当前图像增强领域中基于深度随机过程(deep stochastic processes)的方法缺乏统一理论框架的问题,特别是不同方法在条件随机轨迹建模上的差异与联系尚不清晰。其解决方案的关键在于提出一个统一的连续时间随机微分方程(SDE)视角,将现有方法归类为三类过程:无条件扩散模型、Ornstein-Uhlenbeck (OU) 过程和扩散桥(diffusion bridges),并证明它们均源于同一SDE形式。这一框架明确指出各方法的核心区别体现在漂移项(drift)、扩散项(diffusion)、终态分布及边界条件上,而调度器(scheduler)和采样器(sampler)则属于独立的设计选择。通过此统一性,作者实现了跨任务的受控实验比较,揭示了影响性能的关键设计因素,并开源了ItoVision库以支持快速原型开发与公平比较。
链接: https://arxiv.org/abs/2605.01568
作者: Wojciech Kozłowski,Radosław Kuczbański,Kamil Adamczewski,Karol Szczypkowski,Maciej Zięba
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, in proceesings of the 43rd International Conference on Machine Learning, Seoul, South Korea
Abstract:Deep stochastic processes have recently become a central paradigm for image enhancement, with many methods explicitly conditioning the stochastic trajectory on the degraded input. However, the relationship between these conditional processes and standard diffusion models remains unclear. In this work, we introduce a unified perspective on stochastic image enhancement by classifying recent methods into three families of continuous-time processes: unconditional diffusion models, Ornstein-Uhlenbeck (OU) processes, and diffusion bridges. We show that all of these approaches arise from a common stochastic differential equation (SDE) formulation. This framework makes explicit that seemingly disparate methods differ primarily in their drift and diffusion terms, terminal distributions, and boundary conditions, while schedulers and samplers constitute orthogonal design choices. Leveraging this unification, we conduct a controlled empirical study across multiple image enhancement tasks using identical architectures and training protocols. Our results reveal no consistently dominant method; instead, we identify and disentangle the specific design choices that most strongly influence performance. Finally, we release ItoVision, a modular PyTorch library that implements the unified framework and enables rapid prototyping and fair comparison of stochastic image enhancement methods.
[CV-134] Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation Classification and Detection
【速读】:该论文旨在解决多模态、异构医学影像数据中模型泛化能力弱的问题,尤其是在不同数据分布下分割(segmentation)、分类(classification)和目标检测(object detection)任务性能不稳定的问题。解决方案的关键在于提出一种统一的跨域迁移学习框架,采用教师-学生范式:由一个联合教师模型从多个源数据集中学到领域不变表示(domain-invariant representations),并通过多层次知识蒸馏(multi-level knowledge distillation)指导任务特定的学生模型训练,从而实现对多种医学图像分析任务的通用性增强与性能提升。
链接: https://arxiv.org/abs/2605.01563
作者: Ceausescu Ciprian-Mihai,Anghelina Ion-Marian,Alexe Dumitru-Bogdan
机构: University of Bucharest (布加勒斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal extension from the KES paper
Abstract:We propose a unified cross-domain transfer learning framework that leverages knowledge from multiple heterogeneous medical imaging datasets to improve performance across segmentation, classification, and object detection tasks. Our approach employs a teacher-student paradigm in which a joint teacher model aggregates domain-invariant representations learned from diverse source datasets, while a task-specific student model is trained via multi-level knowledge distillation. Originally developed for medical image segmentation, the framework is extended to support image-level classification and object-level detection, enabling a general multi-task formulation for medical image analysis. We evaluate our method on a broad suite of datasets, including six segmentation benchmarks, BrainMetShare, ISLES, BraTS (MRI) and Lung MSD, LiTS, KiTS (CT), as well as multiple classification datasets for pulmonary disease and dementia, and detection datasets with native bounding-box annotations. Across all tasks and modalities, the proposed approach yields consistent improvements over strong dataset-specific and multi-head baselines, demonstrating enhanced robustness to distributional shifts and superior generalization. These findings highlight the potential of multi-dataset knowledge distillation as a scalable and task-agnostic approach for enhancing segmentation, classification, and object detection performance across heterogeneous medical imaging domains.
[CV-135] Robust Fundamental Matrix Estimation from Single Image Motion Blur
【速读】:该论文旨在解决从单张运动模糊图像中提取基础矩阵(fundamental matrix)的问题,该矩阵编码了相机在曝光期间的三维运动信息。传统方法依赖于清晰图像中的对应点,而本研究创新性地利用模糊图像中由运动引起的条纹路径(smear paths)作为时间域内的对应关系线索,从而实现对相机运动的建模。解决方案的关键在于:首先,提出一种适用于时间方向不确定对应关系的基础矩阵估计方法,克服了经典8点算法因模糊导致的时间方向歧义(per-smear ambiguity)失效的问题;其次,引入对条纹模式预测不确定性的度量,并将其融入估计器的采样过程中,显著提升了基础矩阵估计的鲁棒性。实验表明,该方法可在单帧图像上准确估计出反映3D相机运动的基础矩阵,并在运动分割等下游任务中展现出实际应用价值。
链接: https://arxiv.org/abs/2605.01552
作者: Bao-Long Tran,Per-Erik Forssén,Fredrik Viksten
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures, under submission
Abstract:In this paper, we introduce a challenging task: extracting a fundamental matrix from a single motion blurred image. For a camera moving in 3D during exposure, the smear paths in the blurry image contain cues and constraints on this motion. We demonstrate the feasibility of establishing correspondences between two time instances within the camera exposure window, and that these can be used to robustly infer a fundamental matrix, which summarizes the motion of the camera during the exposure time. The inferred fundamental matrix is unique up to a transpose, corresponding to an ambiguity of the direction of time. Due to this per-smear ambiguity, classic methods, such as the 8-point algorithm, are no longer usable. The proposed method modifies the estimation to work on time-direction ambiguous correspondences. To improve the robustness of the fundamental matrix estimation, we also propose to incorporate an uncertainty measurement in smear pattern prediction and use it in the sampling process of the estimator. Experiments on synthetic and real-world motion-blur datasets demonstrate that our approach is able to estimate the fundamental matrix encoding the 3D camera motion, from single frames. Practical applicability is demonstrated on the downstream task of motion segmentation.
[CV-136] ECG-biometrics-bench: A Unified Framework for Reproducible Benchmarking of ECG Biometrics
【速读】:该论文旨在解决现有心电图(ECG)生物识别研究中因数据泄露(如同一会话内的随机划分)导致性能评估过于乐观的问题,从而误导对真实场景下系统可靠性的判断。其关键解决方案是提出一个模块化、可复现的基准测试框架——ECG-biometrics-bench,该框架统一了预处理、分割和评估流程,并在七个广泛使用的公共ECG数据集上支持闭集与开集(即主体不重叠泛化)评估,以及逐步逼近现实的协议(如跨会话和长期时间分离)。通过这一框架,作者揭示了“随机划分谬误”(Random Split Fallacy),证明了仅使用同会话数据会导致性能虚高,而忽视了由时间漂移和未见身份带来的严重性能下降;同时验证了当前监督特征学习范式下的模型(如DeepECG、ResNet1D、CNN-LSTM)均存在此类问题,表明其根源并非模型结构差异,而是方法论缺陷。此外,研究还提出一种基于动态多会话模板融合的轻量级认证策略,在重注册条件下可部分缓解时间老化带来的性能衰减,为ECG生物识别在真实穿戴设备中的可靠部署提供了更合理的基准与方向。
链接: https://arxiv.org/abs/2605.01548
作者: Milad Parvan
机构: Milad Parvan
Independent Researcher
Milan, Italy
miladparvan72@gmail.com
未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Under review
Abstract:Electrocardiogram (ECG) biometrics have emerged as a promising modality for continuous, liveness-aware authentication in wearable systems. However, many prior studies report overly optimistic results due to data leakage (e.g., random splits within the same session). To address this issue, we introduce ECG-biometrics-bench, a modular, reproducible benchmarking framework that standardizes preprocessing, segmentation, and evaluation across seven widely used public ECG datasets spanning clinical, ambulatory, and large-scale cohort settings. The framework supports both closed-set and open-set (i.e., subject-disjoint generalization in this work) evaluation, as well as progressively realistic protocols including cross-session and long-term temporal separation. To facilitate reproducible research in the community, the ECG-biometrics-bench repository will be made publicly accessible on GitHub upon the acceptance of this manuscript. Through a comprehensive multi-dataset analysis, we expose the Random Split Fallacy, demonstrating that intra-session evaluation protocols artificially inflate performance while masking severe degradation caused by temporal drift and unseen identities. Furthermore, by evaluating multiple architectures, including DeepECG, ResNet1D, and CNN-LSTM, we show that these failures are not model-specific but are likely inherent to current supervised feature-learning paradigms. Finally, we demonstrate that performance degradation due to temporal aging can be partially mitigated through a heavy enrollment, lightweight authentication strategy based on dynamic multi-session template fusion. These findings establish a more realistic baseline for ECG biometrics and highlight critical challenges that must be addressed for reliable real-world deployment.
[CV-137] Certified vs. Empirical Adversarial Robust-ness via Hybrid Convolutions with Attention Stochasticity
【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中长期存在的问题:即在 L₂ 范数下的可证明鲁棒性(provable robustness)与对抗攻击下的实际鲁棒性(empirical robustness)之间的显著差距。现有方法往往难以同时提升两种鲁棒性指标,且常以牺牲干净样本准确率为代价。解决方案的关键在于提出 Hybrid Convolutions with Attention Stochasticity (HyCAS),其核心创新是将确定性与随机性原理融合:通过耦合 1-Lipschitz、谱归一化卷积与两个随机组件——谱归一化随机投影滤波器和随机注意力噪声机制,构建一种随机化防御架构;该设计使整体网络满足 2-Lipschitz 约束,从而获得形式化证书,并在多个图像基准数据集上实现了认证准确率提升最高达 7.3%、经验鲁棒性提升最高达 3.1%,且不损失干净准确率。
链接: https://arxiv.org/abs/2605.01519
作者: Joy Dhar,Song Xia,Manish Kumar Pandey,Maryam Haghighat,Azadeh Alavi,Ferdous Sohel,Wenyu Zhang,Nayyar Zaidi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Hybrid Convolutions with Attention Stochasticity (HyCAS), an adversarial defense that narrows the long-standing gap between provable robustness under L2 certificates and empirical robustness against strong L attacks, while preserving strong generalization across diverse imaging benchmarks. HyCAS unifies deterministic and randomized principles by coupling 1-Lipschitz, spectrally normalized convolutions with two stochastic components, spectral normalized random, projection filters and a randomized attention-noise mechanism, to realize a randomized defense. Injecting smoothing randomness inside the architecture yields an overall = 2-Lipschitz network with formal certificates. Exten-sive experiments on diverse imaging benchmarks, including CIFAR-10/100, ImageNet-1k, NIH Chest X-ray, HAM10000, show that HyCAS surpasses prior leading certified and empirical defenses, boosting certified accuracy by up to 7.3% (on NIH Chest X-ray) and empirical robustness by up to 3.1% (on HAM10000), without sacrificing clean accuracy. These results show that a randomized Lipschitz constrained architecture can simultaneously improve both certified L2 and empirical L adversarial robustness, thereby supporting safer deployment of deep models in high-stakes applications. Code: this https URL
[CV-138] VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation ICML2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 在文本到可缩放矢量图形(Scalable Vector Graphics, SVG)动画生成任务中的两大核心挑战:一是如何在离散的代码表示与连续的视觉动态之间建立有效映射,二是如何在保持 SVG 文档对象模型(DOM)结构完整性的同时实现几何层面的非刚性形变建模。现有基于优化的方法常破坏拓扑一致性,而通用大语言模型(Large Language Models, LLMs)依赖僵化的 CSS/SMIL 变换,无法捕捉精细的几何运动。解决方案的关键在于提出 VAnim 框架,其创新性地将动画视为对持久 SVG DOM 树的稀疏状态更新(Sparse State Updates, SSU),从而将序列长度压缩超过 9.8 倍并天然保留未参与变化的元素;同时引入“识别优先”运动规划机制(Identification-First Motion Planning),通过显式视觉实体锚定文本指令以实现精确控制,并采用渲染感知的强化学习策略优化(Rendering-Aware Reinforcement Learning via Group Relative Policy Optimization, GRPO),结合视频感知编码器提供的混合奖励信号,使离散代码更新与高保真视觉反馈对齐。
链接: https://arxiv.org/abs/2605.01517
作者: Guotao Liang,Zhangcheng Wang,Chuang Wang,Juncheng Hu,Haitao Zhou,Junhua Liu,Jing Zhang,Dong Xu,Qian Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026. Project page: this https URL
Abstract:Scalable Vector Graphics (SVG) animation generation is pivotal for professional design due to their structural editability and resolution independence. However, this task remains challenging as it requires bridging discrete code representations with continuous visual dynamics. Existing optimization-based methods often destroy topological consistency, while general-purpose LLMs rely on rigid CSS/SMIL transformations, failing to model geometry-level non-rigid deformations. To address these limitations, we present VAnim, the first LLM-based framework for open-domain text-to-SVG animation. We reconceptualize animation not as sequence generation, but as Sparse State Updates (SSU) on a persistent SVG DOM tree. This paradigm compresses sequence length by over 9.8x while preserving the SVG DOM structure and non-participating elements by construction. To enable precise control, we propose an Identification-First Motion Planning mechanism that grounds textual instructions in explicit visual entities. Furthermore, to overcome the non-differentiable nature of SVG rendering, we employ Rendering-Aware Reinforcement Learning via Group Relative Policy Optimization (GRPO). By leveraging a hybrid reward from a state-of-the-art video perception encoder, we align discrete code updates with high-fidelity visual feedback. We also introduce SVGAnim-134k, the first benchmark for vector animation. Extensive experiments demonstrate that VAnim significantly outperforms state-of-the-art baselines in semantic alignment and structural validity, with additional appendix metrics further validating motion quality and identity preservation.
[CV-139] wo-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video
【速读】:该论文旨在解决真实监控视频中交通事故的联合定位问题(即同时精确定位事故发生的时间、空间位置和碰撞类型),这是一个罕见事件(rare-event)问题,通常因标注数据获取困难而无法进行端到端的监督训练。其核心解决方案是提出一种无需微调(no-fine-tuning)的流水线方法,关键在于两个创新:一是采用“粗粒度到细粒度”的两阶段分解策略——先以1 fps全视频扫描生成初步时空坐标与类型预测(t, x, y, c),再在±3秒窗口内以5 fps精细重审并结合两个确定性置信度门控机制防止边界异常;二是引入专用角色分工机制,由Qwen3-VL-Plus负责视频片段的视觉接地(grounding),Gemini 3.1 Flash-Lite负责中心裁剪片段的碰撞类型识别(typing)。该方案在ACCIDENT@CVPR 2026基准上实现了显著性能提升,达到ACC^S = 0.539,远超基线模型。
链接: https://arxiv.org/abs/2605.01512
作者: Jiantang Huang
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4+R pages, 2 figures, 3 tables
Abstract:Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a +/- 3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper’s best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~ 20.
[CV-140] SwiftPie: Lightning-fast Subject-driven Image Personalization via One step Diffusion CVPR26
【速读】:该论文旨在解决当前基于扩散模型(diffusion models)的图像个性化生成方法在实际应用中效率低下的问题,尤其是现有方法普遍依赖计算密集型微调、迭代优化或多步去噪过程,严重限制了其在实时交互场景中的部署能力。解决方案的关键在于提出SwiftPie——首个单步扩散图像个性化工具,其核心创新是引入一种新颖的双分支身份注入机制(dual-branch identity injection mechanism),能够高效地将主体身份信息嵌入到单步扩散过程中;同时结合掩码引导的重缩放策略(mask-guided rescaling strategy),进一步提升主体在单次扩散步骤内的上下文一致性与保真度,从而实现高速且高质量的个性化图像生成。
链接: https://arxiv.org/abs/2605.01510
作者: Huy Duong,Trong-Tung Nguyen,Cuong Pham,Anh Tran,Khoi Nguyen,Minh Hoai
机构: Qualcomm AI Research (Qualcomm人工智能研究); Posts Telecommunication Institute of Technology (Posts电信技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR26 Finding
Abstract:Diffusion models have achieved remarkable success in high-quality image synthesis, sparking interest in image-guided generation tasks such as subject-driven image personalization. Despite their impressive personalization results, existing methods typically rely on computationally intensive fine-tuning, iterative optimization, or multi-step denoising processes, which significantly hinder their deployment and interactive capability in real-time applications. In this work, we present SwiftPie, the first one-step diffusion image personalization tool that enables lightning-fast generation of personalized images. SwiftPie introduces a novel dual-branch identity injection mechanism that effectively integrates subject identity into a one-step diffusion model. In addition, we incorporate a mask-guided rescaling strategy to further enhance subject contextualization within a single diffusion step. Extensive experiments demonstrate that SwiftPie not only delivers superior image personalization speed but also achieves comparable performance with multi-step approaches in both identity fidelity and prompt alignment. This work opens new opportunities for real-time, high-quality personalized image generation, paving the way for interactive visual synthesis.
[CV-141] OmniEncoder: See Hear and Feel Continuous Motion Like Humans With One Encoder
【速读】:该论文旨在解决当前多模态大语言模型在视觉与音频联合理解中因模态特异性编码器导致的感知割裂问题,即现有架构采用“视频粗粒度、音频细粒度”的采样策略(如视频帧以1–2 fps采样,音频波形以25 fps处理),使得模型无法实现人类般的整体化感知,且在编码阶段缺乏充分的跨模态交互,难以捕捉精细的视觉运动信息。其解决方案的关键在于提出Omni-Encoder——一个统一的Transformer骨干网络,能够在共享潜在空间中以对称的25 fps频率同步嵌入视觉和音频信号,并通过三项核心技术实现优化:Omni-Encoder Token Template(统一令牌模板)、Omni-RoPE(改进的位置编码机制)以及Temporal Window Shifting(时间窗口移位策略),从而在保持计算效率的同时增强模态解耦能力与跨模态交互深度。实验表明,该方案在细粒度视觉连续理解任务(如手语识别、体育动作分析)上显著优于基线模型Qwen2.5-Omni,同时在标准音视频基准测试(如AVQA、说话人识别与定位)上保持竞争力,验证了统一多模态编码对于构建更贴近人类感知整合机制的多模态模型具有重要意义。
链接: https://arxiv.org/abs/2605.01506
作者: Detao Bai,Shimin Yao,Weixuan Chen,Chengen Lai,Yuanming Li,Zhiheng Ma,Xihan Wei
机构: Tongyi Lab Alibaba Group; Shenzhen University of Advanced Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a \emphvideo-coarse, audio-dense design – sampling visual frames at 1–2 fps while processing audio waveforms at 25 fps – resulting in systems that perceive video \emphframe by frame, modality by modality rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present \textbfOmni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. This architecture leverages three core innovations – the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting – to effectively reconcile the dual challenges of modality disentanglement and computational efficiency. Experiments demonstrate that, compared to the modality-specific baseline Qwen2.5-Omni under the same input token budget to the LLM decoder, Omni-Encoder delivers substantial gains on visual continuous understanding tasks – such as sign language recognition and fine-grained sports action analysis – while maintaining competitive performance on established audio-visual benchmarks such as AVQA and Speaker Identification and Localization. These results suggest that unified omnivorous encoding offers a promising direction for building omni-modal models that more closely reflect the integrated nature of human perception.
[CV-142] RADMI: Latent Information Aggregation as a Proxy for Model Uncertainty ICIP2026
【速读】:该论文旨在解决深度学习系统在密集预测任务(如分割)中难以高效估计认知不确定性(epistemic uncertainty)的问题。现有方法通常依赖计算成本高昂的集成方法或多轮随机前向传播,难以扩展至高分辨率场景。其解决方案的关键在于提出一种单次前向传播的方法——分辨率聚合解码器互信息(RADMI),通过测量编码器-解码器结构中连续解码层之间的互信息(mutual information, MI)来量化不确定性;研究发现,层间MI升高与预测不确定性正相关,尤其在类别边界等模糊区域,网络需整合冲突上下文信息。RADMI无需架构修改即可生成边界敏感的高精度不确定性图,并在地震相位分割基准上显著优于其他单次传播方法,在皮尔逊和斯皮尔曼相关系数上分别提升5.5%和10.7%。
链接: https://arxiv.org/abs/2605.01502
作者: William Stevens,Mohit Prabhushankar,Ghassan AlRegib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures, 3 tables, accepted to IEEE ICIP 2026
Abstract:Epistemic uncertainty estimation is essential for identifying regions where deep learning system outputs may be unreliable. However, existing approaches require computationally expensive ensemble methods or multiple stochastic forward passes, limiting their scalability to dense prediction tasks like segmentation. We propose Resolution-Aggregated Decoder Mutual Information (RADMI), a single-pass method that estimates prediction uncertainty by measuring mutual information (MI) between consecutive decoder layers in segmentation networks. We observe that elevated inter-layer MI correlates with prediction uncertainty, as the network must integrate conflicting contextual information at ambiguous regions such as class boundaries. Evaluating on a seismic facies segmentation benchmark, RADMI achieves the highest correlation with deep ensemble uncertainty among all single-pass methods, outperforming the next-best baselines by 5.5% in Pearson and 10.7% in Spearman correlation coefficients. Compared to baselines that either lack spatial precision or demand significant computational overhead, RADMI yields sharp, boundary-localized uncertainty maps without architectural modifications. Our results suggest that linear aggregation of normalized information flow provides a principled and efficient proxy for prediction uncertainty in encoder-decoder architectures.
[CV-143] owards Visual Query Localization in the 3D World CVPR2026
【速读】:该论文旨在解决三维空间中视觉查询定位(3D Visual Query Localization, 3DVQL)问题,即在包含点云、RGB图像和深度图等多模态数据的3D序列中,准确定位与给定查询最相关的时空响应区域。当前研究主要集中在二维视频中的视觉查询定位,而对三维场景下的此类任务关注甚少。解决方案的关键在于提出了首个面向3D多模态视觉查询定位的基准数据集3DVQL,其包含2002个高质量标注的序列、约17万帧及6400段响应轨迹,并设计了一种名为LaF(lift-and-attention fusion)的新型融合算法,通过提升特征表示能力和注意力机制显著优于现有基线模型,从而推动该领域的发展。
链接: https://arxiv.org/abs/2605.01498
作者: Liang Peng,Bohan Tan,Zhipeng Zhang,Haobo Li,Yifan Jiao,Xingping Dong,Libo Zhang
机构: Wuhan University (武汉大学); AutoLab, SAI, Shanghai Jiao Tong University (上海交通大学智能汽车研究所); Anyverse Dynamics; University of Chinese Academy of Sciences (中国科学院大学); Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. 8 pages
Abstract:Visual query localization (VQL) aims to predict the spatio-temporal response of the most recent occurrence in a sequence given a query. Currently, most research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to address visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal visual query localization. To facilitate comparison in subsequent research, we implement a series of representative 3D multimodal VQL baselines using point clouds and RGB images. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift-and-attention fusion algorithm named LaF, which significantly outperforms existing baseline models. Our benchmark and model will be publicly released at this https URL.
[CV-144] SF20K Competition 2025: Summary and findings
【速读】:该论文旨在解决长视频理解中故事层面的问答(story-level video question-answering)问题,突破传统短片段动作识别的局限,推动模型对长时序、多镜头叙事结构的理解能力。其解决方案的关键在于构建了一个基于业余短片的开放性视频问答任务(open-ended video question-answering task),并采用基于GPT-4.1-nano的自动化评判系统LLM-QA-Eval进行评估,从而避免模型依赖于对热门电影的记忆。实验表明,叙事感知的逐镜头处理优于均匀采样帧的方法;设计良好的多阶段小模型流水线可媲美甚至超越参数量大30倍以上的端到端模型;而字幕质量成为影响性能的主要因素之一,说明当前瓶颈在于信息选择与推理结构的设计,而非单纯模型容量。
链接: https://arxiv.org/abs/2605.01496
作者: Ridouane Ghermi,Xi Wang,Vicky Kalogeiton,Ivan Laptev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.
[CV-145] CGFformer: Cluster-Guidance Frequency Transformer for Pansharpening
【速读】:该论文旨在解决当前基于频率的全色锐化(pansharpening)方法中存在的两个核心问题:一是主流方法采用固定频率滤波器,难以适应全色(PAN)与多光谱(MS)图像中复杂且空间分布多变的频率特性;二是现有去噪策略未能充分挖掘频域信息,难以精准抑制多种噪声类型。解决方案的关键在于提出CGFformer模型,其核心创新包括:1)设计自适应分离模块,通过K均值聚类融合局部特征与非局部信息,实现高频与低频成分的更精确分离;2)引入双流精修模块结合基于Transformer的交叉注意力机制,协同抑制与频率相关的和无关的干扰噪声;3)构建频域-空域融合模块,增强细节保留并促进频域与空间信息的交互,从而提升融合结果的空间结构重建效果。
链接: https://arxiv.org/abs/2605.01490
作者: Zijian Zhou,Jianing Zhang,Kai Sun,Xiangyu Zhao,Chunxia Zhang,Xiangyong Cao
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 35 pages, 12 pages
Abstract:Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images. However, the current mainstream frequency-based pansharpening methods employ fixed frequency filters, which cannot precisely adapt to complex and spatially diversified frequency distributions in PAN and MS images. Furthermore, existing denoising strategies insufficiently exploit frequency components for denoising and struggle to suppress various noise types accurately. To address these challenges, we propose CGFformer, a cluster-guidance frequency Transformer that focuses on varying frequency distribution and interactions between frequency and spatial components. Specifically, we design an adaptive separation module that integrates local features and non-local information through K-means clustering, enabling more precise separation of high- and low-frequency components. Subsequently, we introduce a dual-stream refinement module combined with Transformer-based cross-attention to remove various noise, allowing the network to jointly suppress frequency-relevant and irrelevant disturbances. In addition, we develop a frequency-spatial fusion module designed to enhance detail and facilitate spatial-frequency interaction, ensuring more effective reconstruction of spatial structures in the fused results. Extensive experiments on multiple benchmark datasets demonstrate that the proposed CGFformer achieves notable improvements over existing pansharpening approaches.
[CV-146] Research on Vision-Language Question Answering Models for Industrial Robots
【速读】:该论文旨在解决工业机器人视觉-语言问答(Vision-Language Question Answering, VLQA)中的语义模糊性、复杂环境布局以及领域特定术语等挑战。其解决方案的关键在于提出了一种分层跨模态融合模型,通过集成先进的目标检测、多尺度视觉编码、句法解析与任务感知语义注意力机制,将视觉与语言信号统一到联合推理空间中;具体而言,基于区域的深度网络提取视觉特征,加权嵌入聚合与循环神经网络句法解析共同构建结构化语义表示,并借助自适应融合与交叉注意力机制实现细粒度语义对齐,从而显著提升系统在操作查询、指令步骤识别和异常检测任务中的可靠性与鲁棒性。
链接: https://arxiv.org/abs/2605.01483
作者: Ping Li,Bartlomiej Brzozka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 Pages, 5 figures
Abstract:A hierarchical cross-modal fusion model is proposed for vision-language question answering (VLQA) in industrial robotics, targeting the challenges of semantic ambiguity, complex environmental layouts, and domain-specific terminology common in modern manufacturing. The framework integrates advanced object detection, multi-scale visual encoding, syntactic parsing, and task-aware semantic attention to unite vision and language signals into a joint reasoning space. Region-based deep networks extract visual features, weighted embeddings aggregate, and recurrent neural parsing encodes sentence structures. Through fine-grained semantic alignment driven by adaptive fusion and cross-attention mechanisms, the system can handle operational queries, instruction steps, and anomaly detection with higher reliability. Compared to the existing VLQA benchmarks, validation experiments conducted on the IVQA and RIF benchmarks indicate improvements in semantic alignment, Top-1 accuracy, and robustness to ambiguous or procedural task queries. Ablation studies further quantify the impact of each architectural module, confirming the necessity of multi-level feature integration and context-driven gating for dependable industrial deployment. The technical advancements reported here provide core methodologies to improve the interpretability and operational effectiveness of industrial robots faced with diverse human-robot interaction tasks.
[CV-147] AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT
【速读】:该论文旨在解决训练-free图像编辑(training-free image editing)中如何高效、稳定地利用多模态扩散Transformer(MMDiT)架构实现高质量编辑的问题。其核心挑战在于:传统方法如MasaCtrl在MMDiT上易出现提示不匹配(prompt-mismatch)失效问题,且缺乏对不同编辑类型自适应选择最优注意力操作的能力。解决方案的关键在于提出两个创新机制:一是KVInject,一种单前向传播的注意力注入方法,通过在局部层/步带内将源图像键值(Key/Value)投影以alpha混合方式注入噪声侧,避免了两阶段处理和提示失配;二是AttnRouter,一个基于类别路由的注意力调度机制,根据编辑类型动态选择最能保留源结构的注意力操作,即使使用零样本CLIP分类器也能逼近人工标注路由的性能。实验表明,早期去噪步骤(S0–7)和早期层带(L0–15)中的K/V注入最为有效,且α∈[0.3, 0.5]为稳定最优区间,同时揭示了UNet经验规则(如简单K/V缩放)在MMDiT中并不适用。
链接: https://arxiv.org/abs/2605.01480
作者: Guandong Li,Mengxia Ye
机构: iFLYTEK; Aegon THTF
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
Abstract:We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing entirely; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.
[CV-148] CSGuard: Toward Forgery-Resistant Watermarking in Diffusion Models via Compressed Sensing Constraint
【速读】:该论文旨在解决基于潜在空间的扩散模型水印(latent-based diffusion model watermarking)在面对伪造攻击(forgery attack)时的脆弱性问题,即攻击者可通过图像反演并使用任意提示词重新生成图像来提取水印,从而实现恶意内容的虚假归属。解决方案的关键在于提出 CSGuard,这是首个具备抗伪造能力的水印方案,其核心机制是利用压缩感知(compressed sensing)技术将图像生成与验证过程绑定至一个秘密矩阵(secret matrix),确保仅持有该矩阵的合法用户才能正确嵌入或验证水印,从而有效阻止非法用户的伪造行为,同时不损害生成质量与水印完整性。
链接: https://arxiv.org/abs/2605.01479
作者: Jiewei Lai,Lan Zhang,Chen Tang,Pengcheng Sun,Zhaopeng Zhang,Yunhao Wang,Hui Jin
机构: University of Science and Technology of China(中国科学技术大学); Institute of Artificial Intelligence(人工智能研究院); Hefei Comprehensive National Science Center(合肥综合性国家科学中心); Lenovo Research(联想研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Latent-based diffusion model watermarking embeds watermarks into generated images’ latent space to enable content attribution, offering a training-free solution for intellectual property protection and digital forensics. However, these methods exhibit a critical vulnerability to the forgery attack, attackers can extract the watermark by inverting the watermarked image and re-generating it with an arbitrary prompt, thereby enabling false attribution on malicious content. In this paper, we propose the CSGuard, the first forgery-resistant watermarking schema that leverages compressed sensing to bind the watermarked image generation and verification to a secret matrix. This ensures that only users possessing the secret matrix can correctly embed or verify the image watermark, prevents the illegal users from forgery without compromising generation quality and watermark integrity. Experimental results demonstrate that CSGuard achieves strong forgery resistance, reduces the attack success rate from 100.0% to 28.12%, and achieve 100% detection rate on benign watermarked images without compromising watermarking effectiveness.
[CV-149] LIE: LiDAR-only HD Map Construction with Intensity Enhancement via Online Knowledge Distillation ITSC
【速读】:该论文旨在解决纯LiDAR数据在高精地图(HD map)语义分割中因缺乏密集语义和纹理信息而导致的性能瓶颈问题。现有方法多依赖多视角相机图像以获取语义信息,但相机无法提供精确的深度信息;而LiDAR虽具备高精度3D几何测量能力,却难以直接生成细粒度语义标签。解决方案的关键在于提出一种基于知识蒸馏(Knowledge Distillation, KD)的LiDAR-only语义地图构建方法——LIE,其核心创新是设计了一个教师分支,通过融合学生LiDAR特征与对应的二维强度图(intensity map tile),实现在线蒸馏(online distillation)机制,从而为LiDAR提供密集监督信号,显著提升语义分割精度。实验表明,该方法在nuScenes数据集上相较最先进相机模型mIoU提升8.2%,且在长距离、恶劣天气和光照条件下具有鲁棒性,并能高效适应Argoverse2数据集,仅需10%的微调即可超越全量训练的相机模型。
链接: https://arxiv.org/abs/2605.01478
作者: Kanak Mazumder,Fabian B. Flohr
机构: Munich University of Applied Sciences (慕尼黑应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work has been accepted for publication in International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2026. The final published version will be available via IEEE Xplore
Abstract:Online High-Definition (HD) map construction is a key component of autonomous driving. Recent methods rely on multi-view camera images for cost-effective HD map segmentation, but cameras lack depth information for accurate scene geometry. In contrast, LiDAR provides precise 3D measurements but lacks dense semantic cues. In this work, we propose LIE, LiDAR-only semantic map construction method that employ Knowledge Distillation (KD) to handle the lack of dense semantic and texture cues. Specifically, the teacher branch fuses student LiDAR features and the corresponding 2D intensity map tile to provide dense supervision for segmenting map elements using online distillation scheme. Experimental results show that our method outperforms all single-modality approaches, achieving 8.2% higher mIoU than the state-of-the-art camera-based model on nuScenes. LIE is robust over long ranges and under challenging weather and lighting, and efficiently adapts to Argoverse2 with only 10% fine-tuning, surpassing camera-based models trained on the full dataset. Source code will be available \hrefthis https URLhere.
[CV-150] Decision Boundary-aware Generation for Long-tailed Learning CVPR2026
【速读】:该论文旨在解决长尾数据分布下分类器决策边界偏向头部类别、导致尾部类别准确率下降的问题。现有方法如基于扩散模型的生成增强虽能缓解数据稀缺问题,但其生成样本仍受原始长尾分布偏差影响;进一步引入的“从头到尾迁移”策略虽有助于平衡分类器的决策空间,却会引发潜在的非局部特征混杂(latent non-local feature mixing),造成类间特征纠缠和决策边界重叠,进而引起尾部类别分布偏移。论文的关键解决方案是提出决策边界感知生成(Decision Boundary-aware Generation, DBG)框架,通过生成靠近决策边界的有信息量样本,强化边界附近的表征学习,从而在重平衡数据集的同时,提升决策空间的可分性,显著减少类间重叠并改善尾部类别性能。
链接: https://arxiv.org/abs/2605.01468
作者: Jiacheng Yang,Ruichi Zhang,Chikai Shang,Mengke Li,Xinyi Shang,Junlong Gao,Yonggang Zhang,Yang Lu
机构: Xiamen University (厦门大学); Shenzhen University (深圳大学); University College London (伦敦大学学院); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026
Abstract:Long-tailed data bias decision boundaries toward head classes and degrade tail class accuracy. Diffusion-based generative augmentation address this problem by generating additional data, while head-to-tail transfer further mitigate the generator bias inherit from long-tailed dataset. However, we show that while head-to-tail transfer helps balance the decision space of the classifier, it also induces latent non-local feature mixing that entangles inter-class features, causing decision boundary overlap and tail class distribution shift. To address this, we first identify the problem of boundary ambiguity and then propose Decision Boundary-aware Generation (DBG) framework, which promotes near-boundary representation learning by generating informative near-boundary samples. Overall, DBG rebalances the long-tailed dataset while yielding more separable decision space for long-tailed learning. Across standard long-tailed benchmarks, DBG consistently improves tail class and overall accuracy with less inter-class overlap. The code of DBG is available at this https URL.
[CV-151] SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion ICML2026
【速读】:该论文旨在解决多模态学习中点云补全(Point Cloud Completion, PCC)任务里跨模态连接不明确的问题,特别是传统硬投影(hard projection)导致的视觉先验传播受阻现象,即所谓的跨模态熵崩溃(Cross-Modal Entropy Collapse)。其解决方案的关键在于提出 SplAttN 方法,通过将硬投影替换为可微高斯点绘(Differentiable Gaussian Splatting),在图像平面上生成稠密且连续的表示,从而避免稀疏支撑结构的坍塌,增强梯度流动性和跨模态连接的学习能力。实验表明,该方法在 PCN 和 ShapeNet-55/34 数据集上达到最优性能,并在 KITTI 真实世界基准测试中验证了其对视觉线索的稳定依赖性,显著优于基线方法。
链接: https://arxiv.org/abs/2605.01466
作者: Zhaoyang Li,Zhichao You,Tianrui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as a Spotlight paper at ICML 2026; camera-ready version
Abstract:Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross-Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image-plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross-modal connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at this https URL.
[CV-152] SRGAN-CKAN: Expressive Super-Resolution with Nonlinear Functional Operators under Minimal Resources
【速读】:该论文旨在解决单图像超分辨率(Single-Image Super-Resolution, SISR)任务中高倍放大时高频细节严重退化的问题,尤其针对现有基于Transformer和扩散模型的方法在提升全局上下文建模能力和感知质量的同时带来计算复杂度显著增加的局限性。其解决方案的关键在于提出一种融合卷积型Kolmogorov–Arnold网络(Convolutional Kolmogorov–Arnold Networks, CKAN)的混合超分辨率框架SRGAN–CKAN,通过将传统线性局部映射替换为基于样条函数的非线性局部变换,以最小的硬件资源实现对复杂局部结构和高频纹理的高表达力建模,从而在受限计算条件下同时提升重建保真度与感知质量。
链接: https://arxiv.org/abs/2605.01459
作者: Roberto Isai Navaro-Aviña,Eduardo Said Merin-Martinez,Andres Mendez-Vazquez,Eduardo Rodriguez-Tello
机构: Cinvestav, Unidad Guadalajara (国家研究与高级技术学院,瓜达拉哈拉分校); Cinvestav, Unidad Tamaulipas (国家研究与高级技术学院,塔毛利帕斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Single-Image Super-Resolution (SISR) aims to reconstruct a High-Resolution (HR) image from a Low-Resolution (LR) observation, a fundamentally ill-posed problem where high-frequency details are severely degraded at large upscaling factors. Recent advances have been driven by transformer-based architectures and diffusion models improve global context modeling and perceptual quality at the cost of increased computational complexity. In contrast, this work focuses on enhancing the expressivity of local operators under minimal resources. We propose SRGAN–CKAN, a hybrid super-resolution framework that integrates Convolutional Kolmogorov–Arnold Networks (CKAN) into an adversarial learning setting reformulating convolution as a nonlinear patch-based transformation. The proposed operator replaces linear local mappings with spline-based functional representations, allowing expressive modeling of complex local structures and high-frequency textures using minimal hardware resources. Experimental results demonstrate that the proposed approach improves perceptual quality while preserving reconstruction fidelity, achieving a favorable balance between distortion-based and perceptual metrics. These results are obtained under constrained computational settings, highlighting the efficiency of the proposed formulation. Overall, this work introduces a complementary direction to existing approaches by improving the representational power of local transformations, providing an efficient and scalable alternative to globally intensive architectures.
[CV-153] Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence CVPR2026
【速读】:该论文旨在解决当前基于学习的三维人脸重建方法(如ToFu和TEMPEH)依赖于耗时的手动注册数据作为训练监督的问题,从而限制了其自动化程度与可扩展性。解决方案的关键在于提出MOCHI(Multi-view Optimizable Correspondence of Heads from Images),该框架通过引入伪线性逆运动学求解器强制拓扑一致性,消除了对已注册训练数据的依赖;同时利用仅在合成数据上训练的2D关键点检测器引导语义对齐,并采用基于点映射(pointmap)和法向量(normal)的损失函数替代传统的点到面距离损失,以提升训练稳定性与重建保真度;此外,在测试阶段引入优化策略对网络权重进行少量迭代微调,实现了前馈效率与迭代精度之间的平衡,显著优于传统人工密集型注册流程。
链接: https://arxiv.org/abs/2605.01450
作者: Panagiotis P. Filntisis,George Retsinas,Radek Daněček,Vanessa Sklyarova,Petros Maragos,Timo Bolkart
机构: Athena Research Center (雅典研究中心); NTUA (国家技术大学); HERON – Hellenic Robotics Center of Excellence (赫利安机器人卓越中心); MPI for Intelligent Systems (马克斯普朗克智能系统研究所); ETH Zurich (苏黎世联邦理工学院); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, CVPR 2026
Abstract:Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods rely on the slow, manual registration pipelines they aim to replace for their training supervision. We overcome this limitation with MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a multi-view 3D face prediction framework trained without requiring registered training data. MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematic solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Our analysis further reveals that standard point-to-surface distances induce training instabilities and visual artifacts in registration-free settings. We propose pointmap- and normal-based losses instead, which provide smoother gradients and superior reconstruction fidelity. Finally, we introduce a test-time optimization scheme that refines network weights over a few dozen iterations. This approach bridges the gap between feed-forward efficiency and iterative optimization precision, allowing MOCHI to outperform traditional labor-intensive pipelines in both reconstruction accuracy and visual quality. Code and model are public at: this https URL.
[CV-154] Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation ICML2026
【速读】:该论文旨在解决开放世界机器人操作中的跨任务泛化问题,其核心挑战在于如何从已见任务中提取可迁移的操作知识。现有基于上下文学习的方法仅提供低层次的连续动作序列作为上下文,难以捕捉可组合的技能(skill)知识,导致模型退化为表面轨迹模仿。解决方案的关键在于提出“分解与重构”(Decompose and Recompose)框架,通过将已见任务演示分解为可解释的原子技能-动作对(atomic skill-action pairs)作为中间表示,使模型能够基于组合推理对未见任务进行技能重组与执行顺序规划。具体而言,该方法结合视觉语义检索与规划代理生成的技能序列构建任务自适应动态演示库,并辅以覆盖感知的静态库填补缺失的技能模式,从而生成具备技能全面性的演示数据,显式激发模型的组合推理能力,实现零样本跨任务泛化。
链接: https://arxiv.org/abs/2605.01448
作者: Xitie Zhang,Aming Wu,Yahong Han
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026
Abstract:Cross-task generalization is a core challenge in open-world robotic manipulation, and the key lies in extracting transferable manipulation knowledge from seen tasks. Recent in-context learning approaches leverage seen task demonstrations to generate actions for unseen tasks without parameter updates. However, existing methods provide only low-level continuous action sequences as context, failing to capture composable skill knowledge and causing models to degenerate into superficial trajectory imitation. We propose Decompose and Recompose, a skill reasoning framework using atomic skill-action pairs as intermediate representations. Our approach decomposes seen demonstrations into interpretable skill–action alignments, enabling the model to recompose these skills for unseen tasks through compositional reasoning. Specifically, we construct a task-adaptive dynamic demonstration library via visual-semantic retrieval combined with skill sequences from a planning agent, complemented by a coverage-aware static library to fill missing skill patterns. Together, these yield skill-comprehensive demonstrations that explicitly elicit compositional reasoning for skill composition and execution ordering. Experiments on the AGNOSTOS benchmark and real-world environments validate our method’s zero-shot cross-task generalization capability.
[CV-155] Recall to Predict: Grounding Motion Forecasting in Interpretable Motion Bank
【速读】:该论文旨在解决运动预测(motion forecasting)中普遍存在的可解释性与预测准确性之间的权衡问题。传统基于锚点(anchor-based)的架构依赖于难以解释的潜在查询(latent queries),易发生潜在空间坍缩(latent collapse),或采用简单的轨迹采样方式,限制了多模态预测的多样性。其解决方案的关键在于提出一个端到端可微框架,通过对比学习构建一个结构化的“运动库”(motion bank),即物理上可行轨迹的嵌入空间;并设计了一个新颖的锚点检索层(Anchor Retrieval Layer),利用双层门控交叉注意力机制动态检索语义明确的运动先验,并通过Straight-Through Gumbel-Softmax估计器实现离散轨迹选择以保持梯度连续性;最终结合DETR式解码器、WTA动力学高斯混合模型(GMM)、潜在多样性惩罚项和软最小加权终点损失进行联合优化,从而在保证预测多样性的同时显著提升可解释性与多模态精度。
链接: https://arxiv.org/abs/2605.01393
作者: Abhishek Vivekanandan,Ahmed Abouelazm,J. Marius Zöllner
机构: FZI Forschungszentrum Informatik (FZI Forschungszentrum Informatik); Karlsruhe Institute of Technology (KIT)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Sumitted for PeerReview
Abstract:Motion forecasting often requires trading interpretability for predictive accuracy. Standard anchor-based architectures rely on opaque latent queries that are highly prone to latent collapse, or naive trajectory sampling that limits multi-modal diversity. We propose an end-to-end differentiable framework that grounds predictions in a comprehensive “motion bank”, a structured embedding space of physically realizable trajectories constructed via contrastive learning. Rather than regressing paths from a blank slate, our architecture dynamically retrieves explicit motion priors using a novel Anchor Retrieval Layer. This module adapts orthogonally initialized queries via a Dual-Level Gated Cross-Attention mechanism and executes discrete trajectory selection using a Straight-Through Gumbel-Softmax estimator to preserve continuous gradient flow. The retrieved semantically grounded anchors are then geometrically refined by a DETR-style decoder, optimized jointly with a Winner-Takes-All (WTA) kinematic Gaussian Mixture Model (GMM), a latent diversity penalty, and a soft-min weighted endpoint loss. By strictly conditioning the decoding phase on diverse, interpretable motion primitives, our approach eliminates the “black box” of standard latent queries while achieving competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets. Code is available at: this https URL
[CV-156] VISTA: Video Interaction Spatio-Temporal Analysis Benchmark CVPR2026
【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)评估基准在视频理解能力上的局限性,即当前评测主要集中在简单单动作视频、封闭属性集和受限实体类型上,无法刻画现实世界中多样实体间自由形式的多动作交互。此外,缺乏系统性的框架来分析模型在空间与时间轴上的失败模式,导致对模型性能的理解不全面。解决方案的关键在于提出VISTA(Video Interaction Spatio-Temporal Analysis)基准,这是一个面向开放集、多实体和多动作的时空理解评测体系;其核心创新是将视频解构为可解释的实体、关联动作及关系动态,并构建统一的交互感知分类体系,从而实现对关系、空间和时间理解能力的多轴诊断与综合评估,首次提供大规模、交互感知的诊断性评测框架,推动VLM在复杂视频场景中的模型设计与训练策略优化。
链接: https://arxiv.org/abs/2605.01391
作者: Alejandro Aparcedo,Akash Kumar,Aaryan Garg,Dalton Pham,Wen-Kai Chen,Anirudh Bharadwaj,Aman Chadha,Yogesh Rawat
机构: University of Central Florida (中佛罗里达大学); BITS Pilani (比茨理工学院); Ho Chi Minh City University of Science (胡志明市科学大学); Google DeepMind (谷歌深度大脑)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Workshop on Pixel-level Video Understanding in the Wild (PVUW)
Abstract:Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities which characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first, large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.
[CV-157] Sparse Representation Learning for Vessels
【速读】:该论文旨在解决在临床分辨率下对整个器官级别的血管网络进行高效建模与分析的难题,现有方法受限于小区域或简化树状结构,难以处理全器官尺度的复杂血管拓扑。解决方案的关键在于提出VAEsselSparse——一种基于稀疏卷积和注意力机制的编码器-解码器模型,通过利用3D血管结构的固有稀疏性,在保持亚毫米级分辨率的同时实现8×8×8的空间压缩率,从而获得紧凑且语义丰富的潜在表示空间;该空间不仅支持高精度重建,还保留了可用于分类任务(如动脉瘤/狭窄、Willis环亚型)的临床判别特征,并可作为生成模型学习血管特异性先验的基础,实现真实血管结构的合成。
链接: https://arxiv.org/abs/2605.01382
作者: Chinmay Prabhakar,Bastian Wittmann,Paul Büschl,Hongwei Bran Li,Bjoern Menze,Suprosanna Shit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Analyzing human vasculature and vessel-like, tubular structures, such as airways, is crucial for disease diagnosis and treatment. Current methods often rely on small sub-regions or simplified tree-like structures, rendering analysis of entire organ-level networks at clinical resolution computationally challenging. To this end, we propose VAEsselSparse, an efficient encoder-decoder model to obtain a meaningful yet compact representation of the entire organ-level vascular network at sub-millimeter resolution. VAEsselSparse leverages the inherent sparsity of 3D vascular structures via sparse convolutions and attention mechanisms, achieving substantial spatial compression rates of 8 x 8 x 8. We demonstrate superior reconstruction performance compared to dense counterparts and previous methods. Importantly, the resulting latent space retains clinically relevant discriminative features readily usable for classification tasks, such as aneurysm/stenosis or subvariants of the circle of Willis. Moreover, the compact latent space of VAEsselSparse serves as an effective representation for learning vessel-specific priors through generative models, enabling the synthesis of realistic vasculature.
[CV-158] VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
【速读】:该论文旨在解决开放词汇3D交互能力检测(open-vocabulary 3D affordance detection)中因自回归生成机制导致的空间定位精度不足的问题。现有方法利用多模态大语言模型(MLLMs)输出特殊标记并解码为分割掩码,但这些标记虽语义丰富,却缺乏对空间邻域关系的有效建模,从而限制了3D局部化性能。其解决方案的关键在于提出VoxAfford框架,通过在生成后将来自冻结预训练3D VQVAE编码器的多尺度几何特征注入输出标记中,使每个标记利用自身语义作为查询,通过交叉注意力机制从对应体素尺度中检索相关几何模式,并由可学习的兼容性门控控制注入强度;随后,增强后的标记经语义条件注意力聚合为具有空间感知能力的提示,并与点级特征一同传播以生成最终掩码,显著提升了空间定位准确性和零样本泛化能力。
链接: https://arxiv.org/abs/2605.01365
作者: Haowen Sun,Shaolong Zhang,Mingyang Li,Chengzhong Ma,Xinzhe Chen,Qiongjie Cui,Xingyu Chen,Zeyang Liu,Xuguang Lan
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially-aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real robot experiments confirm zero-shot transfer to novel objects.
[CV-159] AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification
【速读】:该论文旨在解决在资源受限的田间环境中,如何高效部署具备强大表征能力的视觉Transformer(Vision Transformer, ViT)模型以实现植物叶片病害自动分类的问题。由于ViT计算成本高,难以直接应用于边缘设备,现有方法在将ViT的丰富知识迁移至轻量级模型时效果不佳。解决方案的关键在于提出一种跨架构知识蒸馏框架AgriKD,通过在输出层、特征层和关系层等多个维度引入蒸馏目标,有效弥合Transformer与卷积神经网络(Convolutional Neural Network, CNN)之间的表征差异,使轻量级CNN学生模型能够更好地保留并利用来自ViT教师模型的全局语义信息,从而在显著降低参数量(约172倍)、计算成本(约47.57倍)和推理延迟(18–22倍)的同时,保持与教师模型相当的分类性能。
链接: https://arxiv.org/abs/2605.01355
作者: Minh-Dung Le,Minh-Duc Hoang,Hoang-Vu Truong,Thi-Thu-Hong Phan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 47 pages, 14 figures
Abstract:Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.
[CV-160] CHASE: Competing Hypotheses for Ambiguity-Aware Selective Prediction
【速读】:该论文旨在解决在局部观测不完整(partial observability)场景下,传统选择性预测方法因依赖单一预测分支的置信度分数而失效的问题。这类场景中,时间序列上的局部证据可能相互矛盾,导致标准置信度指标误导决策。解决方案的关键在于提出CHASE(Competing Hypotheses for Ambiguity-Aware Selective Prediction)框架,通过显式比较结构化的时间解释(structured temporal explanations)来判断是否做出决策或放弃预测;其核心机制是利用竞争假设之间的得分差距(hypothesis margins)构建一个排名感知的选择器(ranking-aware selector),从而在全球范围内区分可安全承诺的决策与本质不确定的情形,实现对结构化模糊性的有效建模和鲁棒决策。
链接: https://arxiv.org/abs/2605.01346
作者: Kartik Jhawar,Yuhao Geng,Atul N. Parikh,Lipo Wang
机构: Institute for Digital Molecular Analytics and Science, NTU, Singapore 636921; School of Electrical and Electronic Engineering, NTU, Singapore 639798; School of Materials Science and Engineering, NTU, Singapore 639798; Singapore Centre for Environmental Life Sciences Engineering, NTU, Singapore 637551
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Standard selective prediction methods typically estimate uncertainty from the output of a single predictive branch. While effective for general uncertainty estimation, these approaches often struggle under partial observability, where local temporal evidence can be contradictory and standard confidence scores become misleading. We introduce CHASE (Competing Hypotheses for Ambiguity-Aware Selective Prediction), a selective prediction framework that explicitly compares structured temporal explanations to determine whether to commit to a decision or abstain. Because genuine ambiguity causes the score gap between competing hypotheses to collapse, CHASE optimizes a ranking-aware selector over these hypothesis margins to globally separate safe commitments from fundamentally uncertain ones. We evaluate this framework on the problem of hidden connectivity inference, utilizing a controlled, physically grounded simulator inspired by the dynamics of giant unilamellar vesicles (GUVs), alongside zero-shot qualitative transfer (without retraining or fine tuning) to representative real GUV videos. Our experiments demonstrate that explicitly reasoning over competing hypotheses provides a superior balance of metrics. Compared to canonical uncertainty baselines, CHASE achieves statistically significant gains in overall no-abstain accuracy, three-way accuracy, and overall ambiguity-aligned abstention (at 80% coverage). Specifically, it yields up to an 11.0% relative mean improvement in overall alignment, alongside up to an 8.8% relative boost in three-way accuracy in the very-high ambiguity regime. By maintaining a selective risk boundary strictly at par with the best baselines at 80% coverage, and reducing overall risk by 9.9% at 90% coverage, this framework offers a more reliable approach to decision-making under structured ambiguity.
[CV-161] Active Reasoning Vision-Language Models via Sequential Experimental Design ICML2026
【速读】:该论文旨在解决现代视觉-语言模型(Vision-Language Models, VLMs)中因感知带宽瓶颈导致的细粒度信息丢失问题,即广视野视角不可避免地牺牲了复杂推理所需的精细细节。解决方案的关键在于将克服这一限制建模为一个顺序决策过程,并通过顺序贝叶斯最优实验设计(Sequential Bayesian Optimal Experimental Design, S-BOED)进行形式化。作者提出了一种在连续千兆像素空间中可计算的近似方法,在空间覆盖与分辨率之间取得平衡;并设计了一种无需训练的推理策略作为S-BOED目标的具体实现,该策略兼容多种优化算法(如贪婪采样或前瞻规划),从而有效提升模型在千兆像素级基准上的性能,显著优于标准基线并缩小与人工标注“黄金标准”之间的差距。
链接: https://arxiv.org/abs/2605.01345
作者: Anjie Liu,Ziqin Gong,Yan Song,Yuxiang Chen,Xiaolong Liu,Hengtong Lu,Kaike Zhang,Chen Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 27 pages, 5 figures, accepted at ICML 2026
Abstract:Visual perception in modern Vision-Language Models (VLMs) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field of view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspired by the classical paradigms of active vision and information foraging, we frame overcoming this limitation as a sequential decision-making process. We formalise this process through the lens of the sequential Bayesian optimal experimental design (S-BOED) problem. While exact Bayesian inference is intractable in continuous gigapixel spaces, we derive principled yet tractable approximations that balance spatial coverage against resolution. To validate this framework, we present a training-free inference strategy as a practical instantiation of the S-BOED objective for agents equipped with multiple vision tools. Designed as a flexible template, this strategy accommodates arbitrary optimisation algorithms, ranging from efficient greedy sampling to look-ahead planning, to approximate the optimal design. Empirical evaluations on gigapixel-level benchmarks demonstrate that our approach further boosts the performance of state-of-the-art models, significantly outperforming standard baselines and effectively narrowing the gap towards human-annotated oracles.
[CV-162] Zero-Shot Interpretable Image Steganalysis for Invertible Image Hiding
【速读】:该论文旨在解决现有图像隐写分析(image steganalysis)方法在应对新兴可逆图像隐藏(invertible image hiding)方案时的局限性,尤其是其在零样本(zero-shot)场景下泛化能力不足的问题。传统方法通常仅将图像分为“隐写”或“载体”两类,且要求训练与测试数据分布一致,难以适用于真实世界复杂多变的应用环境。解决方案的关键在于提出一种可解释的统一框架,将图像隐藏、还原与隐写分析集成一体,并赋予分析模块恢复嵌入秘密信息的能力;同时设计了一种简单而有效的残差增强策略用于生成隐写图像,显著提升了模型在跨数据集和跨架构场景下的鲁棒性与泛化性能。
链接: https://arxiv.org/abs/2605.01331
作者: Hao Wang,Yiming Yao,Yaguang Xie,Tong Qiao,Zhidong Zhao
机构: Hangzhou Dianzi University (杭州电子科技大学); Arcvideo Technology (弧视科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE SPL
Abstract:Image steganalysis, which aims at detecting secret information concealed within images, has become a critical countermeasure for assessing the security of steganography methods, especially the emerging invertible image hiding approaches. However, prior studies merely classify input images into two categories (i.e., stego or cover) and typically conduct steganalysis under the constraint that training and testing data must follow similar distribution, thereby hindering their application in real-world scenarios. To overcome these shortcomings, we propose a novel interpretable image steganalysis framework tailored for invertible image hiding schemes under a challenging zero-shot setting. Specifically, we integrate image hiding, revealing, and steganalysis into a unified framework, endowing the steganalysis component with the capability to recover the secret information embedded in stego images. Additionally, we elaborate a simple yet effective residual augmentation strategy for generating stego images to further enhance the generalizability of the steganalyzer in cross-dataset and cross-architecture scenarios. Extensive experiments on benchmark datasets demonstrate that our proposed approach significantly outperforms the existing steganalysis techniques for invertible image hiding schemes.
[CV-163] Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
【速读】:该论文旨在解决低比特量化(low-bit quantization)在视觉 Transformer(vision Transformer)部署中因激活异常值(activation outliers)导致的性能下降问题。现有方法要么在训练后进行量化处理,要么在训练过程中抑制大激活值,但这种激进的限制会破坏全精度与量化精度之间的平衡。论文提出的关键解决方案是引入一种结构正则化方法——共线性衰减(Colinearity-Decay, CD),其核心思想并非直接压制异常值,而是通过惩罚 Transformer 块内有序矩阵对之间的有害交叉对齐(detrimental cross-matrix alignment),从而控制异常值的结构性放大效应。CD 作为解耦更新机制,不改变模型架构或任务损失函数,具有非侵入性和极低的训练开销。实验表明,CD 在 ImageNet-1K 预训练、COCO 检测及下游微调等多个场景下均显著提升量化准确率,同时保持甚至改善全精度性能,证明了结构正则化在低比特部署中的有效性。
链接: https://arxiv.org/abs/2605.01330
作者: Jin Tong,Guang Liang,Peilin Sun,Jianxin Wu
机构: Nanjing University (南京大学); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5 figures
Abstract:Low-bit quantization is a practical route for efficiently deploying vision Transformers, yet activation outliers complicate fully quantized deployment. Existing methods either handle quantization post-training or suppress large activations during training; however, aggressively restricting outliers in vision models can lead to a poorer trade-off between full-precision and quantized accuracy. We argue that rather than simply suppressing outliers, the training objective should control the structural amplification that makes them harmful. To this end, we introduce Colinearity-Decay (CD), a structural regularizer for ordered matrix pairs within Transformer blocks. CD penalizes detrimental cross-matrix alignment and mitigates extreme activations without altering the architecture or task loss. Applied as a decoupled update, CD is non-invasive and introduces minimal training overhead. Across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning, CD consistently boosts quantized accuracy across multiple pipelines while preserving, or even improving, full-precision performance. Ultimately, our results demonstrate that structural regularization effectively prepares vision Transformers for low-bit deployment with zero inference-time overhead.
[CV-164] Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance CVPR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中视觉编码器(vision encoder)选择缺乏理论指导的问题,即如何系统性地判断一个预训练视觉编码器是否适合用于VLM对齐。此前常见的做法如选择参数量最大或零样本准确率最高的视觉编码器,未能有效预测最终VLM性能,且相关性较弱。论文的关键解决方案在于提出以跨模态结构相似性为核心指标,并引入Gromov-Wasserstein距离作为无监督、推理阶段即可计算的代理指标,理论上证明该距离与跨模态映射的学习能力存在可证关联,实验证明其在60余次完整VLM训练中显著优于现有策略,展现出更强的预测能力,从而实现高效、可靠的视觉编码器筛选。
链接: https://arxiv.org/abs/2605.01325
作者: Muyang Li,Yucheng Liu,Jianbo Ma,Elliot Osborne,Bo Han,Tongliang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as Highlight publication for CVPR 2026
Abstract:Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, there still lacks a principled understanding of what makes a vision encoder suitable for VLM alignment. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding begs a fundamental question: What factors of vision-encoders matter in VLM? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
[CV-165] Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLM s CVPR2026
【速读】:该论文旨在解决轻量级多模态大语言模型(Multimodal Large Language Models, MLLMs)在基于强化学习(Reinforcement Learning, RL)微调过程中因数据偏差导致的感知偏差(Perceptual Bias)问题,即模型倾向于依赖数据中固有的感知捷径(Perceptual Shortcuts),而非发展出真正的推理能力。解决方案的关键在于提出 VideoThinker 框架,其核心是通过两阶段去偏机制实现:首先在“感知意识训练”阶段构建一个专门表征捷径行为的“偏见模型”(Bias Model);随后在“因果去偏策略优化”(Causal Debiasing Policy Optimization, CDPO)阶段,利用创新的排斥目标函数,主动将主模型从偏见模型的错误逻辑中推开,同时拉向正确且泛化的推理路径,从而显著提升轻量模型的视频推理性能。
链接: https://arxiv.org/abs/2605.01324
作者: Jingze Wu,Quan Zhang,Hongfei Suo,Zeqiang Cai,Hongbo Chen
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Poster
Abstract:Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge this http URL address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning this http URL by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated “bias model” to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model’s flawed logic while simultaneously pulling it toward correct, generalizable this http URL model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code is available at this https URL.
[CV-166] PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression
【速读】:该论文旨在解决LiDAR点云压缩中因基于八叉树(octree)结构的 learned entropy modeling 所导致的两个关键瓶颈:一是由于因果性、多阶段上下文建模带来的解码延迟过高;二是性能与延迟之间存在刚性权衡,使得单一模型难以适应不同应用场景的约束。解决方案的关键在于提出PACE框架,将祖先上下文聚合重构为非因果主干网络,并将因果性限制在轻量级、可扩展的预测器中,从而避免重复执行主干网络,显著降低计算开销;该预测器支持任意数量的预测阶段,可在不重新加载参数的情况下实现灵活的性能-延迟权衡,从而在保持高压缩效率的同时将自回归模式下的解码延迟降低90%以上。
链接: https://arxiv.org/abs/2605.01320
作者: Jiahao Zhu,Kang You,Dandan Ding,Zhan Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, supporting seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, highly attractive for practical applications.
[CV-167] CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning CVPR2026
【速读】:该论文旨在解决长尾分布(long-tailed distribution)下类别间概念混淆问题,即在单标签监督(single-label supervision)机制中,由于尾部类别样本稀缺导致特征共享受抑制、头部类别主导性增强,从而破坏类间可分性(inter-class discriminability)。其解决方案的关键在于提出CUE(Concept-aware mUlti-label Expansion),通过引入多标签概念信号来恢复被破坏的类间关系:具体而言,CUE利用零样本CLIP提取实例级视觉线索和大语言模型(LLM)生成类别级语义线索,构建概念集,并通过分别加权的二元逻辑调整(Binary Logit-Adjustment, BLA)辅助损失与基础逻辑调整(Logit-Adjustment, LA)损失联合优化,实现对长尾分布下类间关系的显式建模与增强。
链接: https://arxiv.org/abs/2605.01309
作者: Ruichi Zhang,Chikai Shang,Jiacheng Yang,Mengke Li,Yang Zhou,Junlong Gao,Yang Lu
机构: Xiamen University (厦门大学); Shenzhen University (深圳大学); A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages. Accepted by CVPR 2026
Abstract:Long-tailed distributions are common in real-world recognition tasks, where a few head classes have many samples while most tail classes have very few. Recently, fine-tuning foundation models for long-tailed learning has gained attention due to their excellent performance. However, most existing methods focus solely on mitigating long-tailed distribution bias while overlooking concept confusion caused by the long-tailed distribution. In this paper, we study this problem and attribute it to the mutual exclusivity of single-label supervision under long-tailed distributions, which suppresses feature sharing among related classes and amplifies the dominance of head classes, leading to disrupted inter-class discriminability. To address this, we propose CUE, Concept-aware mUlti-label Expansion, which introduces multi-label concept signals to preserve disrupted inter-class relationships. Specifically, CUE constructs concept sets by (i) extracting instance-level visual cues from zero-shot CLIP and (ii) generating class-level semantic cues with LLM; the two cues are incorporated via separately weighted Binary Logit-Adjustment (BLA) auxiliary losses and jointly optimized with the baseline Logit-Adjustment (LA) loss. Experiments on several long-tailed benchmarks, CUE achieves balanced and strong performance, surpassing recent state-of-the-art methods. Code is available at: this https URL.
[CV-168] Checkerboard: A Simple Effective Efficient and Learning-free Clean Label Backdoor Attack with Low Poisoning Budget
【速读】:该论文旨在解决清洁标签后门攻击(clean-label backdoor attack)中普遍存在的效率低、依赖复杂优化或额外数据访问等问题。现有方法通常需要昂贵的优化过程、训练代理模型或对数据集有非平凡的访问权限,限制了其实际可行性与隐蔽性。解决方案的关键在于提出一种无需学习(learning-free)的理论驱动型攻击方法 Checkerboard:首先基于线性可分性理论推导出闭式解形式的棋盘状触发器(checkerboard trigger),从而省去传统攻击中所需的代理模型训练和触发器优化步骤;其次针对纹理丰富数据集引入**复杂度驱动样本选择(Complexity-driven Sample Selection)**策略,仅使用目标类图像通过选择低复杂度样本提升触发器与背景的对比度,增强攻击效果。实验表明,该方法在多个基准数据集上均优于8种基线攻击,在极低污染预算下实现高成功率(如CIFAR-10上仅用20个样本即可达到99.99%攻击成功率达),且具备对抗先进防御机制的能力。
链接: https://arxiv.org/abs/2605.01298
作者: Yi Yang,Jinyang Huang,Binbin Liu,Feng-Qi Cui,Xiaokang Zhou,Zhi Liu,Jie Zhang,Meng Li
机构: Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学); Kansai University (关西大学); RIKEN Center for Advanced Intelligence Project (理化学研究所先进智能项目中心); The University of Electro-Communications (电波通信大学); Agency for Science, Technology and Research (A*STAR) (新加坡科技研究局)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Backdoor attacks threaten the deep learning supply chain by poisoning a small fraction of the training data so that a model behaves normally on clean inputs but misclassifies trigger-carrying inputs to an attacker-chosen target class. Clean-label backdoor attacks are especially dangerous because poisoned samples remain label-consistent and are therefore harder to detect. Yet existing clean-label attacks typically rely on expensive optimization, surrogate-model training, or nontrivial data access. We present Checkerboard, a theoretically grounded, learning-free clean-label backdoor attack that is effective, efficient, and simple to implement. From a linear separability formulation, we derive a checkerboard trigger in closed form, removing the need for surrogate-model training and trigger optimization. For texture-rich datasets, we introduce Complexity-driven Sample Selection, which uses only target-class data to improve trigger-to-background contrast by selecting low-complexity images for poisoning. Across four benchmark datasets, Checkerboard outperforms 8 baseline attacks and achieves state-of-the-art performance under low poisoning budgets. For example, on CIFAR-10, under a trigger perturbation budget of 10/255 , poisoning 20 training samples achieves 99.99% Attack Success Rate (ASR). On ImageNet-100, a poisoning rate of only 0.46% yields over 94% ASR without degrading clean accuracy. The proposed attack also remains effective against state-of-the-art backdoor defenses and shows strong resistance to adaptive defenses. Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV) MSC classes: 68 Cite as: arXiv:2605.01298 [cs.CR] (or arXiv:2605.01298v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.01298 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-169] SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On ICPR2026
【速读】:该论文旨在解决基于扩散模型(diffusion-based)的虚拟试衣(virtual try-on)方法在细节保留上的不足,特别是文本和图案等精细特征因依赖隐式空间对应学习而易丢失的问题。解决方案的关键在于引入SIFT关键点匹配(SIFT keypoint matching)以提供显式的几何引导:通过领域特定的过滤策略提取服装与人体图像间的可靠关键点对应关系,并将其转化为空间概率分布,用于监督训练过程中交叉注意力(cross-attention)层的学习。这一显式监督机制促使模型聚焦于几何一致的服装区域,从而显著提升文本清晰度和图案对齐效果,同时保持配对重建指标的竞争力。
链接: https://arxiv.org/abs/2605.01296
作者: Kosuke Takemoto,Takafumi Koshinaka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPR2026
Abstract:Diffusion-based virtual try-on methods achieve photorealistic synthesis through cross-attention mechanisms that transfer garment features to target body regions. However, these approaches rely on implicit learning of spatial correspondences, struggling to preserve fine details such as text and illustrations. We propose a novel approach, which we call SIFT-VTON, that utilizes SIFT keypoint matching to provide explicit geometric guidance for diffusion-based virtual try-on. Our method applies domain-specific filtering to SIFT keypoint matches between garment and person images, then converts these correspondences into spatial probability distributions that supervise cross-attention layers during training. This explicit supervision guides the model to learn precise spatial alignment, concentrating attention on geometrically consistent garment regions. Experiments on the VITON-HD dataset demonstrate significant improvements on unpaired metrics while maintaining competitive paired reconstruction metrics. Qualitative comparisons show superior preservation of text clarity and pattern alignment. Attention visualizations confirm that our method produces sharply focused attention on relevant garment details. This work demonstrates that classical geometric correspondence methods can effectively enhance modern diffusion models for conditional synthesis tasks. The source code will be available at this https URL.
[CV-170] Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
【速读】:该论文旨在解决植物叶片病害分类中因数据集不足与模型训练效率低导致的识别准确率不高、泛化能力弱的问题。其关键解决方案在于:首先系统梳理并构建了适用于植物叶片病害分类的开放数据集,结合增强技术适用性研究优化数据质量;随后基于DenseNet201架构设计了一种新基础模型(Base Model),并通过迁移学习(Transfer-Learning, TL)策略在多个数据集上验证其优越性——该模型在训练速度、鲁棒性、稳定性及数据需求量方面均优于通用模型,显著提升了小样本场景下的性能表现。
链接: https://arxiv.org/abs/2605.01283
作者: David J. Richter
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Master’s thesis
Abstract:Plants, crops and their yields are essential to our very existence, but diseases and pests cause large losses every year. As such it is vital to ensure that diseases can be spotted early and treated accordingly and stopping the spread while still possible. Manual and traditional methods require personal to walk through the field and check for symptoms ‘by hand’. This is very laborious and very time consuming, so ML methods have been applied as a result and they have garnered promising results. CNN models are especially efficient as they can automatically extract features from images without any manual feature construction before then feeding the features to a classifier. Datasets are largely influential to the final performance of the model. Despite the importance that datasets pose to the field, there still seems to be somewhat of a discrepancy between what is publicly available for use and what would be required to sufficiently train fully capable models. To overcome these shortcomings, as part of this thesis open datasets for the field of plant leaf disease classification have been identified as well as models that can be trained on them and extensive benchmarks have been carried out to identify their suitability. Then a new dataset was constructed based on those findings as well as on the findings of a augmentation applicability study, which will be used to train a new Base Model based on the DenseNet201 architecture, which managed to outperform the baseline model on said new dataset as well as outperforming it on plant leaf disease classification domain specific Transfer-Learning experiments on another new dataset. This new model manages to train models through Transfer-Learning (TL) faster, more robust, more stable, and with less data than general model would, overcoming a large number of issues that the field still suffers from.
[CV-171] CNN-based Multi-In-Multi-Out Model for Efficient Spatiotemporal Prediction
【速读】:该论文旨在解决基于卷积神经网络(CNN)和Transformer架构的时空预测模型中存在的局限性问题:CNN模型因卷积核的局部感受野难以捕捉全局信息,且在处理时序数据时将时间轴与通道轴混合导致信息混淆;而Transformer模型则因自注意力机制计算复杂度高、训练耗时长。其解决方案的关键在于提出一种新型结构——基于CNN的多输入多输出高效时空预测模型(MIMO-ESP),该模型通过结合CNN与Transformer的优势,在保持低复杂度的同时引入全局建模能力;同时将时间轴作为独立维度处理,并利用膨胀卷积(dilation)有效融合时空特征,从而实现高效且高性能的时空预测。
链接: https://arxiv.org/abs/2605.01277
作者: Hyeonseok Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Master’s thesis
Abstract:Recently, Convolutional Neural Network (CNN) or Transformer architecture based models have been proposed to overcome the limitations of Recurrent Neural Network (RNN) based models in spatiotemporal prediction. These models prevent the inefficiency of parallelization limitation due to the sequential properties and stacked error due to the recursive method, and show high performance. Novertheless, there are still some challengies. First, CNN based models have difficulty considering global information due to the local properties of the kernel, and their performance is limited. In addition, information is mixed because the time axis is combined with the channel axis of the image for processing. Models based on Transformer architecture have high complexity due to the self-attention calcuation and take a long training time. In this paper, we propose a new structure model called CNN-based Multi-In-Multi-Out model for Efficient Spatiotemporal Prediction (MIMO-ESP) to overcome these limitations. MIMO-ESP considers global information and significantly improves complexity by configuring a Transformer architecture based on CNN. In addition, it treats the time axis as an independent axis without combining it, and effectively considers spatiotemporal information together by applying dilation. This structure makes MIMO-ESP efficient and high performance. Extensive experiment results on three promising benchmark datasets which including video, traffic, and precipitation prediction tasks demonstrate that the usefulness of MIMO-ESP due to the achieved competitive efficiency while outperforming existing models. Furthermore, the ablation study results demonstrate the usefulness of the components of MIMO-ESP, emphasizing the potential of the proposed approaches.
[CV-172] GameScope: A Multi-Attribute Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment
【速读】:该论文旨在解决当前游戏视频质量评估模型在不同编码格式(codec)下难以保持一致性的问题,因为现有主观质量数据集数量有限且多样性不足。其解决方案的关键在于构建迄今为止规模最大、涵盖最广的游戏视频质量数据集,该数据集融合了用户生成内容(UGC)与专业生成内容(PGC),覆盖H.264、H.265和AV1三种主流编码格式,包含4,048个视频样本,并为每个样本提供平均37次的平均意见分数(MOS)标注;此外,还引入细粒度的质量属性标签,以更深入理解感知因素。实验表明,基于该数据集训练的视觉语言模型在性能上优于所有基准方法,标志着首个跨编码格式与内容类型、具备质量属性标注的综合性游戏视频质量评估数据集的建立。
链接: https://arxiv.org/abs/2605.01272
作者: Rajesh Sureddi,Shreshth Saini,Avinab Saha,Alan C. Bovik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:The development of video game streaming has grown rapidly, with major platforms such as YouTube and Twitch using different codecs. To support quality assessment models that work consistently across any codec, it is necessary to have access to large, diverse subjective gaming quality datasets. Currently, there are only a few available, each having limitations. To address this gap, we present the largest gaming video quality dataset to date, incorporating both user-generated content (UGC) and professional-generated content (PGC) with extensive visual diversity. Our dataset covers the most widely used codecs - H.264, H.265, and AV1 - and consists of 4,048 video samples, each annotated by an average of 37 mean opinion score (MOS) ratings. In addition to overall quality scores, we collect coarse-grained quality attributes, enabling a better understanding of perceptual factors. We study the performance of leading video quality assessment methods on this dataset, including a vision language model that outperforms all the benchmarks. To the best of our knowledge, this is the first dataset that comprehensively addresses gaming video quality assessment across multiple codecs and content types with quality attributes. Our dataset is publicly available at this https URL.
[CV-173] Exploring Prompt Alignment with Clinical Factors in Zero-Shot Segmentation VLMs for NSCLC Tumor Segmentation
【速读】:该论文旨在解决零样本视觉语言模型(Zero-shot Vision-Language Models, VLMs)在非小细胞肺癌(Non-Small-Cell Lung Cancer, NSCLC)肿瘤靶区勾画(Gross Tumor Volume, GTV delineation)任务中,其空间注意力机制受哪些提示维度(prompt dimensions)调控的问题。现有方法缺乏对提示结构如何影响模型空间行为的理解,限制了其可解释性和可靠性。解决方案的关键在于系统性地分解提示为诊断、人口统计学、分期、解剖位置、通用描述和无关控制等子提示模块,并通过属性扰动鲁棒性测试、特异性梯度分析以及跨病例提示交换实验,揭示解剖位置是主导空间注意力的核心因素(63.4%的位置扰动导致严重性能下降),而病理类型与分期的影响微弱;在此基础上,VoxTell模型在完全零样本条件下实现了与微调模型(如nnUNet)相当的Dice相似系数(DSC=0.613 vs. 0.690, p=0.156),验证了提示维度分析对于评估分割型VLMs的重要性——即不仅应关注Dice指标,更需考察模型对提示各维度的对齐行为。
链接: https://arxiv.org/abs/2605.01266
作者: Suraj Pai,Thibault Heintz,Cosmin Ciausu,Marion Tonneau,Hugo Aerts,Raymond Mak
机构: Mass General Brigham (麻省总医院); Harvard Medical School (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-shot vision-language models (VLMs) offer a promptable alternative to task-specific training for gross tumor volume (GTV) delineation in non-small-cell lung cancer (NSCLC), but the prompt dimensions that govern their spatial behavior remain poorly understood. We study this question by probing alignment directions in VoxTell on a held-out internal NSCLC tumor dataset through sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls; attribute-wise perturbation robustness; specificity ladders; and cross-case prompt swaps, while benchmarking against fine-tuned and zero-shot baselines using the Dice Similarity Coefficient (DSC) with Wilcoxon signed-rank tests and Benjamini-Hochberg correction. Alignment analyses revealed that anatomical location is the dominant driver of VoxTell’s spatial attention: 63.4 percent of location perturbations caused catastrophic drops, prompt specificity improved from generic to full descriptions except for diagnosis-only prompts, irrelevant prompts correctly yielded zero segmentation, and cross-case prompt swaps confirmed patient-specific conditioning (matched DSC 0.906 vs. mismatched 0.406). Histology and stage substitutions had minimal effect, indicating that the model prioritizes “where to look” over “what to look for.” In this context, VoxTell, operating fully zero-shot, achieved a mean DSC of 0.613, statistically indistinguishable from nnUNet (0.690, adjusted p = 0.156) and Ahmed et al. (0.675, adjusted p = 0.679), while significantly outperforming all other zero-shot models. Together, these findings argue that segmentation VLMs should be evaluated not only by Dice, but also by the prompt dimensions to which they align.
[CV-174] Degradation-Aware Adaptive Context Gating for Unified Image Restoration
【速读】:该论文旨在解决统一图像恢复(Unified Image Restoration)中因多样化退化类型导致的任务干扰问题。其核心挑战在于如何在单一模型中有效区分并适应不同退化特征,从而避免噪声传播和结构信息损失。解决方案的关键是提出DACG-IR(Degradation-Aware Adaptive Context Gating)框架,通过构建退化感知的上下文表示来动态调节特征表示:首先设计轻量级多尺度退化感知模块提取粗粒度退化信息并生成逐层提示(layer-wise prompts),用以引导编码器和解码器中注意力温度与输出门控机制;其次引入空间-通道双门控自适应融合机制,抑制浅层噪声向深层传播,从而在保留重要结构信息的同时有效抑制退化引起的噪声。
链接: https://arxiv.org/abs/2605.01236
作者: Lei He,Jielei Chu,Fengmao Lv,Weide Liu,Tianrui Li,Jun Cheng,Yuming Fang
机构: Southwest Jiaotong University (西南交通大学); Institute for Infocomm Research, Agency for Science, Technology and Research (新加坡科技研究局信息通信研究所); Jiangxi University of Finance and Economics (江西财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unified image restoration using a single model often faces task interference due to diverse degradations. To address this, we propose DACG-IR (Degradation-Aware Adaptive Context Gating), which enables explicit perception of degradation characteristics to dynamically modulate feature representations. Our method constructs degradation-aware contextual representations from the input to modulate attention distribution, frequency-domain features, and feature aggregation. Specifically, a lightweight multi-scale degradation-aware module extracts coarse degradation information and generates layer-wise prompts. These prompts guide attention temperature and output gating in encoder and decoder blocks for adaptive feature extraction. Additionally, a spatial-channel dual-gated adaptive fusion mechanism refines encoder features, suppressing noise propagation from shallow to deep layers. This design effectively suppresses degradation-induced noise while preserving informative structures. Experiments show DACG-IR outperforms state-of-the-art methods in single-task, all-in-one, adverse weather removal, and composite degradation settings. Code: this https URL
[CV-175] 4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos
【速读】:该论文旨在解决从单目广播视频中高保真重建乒乓球比赛轨迹的难题,尤其是现有方法在遮挡和多视角变化下难以实现可靠的时间分割与三维重建的问题。解决方案的关键在于提出一种“先提升后分割”的新范式:首先利用学习得到的提升网络将未分割的二维球轨迹整体映射为三维轨迹,从而克服因遮挡和相机视角变化导致的2D时间分割失效问题;随后基于稳定的3D轨迹进行精确的时间分段,并进一步推断球的旋转、处理不可靠检测点,最终实现高精度的三维球轨迹和人体网格重建。此设计使该方法成为唯一能从通用视角单目广播视频中重建乒乓球比赛数据的方法。
链接: https://arxiv.org/abs/2605.01234
作者: Nima Rahmanian,Daniel Kienzle,Thomas Gossard,Dvij Kalaria,Rainer Lienhart,Shankar Sastry
机构: University of California, Berkeley (加州大学伯克利分校); University of Augsburg (奥格斯堡大学); University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides 140+ hours of reconstructed singles and doubles gameplay from monocular broadcast videos, featuring multimodal annotations like high-quality camera calibrations, precise 3D ball positions, ball spin, time segmentation, and 3D human meshes over time. This rich data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset’s combination of scale and precision is achieved through a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based time segmentation collapses under occlusion and varied camera viewpoints, preventing reliable reconstruction. We invert this paradigm by first lifting the entire unsegmented 2D ball track to 3D through a learned lifting network. This 3D trajectory then allows us to reliably perform time segmentation. The learned lifting network also infers the ball’s spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory in cases of high occlusion. This lift-first design is necessary, as our pipeline is the only method capable of reconstructing table tennis gameplay from general-view broadcast monocular videos. We demonstrate the dataset’s fidelity through two downstream tasks: estimating the racket’s pose \ velocity at impact, and training a generative model of competitive rallies.
[CV-176] Visual Implicit Autoregressive Modeling ICML2026
【速读】:该论文旨在解决视觉自回归建模(Visual Autoregressive Modeling, VAR)中因显式深层堆叠导致的计算量固定、高分辨率下内存消耗过大等问题。其解决方案的关键在于提出视觉隐式自回归建模(Visual Implicit Autoregressive Modeling, VIAR),通过在浅层预/后处理模块之间嵌入一个隐式平衡层(implicit equilibrium layer),并采用无雅可比反向传播(Jacobian-Free Backpropagation)进行训练,从而实现训练时内存恒定;同时在推理阶段引入每尺度迭代控制参数(per-scale iteration knob),使模型能够在不重新训练的情况下灵活调节计算量与内存占用,兼顾生成质量与部署效率。
链接: https://arxiv.org/abs/2605.01220
作者: Pengfei Jiang,Jixiang Luo,Luxi Lin,Zhaohong Huang,Xuelong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Visual Autoregressive Modeling (VAR) based on next-scale prediction achieves strong generation quality, but their explicit deep stacks fix the amount of computation per scale and inflate memory at high resolutions. We introduce Visual Implicit Autoregressive Modeling (VIAR), a next-scale autoregressive generator that embeds an implicit equilibrium layer between shallow pre/post blocks. The implicit layer is trained with Jacobian-Free Backpropagation, yielding constant training memory, while inference exposes a per-scale iteration knob that enables compute control. On ImageNet 256x256 benchmark, VIAR attains FID 2.16, and sFID 8.07 with only 38.4% parameters of VAR, matching or surpassing strong AR baselines and remaining competitive with large diffusion models. By controlling the per-scale knob, VIAR can reduce peak memory from 19.24 GB to 8.53 GB and doubles throughput from 15.16 to 32.08 images/s on a single RTX 4090, without retraining. Ablations show that fewer steps are sufficient for fixed-point iterations to converge and that VIAR consistently dominates VAR across quality efficiency operating points. In zero shot in-painting and class-conditional editing, VIAR produces sharper details and smoother boundaries while preserving global structure, validating the benefits of implicit equilibria and per-scale compute control for practical, deployable visual generation.
[CV-177] Multimodal Confidence Modeling in Audio-Visual Quality Assessment ICIP2026
【速读】:该论文旨在解决现有音频-视觉质量评估(Audio-Visual Quality Assessment, AVQA)方法在真实流媒体场景中因模态失真不对称而导致的性能下降问题。具体而言,当某一模态(如视频或音频)严重退化而另一模态保持清晰时,传统AVQA指标由于假设两个模态可靠性相同,采用无信心感知的融合策略,容易过度依赖不可靠信号,从而影响评估准确性。解决方案的关键在于提出MCM-AVQA框架,其核心创新是引入模态特定的置信度估计机制,并通过一个基于置信度引导的音频-视觉混合器(Audio-Visual Mixer)实现跨模态注意力控制。该混合器利用帧级置信度指导的通道注意力门控机制,在特征融合过程中动态调节不同模态间的交互强度,使高置信度模态主导融合结果,同时抑制低置信度输入,从而保留时间维度上的退化模式并提升与人类主观评分的一致性。
链接: https://arxiv.org/abs/2605.01219
作者: Mayesha Maliha R. Mithila,Mylene C.Q. Farias
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Image and Video Processing (eess.IV)
备注: Accepted at ICIP 2026, 6 pages, 4 figures, no supplementary material
Abstract:Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual confidence estimator turns frame-level artifact probabilities into temporally smoothed, clip-level visual confidence scores, while an audio confidence module derives confidence from speech-quality cues without requiring a clean reference. Experiments on multiple AVQA benchmarks show that MCM-AVQA, and specifically its confidence-guided Audio-Visual Mixer, improve correlation with human mean opinion scores and yield more interpretable behavior under real-world asymmetric audio-visual distortions.
[CV-178] Asymmetric Invertible Threat: Learning Reversible Privacy Defense for Face Recognition
【速读】:该论文旨在解决人脸识别(Face Recognition)系统中因未经授权收集和滥用面部数据而引发的隐私保护问题,尤其针对现有基于输入空间扰动的对抗性隐私保护方法在面对对手学习到恢复或净化映射时防护效果下降的问题。其关键解决方案是提出一种不对称可逆人脸保护机制(Asymmetric Reversible Face Protection, ARFP),通过引入密钥条件流形绑定(Key-Conditioned Manifold Binding)、对抗性恢复感知训练(Adversarial Restoration-Aware Training)以及授权可逆恢复(Authorized Reversible Restoration)三个核心组件,实现了隐私保护、密钥驱动的数据恢复与篡改指示的一体化设计,从而在保障合法用户恢复能力的同时,增强对逆向净化攻击的鲁棒性。
链接: https://arxiv.org/abs/2605.01217
作者: Jiabei Zhang,Ziyuan Yang,Andrew Beng Jin Teoh,Yi Zhang
机构: Sichuan University (四川大学); Nanyang Technological University (南洋理工大学); Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face Recognition systems are widely deployed in real-world applications, but they also raise privacy concerns due to unauthorized collection and misuse of facial data. Existing adversarial privacy protection methods rely on input-space perturbations to obfuscate identity information, yet their protection can degrade when adversaries learn restoration or purification mappings that partially invert the transformation. We study this setting as an asymmetric adversarial attack, in which reverse manipulation becomes feasible because existing defense paradigms do not control reversibility. To address this problem, we propose Asymmetric Reversible Face Protection (ARFP), a restoration-aware extension of personalized face cloaking that integrates privacy protection, keyed recovery, and tamper indication in a single framework. ARFP consists of three components: Key-Conditioned Manifold Binding, which ties the protection transformation to a user-provided key; Adversarial Restoration-Aware Training, which introduces a surrogate restoration adversary during training to improve robustness against evaluated inverse purification attacks; and Authorized Reversible Restoration, which supports recovery with the correct key while providing nonce-based tamper indication. Extensive experiments under the threat models considered in this work show that ARFP improves resistance to the evaluated restoration attacks while preserving authorized recovery utility. These results provide empirical evidence of key-sensitive recovery behavior and tamper awareness in the tested settings.
[CV-179] Phase-map synthesis from magnitude-only MR images using conditional score-based diffusion models with application in training of accelerated MRI reconstruction models
【速读】:该论文旨在解决深度学习(Deep Learning, DL)驱动的加速磁共振成像(Accelerated Magnetic Resonance Imaging, MRI)中训练数据稀缺的问题。由于临床实践中通常仅保存幅度图像(magnitude-only images),而舍弃原始k空间数据(raw k-space data),导致可用于训练通用DL重建模型的大规模多样化数据集难以获取。为此,作者提出一种基于条件得分引导扩散模型(conditional score-based diffusion models, SBDMs)的生成式解决方案:通过给定幅度图像,合成与其在图像域中合理对应的相位图(phase map),从而构建完整的k空间数据用于训练。该方案的关键在于利用SBDM从现有大规模匿名幅度图像数据库中高效生成高质量、物理一致的相位信息,进而生成可用于训练的k空间数据集,显著提升下游DL重建模型的性能和诊断可靠性。
链接: https://arxiv.org/abs/2605.01185
作者: M. Berk Sahin,Dilek Yalcinkaya,Abolfazl Hashemi,Behzad Sharif
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accelerated magnetic resonance imaging (MRI) enabled by the training of deep learning (DL)-based image recon. models requires large and diverse raw k-space datasets. In most clinical MRI applications, due to storage and patient privacy concerns, raw k-space data is discarded and magnitude-only images are the only component saved. Consequently, a large portion of the DL-based MRI recon. literature has either relied on small training datasets or has used one of the few available open-source k-space datasets. At the same time, the growing number of anonymized magnitude-only image registries/databases motivates the development of techniques that can use them as training datasets for generalizable DL-based recon. models. Here we propose to address this challenge by employing a generative approach based on conditional score-based diffusion models (SBDMs): given a magnitude-only MR image, it synthesizes a phase map (in the image domain) that realistically corresponds to the magnitude-only image. We evaluate its generative capabilities in a downstream DL-based recon. task whereby a large k-space dataset is generated by combining the SBDM-synthesized phase-maps and the corresponding magnitude-only images, and this k-space dataset is then used to train a DL model for accelerated MRI recon. We compare the performance of the resulting DL model versus those trained according to (a) a naive approach that uses smooth phase, (b) a k-space training dataset generated using synthesized phase maps derived from a generative adversarial network, and © the ground truth k-space data. Our results suggest that the DL model trained from SBDM-synthesized k-space data outperforms the other approaches in terms of quantitative metrics as well as qualitatively observed recon. fidelity, i.e., whether the reconstructed images include erroneous or hallucinated features that could adversely impact diagnostic accuracy.
[CV-180] CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization
【速读】:该论文旨在解决从几何输入(如网格或点云)中恢复可编辑的参数化CAD构造序列这一关键挑战,现有方法通常受限于难以编辑的格式(如网格或B-rep)或仅支持简单草图-拉伸流程且适用于低复杂度数据集。其解决方案的核心在于提出一种基于优化的混合框架CADFit,通过增量式拟合与验证参数化操作,并利用几何反馈驱动IoU(交并比)优化,从而从网格中重建出复杂的、可编辑的CAD构造序列。该方法创新性地将重建任务建模为结构化CAD程序上的优化问题,并支持包括拉伸、旋转、倒圆和倒角在内的丰富操作,显著提升了重建精度与有效性,尤其在复杂设计场景下大幅降低了无效CAD程序的比例。
链接: https://arxiv.org/abs/2605.01171
作者: Ghadi Nehme,Eamon Whalen,Faez Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, is a key challenge for design and manufacturing, as existing CAD reconstruction and generation methods are largely restricted to difficult-to-edit formats like meshes or Breps or editable simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering.
[CV-181] CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition ICPR
【速读】:该论文旨在解决零样本动作识别(Zero-Shot Action Recognition, ZSAR)中的两个核心问题:语义鸿沟(semantic gap)和域偏移(domain shift)。语义鸿沟源于文本模态(如语言模型)与视觉模态(如CNN、Transformer等)之间的表征差异,而域偏移则由于训练集与测试集分布不一致导致。为应对这两个挑战,论文提出了一种基于对比学习的新方法,其关键在于构建一个联合嵌入空间(joint embedding space),将视频和自然语言描述映射到同一空间中进行对齐训练;同时设计了自动负采样机制以生成未配对数据(即视觉外观与无关描述的组合),从而增强模型的泛化能力。该方案在UCF-101和Kinetics-400数据集上实现了当前最优性能。
链接: https://arxiv.org/abs/2605.01165
作者: Valter Estevam,Rayson Laroca,Helio Pedrini,David Menotti
机构: Pontifícia Universidade Católica do Paraná (天主教帕拉纳联邦大学); Fundação Araucária (阿劳卡里亚基金会); PROEX-IFPR (IFPR卓越计划)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the International Conference on Pattern Recognition (ICPR) 2026
Abstract:This paper proposes a novel Zero-Shot Action Recognition~(ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. A semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (i.e., CNNs, RNNs, transformer-based). This multimodal nature implies that the semantic properties of the two spaces are not identical. On the other hand, the domain shift arises from differences between the training and test sets and is inherent to ZSAR once the test set is unknown. One of the most promising methods to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure to augment the training dataset and generate unpaired data, i.e., visual appearance and unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at this https URL.
[CV-182] Semantic Context-aware mOdality fUsion Transformer (SCOUT): A Context-Aware Multimodal Transformer for Concept-Grounded Pathology Report Generation
【速读】:该论文旨在解决病理图像分析中生成式AI(Generative AI)模型缺乏临床基础的问题,即现有模型虽能流畅生成病理报告,但难以准确反映病理学家所关注的关键诊断概念及其相互关系。其核心挑战在于如何整合从细胞级细微结构到整体组织架构的多尺度视觉证据,并保持可解释性与临床一致性。解决方案的关键是提出SCOUT框架——一种语义上下文感知的模态融合Transformer,通过逐步引入全局切片信息和显式诊断概念对图像表示进行条件化建模,实现局部组织学特征、全切片上下文与专家标注语义描述的统一学习,从而在文本生成过程中动态优化视觉特征表达,确保报告的临床合理性与多尺度互补性。
链接: https://arxiv.org/abs/2605.01144
作者: Suryakant Singh,Saarthak Kapse,Joel Saltz,Prateek Prasanna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Whole-slide images (WSIs) present a fundamental challenge for computational pathology due to their extreme resolution, multi-scale heterogeneity, and the requirement for clinically reliable interpretation. Although recent pathology foundation models have enabled fluent report generation, they often lack clinical grounding, failing to accurately represent key diagnostic concepts and relationships observed by pathologists. This limitation arises from the difficulty of integrating heterogeneous visual evidence spanning fine-grained cellular patterns, slide-level tissue architecture, and high-level diagnostic concepts, while maintaining interpretability and clinical coherence. Here we present SCOUT: Semantic Context-aware mOdality fUsion Transformer, a context-aware concept-grounded multimodal framework for pathology report generation that enables progressive conditioning of image representations by global slide information and explicit diagnostic concepts. The method integrates local histological patterns, whole-slide context, and expert-curated semantic descriptors within a unified learning paradigm, allowing visual features to be dynamically refined throughout the encoding process. By combining depth-aware contextual modulation with adaptive multimodal fusion during text generation, the framework produces clinically coherent reports while preserving complementarity across representational scales. Using CONCH1.5 features, we evaluate SCOUT against WSI-Caption, HistGen, and BiGen on TCGA-BRCA, MICCAI REG, and HistAI. SCOUT achieves the best BLEU-1 to BLEU-4 and METEOR scores on all datasets, plus the best ROUGE-L on TCGA-BRCA and MICCAI REG. On TCGA-BRCA, it reaches 0.436/0.303/0.202/0.156 BLEU-1/2/3/4 and 0.204 METEOR; on REG 2025, it achieves 0.865/0.834/0.805/0.780 and 0.568. These results support progressive contextual conditioning for grounded pathology report generation.
[CV-183] ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text
【速读】:该论文旨在解决生成式图像编辑中用户难以同时实现精确空间布局与具体语义细节控制的问题。当前模型在处理自然语言指令时虽能表达纹理、颜色等高层语义,但缺乏空间精度;而自由手绘涂鸦虽提供粗略的空间边界,却无法传达详细的视觉属性。为此,作者提出ScribbleEdit——一个大规模合成数据集,通过自动化的图像修复(inpainting)流程生成源-目标图像对,并配以人工绘制的涂鸦和基于视觉语言模型(VLM)生成的文本指令,从而构建多模态输入的训练样本。解决方案的关键在于利用该合成数据集对扩散模型和自回归统一多模态图像编辑模型进行微调,显著提升模型对抽象涂鸦与文本联合理解的能力,实现空间对齐且语义一致的可控编辑效果。
链接: https://arxiv.org/abs/2605.01135
作者: Anya Ji,George Ma,Téa Wright,Yiming Zhang,David Chan,Alane Suhr,Somayeh Sojoudi
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent progress in generative models has significantly advanced image editing capabilities, yet precise and intuitive user control remains difficult. Specifically, users often struggle to communicate both exact spatial layouts and specific semantic details simultaneously. While natural language instructions effectively convey high-level semantics like texture and color, they lack spatial specificity. Conversely, freehand scribbles provide rough spatial boundaries but cannot express detailed visual attributes. Consequently, achieving precise control requires combining both modalities. However, existing models struggle to jointly interpret abstract scribbles alongside text due to a lack of specialized training data. In this work, we introduce ScribbleEdit, a large-scale synthetic dataset designed to bridge this gap by combining natural language instructions with freehand scribble inputs for more accurate, controllable edits. We construct this dataset through a synthetic pipeline that automatically generates source-target image pairs via inpainting, which are then paired with human-drawn scribbles and VLM-generated text instructions. Using ScribbleEdit, we evaluate and finetune both diffusion-based and autoregressive unified multimodal image editing models. Our experiments reveal that while off-the-shelf models struggle with abstract scribble inputs, finetuning on our synthetic dataset significantly improves their ability to generate spatially aligned and semantically consistent edits. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.01135 [cs.CV] (or arXiv:2605.01135v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.01135 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-184] Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在生成过程中可能因恶意输入而产生不安全内容(如NSFW图像)的安全问题,尤其针对现有基于文本或图像的二元过滤机制易受对抗攻击和误报率高的缺陷。其解决方案的关键在于提出一种名为“纪律扩散”(Disciplined Diffusion, DDiffusion)的新方法:首先通过语义检索机制对提示词嵌入进行隐式恶意语义识别,而非依赖脆弱的成对相似性匹配;其次在扩散过程中引入定位机制,仅对生成图像中有害区域进行局部编辑,从而实现对恶意内容的有效抑制,同时保留良性提示的生成保真度,并避免传统方案中引发对抗攻击的全局允许/拒绝反馈信号。
链接: https://arxiv.org/abs/2605.01113
作者: Chi Zhang,Changjia Zhu,Xiaowen Li,Yao Liu,Zhuo Lu
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image (T2I) diffusion models have the ability to build high-quality pictures from text prompts, but they pose safety concerns because they can generate offensive or disturbing imagery when provided with harmful inputs. Existing safety filters typically rely on text-based classifiers or image-based checkers that completely block the output upon detecting a threat, issuing an explicit allow/block feedback signal to the user. This binary strategy leaves models vulnerable to adversarial attacks that alter keywords to bypass detection, and it causes high false-alarm rates that degrade the experience for benign users. To address such vulnerabilities, we propose Disciplined Diffusion (DDiffusion), a novel robust text-to-image diffusion that counters Not Safe For Work (NSFW) generation by uncovering implicit malicious semantics in prompt embeddings. DDiffusion leverages a semantic retrieval mechanism to evaluate prompts against concept distributions rather than relying on brittle pairwise similarity. Furthermore, it employs a localization method during the diffusion process to selectively edit only the harmful regions of the generated image. By returning locally sanitized images instead of applying uniform blocking, DDiffusion suppresses malicious content while preserving generation fidelity for benign prompts and avoiding the binary allow-deny signal on which existing probing attacks rely.
[CV-185] Almost for Free: Crafting Adversarial Examples with Convolutional Image Filters
【速读】:该论文旨在解决生成对抗样本(adversarial examples)时对梯度依赖性强、计算复杂度高以及参数量大的问题。传统方法通常需要访问模型梯度或通过大量查询来近似梯度,而本文提出了一种更简洁高效的解决方案:基于可解释机器学习的洞察,设计出基于经典边缘检测算法的对抗图像滤波器(adversarial image filters),并通过优化使其能够欺骗神经网络分类器。其关键创新在于利用3×3小尺寸滤波器实现无目标攻击(untargeted attack),仅需单次输入遍历即可达到30%–80%的成功率,且具有良好的迁移性(transferability)。相比基于生成模型的方法,该方案将参数数量减少五个数量级,显著提升了攻击效率,同时揭示了神经网络对恶意噪声的高度脆弱性。
链接: https://arxiv.org/abs/2605.01098
作者: Alexander Warnecke,Konrad Rieck
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Adversarial examples in machine learning are typically generated using gradients, obtained either directly through access to the model or approximated via queries to it. In this paper, we propose a much simpler approach to craft adversarial examples, drawing inspiration from insights of explainable machine learning. In particular, we design \emphadversarial image filters that are based on classic edge detection algorithms but optimized to deceive learning models. The resulting untargeted attacks are transferable and require only a single pass over the input. Empirically, we find that 3x3 filters already enable success rates between 30% and 80% on different neural networks. Compared to related approaches using generative models for crafting adversarial examples, we reduce the number of parameters by five orders of magnitude, resulting in a very efficient attack. When investigating the parameters of the learned filters, we observe interesting properties such as a high transferability between models and structures common to classic image filters. Our results provide further insights into the vulnerability of neural networks and their fragility to malicious noise.
[CV-186] Patient-Specific Optimization for Mandibular Reconstruction Planning with Enhanced Bone Union
【速读】:该论文旨在解决下颌骨重建中因供区与受区骨端不愈合(donor-host nonunion)导致的临床难题,现有虚拟手术规划仅提供几何构型而未明确优化骨愈合条件。其解决方案的关键在于提出OsteoOpt++——一种从影像到决策的闭环规划流程:首先通过模板-患者配准和CT衍生的肌肉及颞下颌关节参数更新构建个性化数字孪生模型;随后利用贝叶斯优化结合期望改进增益(expected-improvement-plus)采集函数,在六个临床可控变量(切口平面与供骨位置)上搜索最优配置,目标函数以骨端贴合度驱动,并引入安全因子正则化约束。该方法在通用缺损和患者特异性病例中均显著提升骨端贴合度,且对建模参数具有鲁棒性,最终为术前提供可预测改善骨愈合条件的切口方向与供骨定位建议。
链接: https://arxiv.org/abs/2605.01084
作者: Hamidreza Aftabi,John E. Lloyd,Amanda Ding,Benedikt Sagl,Eitan Prisman,Antony Hodgson,Sidney Fels
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mandibular reconstruction with vascularized bone grafts is complicated by donor-host nonunion, and current virtual surgical planning produces a geometric plan rather than a configuration that explicitly promotes bone union. We present OsteoOpt++, an image-to-decision planning loop for patient-specific mandibular reconstruction. A pre-operative computed tomography (CT) is converted into a personalized digital twin through template-to-patient registration and CT-derived updates of the muscle and temporomandibular-joint parameters. Bayesian optimization with an expected-improvement-plus acquisition rule then searches six clinically controllable cut-plane and donor-positioning variables under an apposition-driven objective and a safety-factor-regularized variant. The workflow was evaluated on three generic defects (body, symphysis, and ramus-body) and a total of 3+1 patient-specific cases, with 3 used for optimization and 1 for validation. In the generic cases, against a common surgical approach, cycle-averaged donor-mandible apposition increased by up to 29 percentage points (329% relative); in the patient-specific cases, against the surgeon-implemented day-5 post-operative configuration, by up to 26 percentage points. A 10% sensitivity analysis over eleven modeling parameters capped the change in the apposition-driven objective at 3% for generic cases and 4% for patient-specific cases, and the longitudinal case showed Dice overlap of 0.70 and 0.76 between predicted apposition and year-1 bone formation. Clinically, this provides surgeons with a pre-operative, image-driven recommendation for cut-plane orientation and donor placement that is predicted to improve union conditions over the configurations currently delivered in the operating room. The optimization and patient-specific modeling code is open source at this https URL.
[CV-187] WILD SAM: A Simulated-and-Real Data Augmentation for Autonomous Driving Perception under Challenging Weather
【速读】:该论文旨在解决恶劣天气条件下目标检测器性能显著下降的问题,即由于天气变化导致的域偏移(domain shift),这对自动驾驶车辆的安全性构成严重威胁。现有方法多依赖合成数据训练,限制了实际应用效果;而伪标签(pseudo-labeling)虽在跨数据集域自适应中广泛应用,却因恶劣天气下生成标签噪声较大而未被有效利用。论文提出两种解决方案:其一为Weather-Induced pseudo Label Denoising (WILD) 框架,通过过滤真实恶劣天气数据生成的噪声伪标签来提升标签质量;其二为WILD SAM混合训练方法,结合伪标签去噪与基于仿真的训练策略,并充分利用目标域的真实恶劣天气数据。关键创新在于首次将伪标签去噪机制引入天气域自适应场景,从而显著改善模型在雨雪等复杂环境下的检测性能,实验表明AP最高提升13%,有效缩小了天气诱导的性能差距。
链接: https://arxiv.org/abs/2605.01081
作者: Hamed Khatounabadi,Xiaohu Lu,Hayder Radha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The performance of state-of-the-art object detectors degrades significantly under adverse weather, causing a safety-critical domain shift problem for autonomous vehicles. Recent efforts address this problem by relying on synthetic data to train the object detectors, which limits their real-world applicability. Meanwhile, pseudo-labeling is widely used for cross-dataset domain adaptation problems. However, these methods have not been exploited by weather-based domain adaptation approaches due to the noisy nature of such labels generated under harsh weather conditions. In this paper, we propose two new approaches to mitigate this weather-induced domain shift. First, we propose a Weather-Induced pseudo Label Denoising (WILD) framework that filters noisy pseudo labels generated by real data captured under adverse weather conditions. Second, we develop a novel hybrid training methodology, WILD SAM, that exploits both pseudo-label denoising and simulation-based training solutions while using real-data from the target harsh-weather domain. We validate both proposed approaches, WILD and WILD SAM, on the recently released Four Seasons dataset across rainy and snowy scenarios. Experiments show that the proposed frameworks improve Average Precision (AP) up to 13% and significantly reduce the weather-induced performance gap relative to the baseline. The code is available at: this https URL
[CV-188] Neighbor2Inverse: Self-Supervised Denoising for Low-Dose Region-of-Interest Phase Contrast CT
【速读】:该论文旨在解决传播型X射线相位对比成像(Propagation-based X-ray phase-contrast imaging, PBI)在临床转化中因辐射剂量降低导致图像噪声增加的问题。传统监督式去噪方法依赖于配对的低剂量与高剂量数据集,而此类数据在实际中难以获取;现有自监督方法虽避免了这一限制,却未充分适配PBI-CT的逆问题特性。本文提出一种名为Neighbor2Inverse的新型自监督去噪框架,其核心创新在于基于“Neighbor2Neighbor”原理:将每张噪声投影图像采样为两个保留结构信息但含独立噪声实现的子图,分别重建后形成图像对,直接在图像域训练去噪网络。该方法无需配对数据即可有效抑制噪声并保持细粒度结构细节,在区域感兴趣PBI-CT实验中显著提升信噪比、空间分辨率及综合图像质量指标,并在模拟低剂量条件下的临床CT数据上展现出竞争力。
链接: https://arxiv.org/abs/2605.01075
作者: Johannes B. Thalhammer,Lorenzo D’Amico,Lucy Costello,Sebastian Peterhansl,Daniel Frey,Tina Dorosti,Florian Schaff,Jannis Ahlers,Ronan Smith,Marcus Kitchen,Franz Pfeiffer,Martin Donnelley,Daniela Pfeiffer,Kaye S. Morgan
机构: TU Munich (慕尼黑工业大学); Monash University (莫纳什大学); Adelaide University (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Propagation-based X-ray phase-contrast imaging (PBI) enables high-contrast visualization of lung structures and holds strong medical potential. However, safe translation to the clinic will require a substantial radiation dose reduction, which inevitably increases image noise. Supervised convolutional-neural-network-based denoising can restore image quality but depends on paired low- and high-dose datasets, which are rarely available in practice. Self-supervised methods avoid this limitation, yet most are not well adapted to the inverse problem of PBI computed tomography (CT). We introduce Neighbor2Inverse, a self-supervised denoising framework designed for low-dose PBI-CT that generalizes to clinical CT. Building on the Neighbor2Neighbor principle, each noisy projection is subsampled into two variants that preserve structural information but contain independent noise realizations. These are reconstructed separately, and the resulting pairs are used to train a denoising network directly in the image domain. We benchmark the proposed method against established analytical and self-supervised denoising approaches. In region-of-interest PBI CT experiments, Neighbor2Inverse achieves superior noise suppression while preserving fine structural details, as demonstrated by improved contrast-to-noise ratio, spatial resolution, and composite image quality metrics. Competitive performance is also observed on clinical CT data under simulated low-dose conditions. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Code, data, and interactive figures are available at this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.01075 [cs.CV] (or arXiv:2605.01075v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.01075 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-189] GEODE: Angle-Adaptive OOD Detection with Universal Scorer Compatibility
【速读】:该论文旨在解决现有训练型异常检测方法(如Outlier Exposure, OE)中存在的 scorer-dependent tradeoffs 问题,即在不同评分器(如MSP、KNN)上表现不一致,且依赖于人工标注的辅助数据。其核心发现是:OE 的有效性源于其特征空间中样本分布与近域异常数据(near-OOD data)处于相同的几何位置,尤其是边界邻近四分位数区域贡献了几乎全部性能提升——这本质上是一种边界校准(boundary calibration),而非覆盖范围扩展。为此,作者提出 GEODE(GEOmetry-preserving DEtection),通过引入一种角度自适应范数损失(angle-adaptive norm loss),使目标值根据每个样本与最近类别均值的余弦相似度动态调整,从而在保持特征几何结构的前提下实现边界校准。四个基于神经坍缩(neural collapse)理论的定理支撑了该设计,GEODE 在 CIFAR-10 上对所有七种标准评分器均表现优异(近域 OOD AUROC 达 89.0–92.3,远域达 93.05),且无需额外数据,在多个基准测试中超越 OE 和传统交叉熵(CE)训练方法。
链接: https://arxiv.org/abs/2605.01063
作者: Bruno Abrahao
机构: NYU Shanghai; Leonard N. Stern School of Business, NYU
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Outlier Exposure (OE) is among the strongest training-based OOD detectors on standard benchmarks but exhibits scorer-dependent tradeoffs (e.g., strong on MSP, weak on KNN) and requires curated auxiliary data. We show why OE works: its features sit at the same geometric locus as real near-OOD data, with the boundary-adjacent quartile driving nearly all of OE’s gain. OE is boundary calibration, not OOD coverage. GEODE (GEOmetry-preserving DEtection) replicates this calibration synthetically through an angle-adaptive norm loss in which targets scale per-sample with cosine similarity to the nearest class mean, preserving feature geometry where boundary structure matters. Four theorems grounded in neural collapse justify the design. GEODE works across all seven standard scorers on CIFAR-10 (near-OOD AUROC 89.0-92.3, far-OOD reaching 93.05; no catastrophic failure on any scorer). Since the OOD regime is unknown at deployment, this is the test that matters. GEODE outperforms vanilla CE at matched epoch counts. Combined with OE, GEODE reaches 95.0 MSP / 94.8 KNN on CIFAR-10 and beats OE on every scorer on CIFAR-100. The gains hold on WRN-28-10 (+4.5 Energy, 3 seeds). Unlike methods that push OOD into the classifier null space (e.g., PFS, 14.38 KNN AUROC, worse than random), GEODE’s adaptive target preserves the geometry that distance-based scorers depend on.
[CV-190] InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene CVPR2026
【速读】:该论文旨在解决动态场景中物理感知的人体运动合成问题,现有方法因接触建模能力有限(通常仅限于手部)而难以生成符合物理规律的运动。其解决方案的关键在于提出了一种显式建模人体相关力全谱的框架,包括人-物体、人-场景以及内部身体动力学,并通过软物理约束确保力与力矩平衡,从而实现物理合理的运动生成;此外,还引入一种新颖的基于连续距离的力模型,将接触建模推广至任意表面,有效捕捉与静态环境及动态移动物体之间的交互关系。
链接: https://arxiv.org/abs/2605.01036
作者: Chaoyue Xing,Wei Mao,Miaomiao Liu
机构: Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026
Abstract:This paper tackles the problem of physics-aware human motion synthesis in a dynamic scene. Unlike existing works which mainly tend to generate physically unrealistic motions due to limited contact modeling, typically restricted to hands, in this paper, we introduce a physics-aware human motion generation framework that explicitly models the full spectrum of human-related forces, including human-object, human-scene, and internal body dynamics.~Our method imposes soft physical constraints to maintain force and torque balance, ensuring physically grounded motion synthesis. We further propose a novel continuous distance-based force model that generalizes contact modeling to arbitrary surfaces, capturing interactions not only with static environments but also with dynamic, moving objects. Extensive experiments show that our approach significantly improves physical plausibility and generalizes well to complex scenes, setting a new benchmark for physically consistent human motion generation.
[CV-191] EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MER)中,当不同模态信息存在冲突或缺失时,多模态大语言模型(Multimodal Large Language Models, MLLM)内部决策机制不透明、易产生偏差的问题。核心挑战在于MLLM在面对高冗余视频特征和模态偏好时,会显著弱化视频模态的贡献,导致“视频贡献崩溃”(Video Contribution Collapse, VCC)现象。解决方案的关键是提出一种轻量级的推理时注意力调控机制——冲突感知头级注意力引导(Conflict-aware Head-level Attention Steering, CHASE),该机制能够检测模态冲突并动态调整注意力权重,在无需重新训练主干模型的前提下有效缓解决策偏差,从而提升MLLM在复杂情感场景下的可靠性与鲁棒性。
链接: https://arxiv.org/abs/2605.01024
作者: Yueru Sun,Yimeng Zhang,Haoyu Gu,Nuo Chen,Dong She,Xianrong Yao,Yang Gao,Zhanpeng Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Emotion Recognition (MER) is critical for interpreting real-world interactions. While Multimodal Large Language Models (MLLM) have shown promise in MER, their internal decision-making mechanisms under modality conflict and missingness remain largely underexplored. In this paper, to systematically investigate these behaviors, we introduce EmoMM, a comprehensive benchmark featuring modality-aligned, conflict, and missing subsets. Through extensive evaluation, we uncover a Video Contribution Collapse (VCC) phenomenon, where MLLM marginalize video evidence due to high token redundancy and modality preferences. To address this, we propose Conflict-aware Head-level Attention Steering (CHASE), a lightweight mechanism that detects modality conflicts and performs inference-time attention steering, effectively mitigating decision bias without retraining the backbone. Experimental results demonstrate that CHASE consistently improves performance across various settings, significantly enhancing the reliability of MLLM in complex affective scenarios.
[CV-192] WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
【速读】:该论文旨在解决当前多模态基础模型在真实场景下表格图像理解能力评估不足的问题。现有评测大多依赖结构化文本表格或干净渲染的图像,未能充分覆盖现实世界中表格图像所具有的复杂布局与多样领域特性,导致模型在结构感知和数值推理方面的局限性未被有效揭示。解决方案的关键在于构建首个面向自然场景表格图像的问答基准WildTableBench,其包含402张高信息密度的真实表格图像及928个手工标注并验证的问题,涵盖17种子类型和5个类别。通过在此基准上对21个前沿多模态基础模型进行系统评估,发现仅有1个模型准确率超过50%,其余模型表现显著偏低,从而揭示了模型在结构感知和推理上的持续性缺陷,为后续研究提供了诊断工具和改进方向。
链接: https://arxiv.org/abs/2605.01018
作者: Junzhe Huang,Xiaoxiao Sun,Yan Yang,Yuxuan Hou,Ruotian Zhang,Sirui Li,Hehe Fan,Serena Yeung-Levy,Xin Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding.
[CV-193] Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching
【速读】:该论文旨在解决生成式模型中流匹配(Flow Matching)方法的样本不确定性量化问题,这是当前生成建模领域尚未充分解决的挑战。现有方法通常需要重新训练模型以引入辅助方差头、维护昂贵的集成模型,或在多个积分步骤中传播近似协方差,从而在训练成本、推理开销与精度之间做出权衡。本文的关键创新在于提出“发散-不确定性恒等式”(divergence-uncertainty identity),证明了对于任意预训练的流匹配速度场,给定当前状态时干净数据后验协方差的迹,可精确表示为速度场散度的闭式表达,仅依赖于已知的时间相关前因子和一个常数项;矩阵形式则完全由速度场的雅可比矩阵决定。这一恒等式是精确且事后的(post-hoc),无需重新训练或修改模型架构即可直接计算不确定性,尤其对单步生成器如MeanFlow而言,可在一次前向传播中获得端到端的生成不确定性,彻底避免了以往方法所需的多步方差传播过程,显著提升效率并保持准确性。
链接: https://arxiv.org/abs/2605.00941
作者: Jiarui Xing,Song Wang,Jian Wang
机构: Yale University (耶鲁大学); University of Central Florida (中佛罗里达大学); Boston Children’s Hospital, Harvard Medical School (波士顿儿童医院,哈佛医学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 Pages, 5 figures
Abstract:Flow matching has become a leading framework for generative modeling, but quantifying the uncertainty of its samples remains an open problem. Existing approaches retrain the model with auxiliary variance heads, maintain costly ensembles, or propagate approximate covariance through many integration steps, trading off training cost, inference cost, or accuracy. We show that none of these trade-offs is necessary. We prove that, for any pre-trained flow matching velocity field, the trace of the posterior covariance over the clean data given the current state equals, in closed form, the divergence of the velocity field, up to a known time-dependent prefactor and an additive constant. We call this the \emphdivergence-uncertainty identity for flow matching. The matrix-level form of the identity is similarly closed-form, depending solely on the velocity Jacobian. Because the identity is exact and post-hoc, it is computable on any pre-trained flow matching model, with no retraining and no architectural modification. For one-step generators such as MeanFlow, the same identity yields the exact end-to-end generation uncertainty in a single forward pass, eliminating the multi-step variance propagation required by all prior methods. Experiments on MNIST confirm that the resulting per-pixel uncertainty maps are semantically meaningful, concentrating on digit boundaries where inter-sample variation is highest, and that the scalar uncertainty score tracks actual prediction error, all at roughly 10,000 \times less total compute than ensembling or Monte Carlo dropout.
[CV-194] Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding ICML2026
【速读】:该论文旨在解决扩散模型(Diffusion Models)中 timestep embedding(时间步嵌入)这一关键组件在安全性与可追溯性方面长期被忽视的问题,尤其是其作为潜在侧信道(side channel)可能被恶意注入信息的风险。解决方案的核心是提出一种名为 Shadow Timestep Embedding (STE) 的新机制,通过理论分析揭示不同时间步具有差异化的表征能力,能够编码侧信道信息,并利用调度器接口(scheduler interface)实现攻击或防御目的。研究进一步从位置编码映射的角度对 timestep embedding 进行建模,并推导出互相干性评估指标以解释不相交时间区间间的可分离性,从而证明时间维度是扩散模型中一个强大的信息载体,为对抗性生成建模开辟了新的方向。
链接: https://arxiv.org/abs/2605.00935
作者: An Huang,Junggab Son,Zuobin Xiong
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, accepted to ICML 2026
Abstract:Diffusion models have become the foundation of modern generative systems, with most research focusing primarily on improving generation efficiency and output quality. The timestep embedding component is a crucial part of the diffusion pipeline, which provides a temporal conditioning signal to the denoising network, enabling it to adapt its predictions across different noise levels throughout the process. Despite their potential to contain substantial information, timestep embeddings remain underexplored in current research, especially for security risks and reliable provenance. To fill this gap, we introduce Shadow Timestep Embedding (STE), a novel mechanism that investigates the underutilized temporal space for malicious information injection into diffusion models. In particular, when zooming in on the timestep embedding space, we find that different timesteps exhibit distinct representational capabilities that can encode side-channel information. Moreover, such encoded information can be utilized for attack and defense purposes through the scheduler interface. We present a theoretical analysis of timestep embeddings as position-encoding mappings and derive a mutual coherence evaluation that explains the separability of disjoint timestep intervals. Our findings reveal the diffusion model’s timestep as a powerful side channel for carrying dedicated information, motivating new directions for adversarial generative modeling by understanding the temporal dimension.
[CV-195] Linking spatial biology and clinical histology via Haiku
【速读】:该论文旨在解决多模态生物医学数据(分子、形态学和临床信息)在基础与转化研究中难以系统整合建模的问题。其核心解决方案是提出Haiku模型,一个基于多光谱免疫荧光(mIF)数据训练的三模态对比学习框架,能够将空间蛋白质组学图像、HE组织病理学图像与临床元数据对齐至共享嵌入空间,实现跨模态检索、下游分类预测及零样本生物标志物推断。关键创新在于通过大规模数据集(2670万空间蛋白组补丁,覆盖11种器官类型)构建统一表征,并引入反事实预测机制,仅改变临床元数据即可揭示特定组织微环境中与疾病进展相关的分子变化模式,从而实现从结构到功能的生物学探索。
链接: https://arxiv.org/abs/2605.00925
作者: Yan Cui,Jacob S. Leiby,Wenhui Lei,Dokyoon Kim,Yanxiang Deng,Aaron T. Mayer,Zhenqin Wu,Alexandro E. Trevino,Zhi Huang
机构: University of Pennsylvania (宾夕法尼亚大学); Enable Medicine (Enable Medicine)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:
Abstract:Integrating molecular, morphological, and clinical data is essential for basic and translational biomedical research, yet systematic frameworks for jointly modeling these modalities remain limited. Here we present Haiku, a tri-modal contrastive learning model trained on multiplexed immunofluorescence (mIF). It comprises 26.7 million spatial proteomics patches from 3,218 tissue sections across 1,606 patients spanning 11 organ types, with matched hematoxylin and eosin (HE) histology and clinical metadata aligned in a shared embedding space. Haiku enables three-way cross-modal retrieval, improves downstream classification and clinical prediction tasks over unimodal baselines, and supports zero-shot biomarker inference through fusion retrieval conditioned on clinical metadata-only text descriptions. Across tasks, Haiku outperforms competing approaches, achieving cross-modal retrieval (Recall@50 up to 0.611 versus near-zero baseline), survival prediction (C-index 0.737, +7.91% relative improvement), and zero-shot biomarker inference (mean Pearson correlation 0.718 across 52 biomarkers). Furthermore, we introduce a counterfactual prediction framework in which modifying only clinical metadata while fixing tissue morphology surfaces niche-specific molecular shifts associated with breast cancer stage progression and lung cancer survival outcomes. In a lung adenocarcinoma case study, the counterfactual analysis recovers niche-specific shifts characterized by increased CD8 and granzyme B, reduced PD-L1, and decreased Ki67, broadly consistent with patterns reported for favorable outcomes. We present these counterfactual results as exploratory, hypothesis-generating signals rather than mechanistic claims. These capabilities demonstrate that tri-modal alignment via Haiku enables integrative analysis of spatial biology, bridging molecular measurements with clinical context for biological exploration.
[CV-196] SAMamba3D: adapting Segment Anything for generalizable 3D segmentation of multiphase pore-scale images
【速读】:该论文旨在解决多相孔隙尺度X射线图像中分割方法泛化能力差的问题,即现有3D分割方法通常依赖特定数据集,当岩石类型、流体分布、扫描设备或成像条件变化时需重新训练或大量微调,导致效率低下且难以推广。其解决方案的关键在于提出SAMamba3D框架,通过将大体冻结的Segment Anything Model(SAM)编码器与基于Mamba的体积上下文建模及渐进式跨尺度特征交互相结合,实现对不同岩性、流体性质和扫描条件下的通用3D孔隙尺度分割,显著降低对特定场景再训练的需求,同时保持物理上合理的流体饱和度、连通性和界面形貌等描述符,从而提升大规模多相3D图像分析的可靠性与效率。
链接: https://arxiv.org/abs/2605.00916
作者: Rui Zhang,Xianzhi Song,Linqi Zhu,Branko Bijeljic,Gensheng Li,Martin J. Blunt
机构: Imperial College London (帝国理工学院); China University of Petroleum (Beijing) (中国石油大学(北京)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL
Abstract:Reliable segmentation of multiphase pore-scale X-ray images of rocks is necessary to quantify fluid saturation, connectivity, and interfacial geometry. However, current 3D segmentation methods are typically dataset-specific, requiring retraining or extensive fine-tuning whenever rock type, fluid pattern, scanner, or acquisition conditions change. Foundation models such as the Segment Anything Model (SAM) provide strong 2D boundary priors, but they are not directly applicable to 3D data. We present SAMamba3D, a parameter-efficient framework that adapts a largely frozen SAM encoder to generalizable 3D pore-scale segmentation by coupling it with Mamba-based volumetric context modeling and progressive cross-scale feature interaction. For sandstone and carbonate datasets, with different fluids, wettability, and scanning conditions, SAMamba3D matches or outperforms current 3D baselines while reducing the need for case-specific retraining. The resulting segmented images preserve physically meaningful descriptors, including fluid saturation, connectivity, and interface morphology, enabling more reliable and rapid analysis of large 3D multiphase images. Comments: Code available at this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.00916 [cs.CV] (or arXiv:2605.00916v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.00916 Focus to learn more arXiv-issued DOI via DataCite
[CV-197] Rethink MAE with Linear Time-Invariant Dynamics
【速读】:该论文旨在解决现有视觉模型表征探测方法中忽视token顺序信息的问题,传统方法如全局平均池化(Global Average Pooling, GAP)或CLS token提取依赖于置换不变操作,将patch表示视为无序集合,从而忽略了token序列顺序所蕴含的重要结构信息。其解决方案的关键在于提出SSMProbe框架,该框架基于状态空间模型(State Space Model, SSM),利用SSM作为具有记忆衰减特性的离散线性时不变(Linear Time-Invariant, LTI)动力系统,对token顺序敏感,能够有效捕捉并利用预训练视觉模型中token排列的语义差异。通过将token排序建模为信息调度问题,并比较固定扫描策略与可微软置换(基于Sinkhorn算法学习)的效果,研究发现:学习到的软排列能显著提升对高度局部化patch特征的利用效率,揭示了不同预训练目标导致token结构在顺序依赖性上的本质差异,为视觉表征分析提供了新的诊断工具。
链接: https://arxiv.org/abs/2605.00915
作者: Zice Wang
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Standard representation probing for visual models relies on mathematically permutation-invariant operations like Global Average Pooling (GAP) or CLS tokens, treating patch representations as an unstructured bag-of-words. We challenge this paradigm by demonstrating that token order is a critical, exploitable dimension in frozen visual representations (e.g., MAE, BEiT, DINOv2, and ViT as CLS-ablation extreme). We propose SSMProbe, a probing framework driven by a State Space Model (SSM). Operating as discrete Linear Time-Invariant (LTI) dynamical systems, SSMs act as permutation-sensitive probes where sequence order strictly dictates the final state due to inherent memory decay. Formulating token ordering as an information scheduling problem, we compare fixed scan heuristics against a differentiable soft permutation (Sinkhorn-based) learned from downstream supervision. Evaluations on standard and fine-grained classification benchmarks reveal a striking order gap: while fixed scans fail dramatically on highly localized patch features, our learned soft permutation successfully extracts highly competitive performance from otherwise heavily localized patch sequences. We find that pre-training objectives fundamentally shape token structure: DINOv2 concentrates global semantics in optimized CLS tokens leaving patches hyperspecialized, pure MAE preserves distributed representations with heterogeneous patch informativeness, and ViT represents a supervised CLS-dominated extreme. BEiT occupies middle ground. This heterogeneity is order-dependent – meaning the SSM probe’s performance depends critically on which tokens are placed at which temporal positions – and is not merely a topological property of the spatial grid. SSMProbe’s learned routing effectively discovers and exploits this heterogeneity, offering a powerful new diagnostic lens for visual representation analysis.
[CV-198] Leverag ing Imperfect Medical Data: A Manifold-Consistent Spatio-Temporal Network for Sensor-based Human Activity Recognition
【速读】:该论文旨在解决可穿戴传感器在真实医疗物联网(IoMT)场景中因数据缺失、传感器故障和环境噪声导致的传感信号退化问题,此类不完美感知条件显著削弱了传统深度学习模型在人体活动识别(HAR)任务中的性能。其解决方案的关键在于提出一种流形一致性时空网络(MCSTN),通过双层退化建模机制模拟物理级退化与扩散驱动的连续退化,并强制多退化视图下的表征一致性,从而学习稳定且对退化不变的语义表示;同时设计双流时空架构,显式分离时间动态建模与空间相关性学习,分别由时序流捕捉长期活动动态、空间流建模跨传感器关系,实现更有效的时空特征提取。
链接: https://arxiv.org/abs/2605.00913
作者: Jiangtao Fan,Anish Jindal,Amir Atapour-Abarghouei
机构: Durham University (杜伦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Sensor-based Human Activity Recognition (HAR) has attracted increasing attention in medical and healthcare monitoring, particularly with the growth of Internet of Medical Things (IoMT). However, in real-world wearable sensing scenarios, IoMT signals are often corrupted by missing measurements, sensor failures, and environmental noise, which significantly degrade the performance of conventional deep learning models that assume clean and complete inputs. To address this challenge, we propose a Manifold-Consistent Spatio-Temporal Network (MCSTN) for robust HAR under imperfect sensing conditions. The proposed framework introduces a dual-level corruption modeling mechanism that simulates realistic sensor imperfections through both physical-level corruption and diffusion-driven continuous corruption. By enforcing representation consistency across multiple corrupted views, the model learns stable and corruption-invariant semantic representations. Furthermore, we design a dual-stream spatio-temporal architecture that explicitly decouples temporal dynamics modeling and spatial correlation learning. The temporal stream captures long-term activity dynamics, while the spatial stream models inter-sensor relationships, enabling more effective spatio-temporal representation learning. Extensive experiments on three widely used HAR benchmark datasets, PAMAP2, Opportunity, and WISDM, demonstrate that the proposed MCSTN achieves competitive performance compared with existing state-of-the-art methods, particularly under imperfect sensing conditions. These results validate the effectiveness and robustness of the proposed framework for real-world wearable IoMT sensing applications.
[CV-199] Object-Level Explanations for Image Geolocation Models: a GeoGuessr use-case
【速读】:该论文旨在解决图像地理定位模型(image geolocation models)在决策过程中是否依赖于具体视觉对象层面证据的问题。由于传统归因方法(如Grad-CAM)通常生成模糊区域而非可识别的物体,难以将模型预测与特定物体或感知模式关联,因此其内部机制缺乏可解释性。解决方案的关键在于提出一种面向对象的分析流程:从归因图出发,提取显著区域并将其分割为类物体元素,再通过删除和插入测试评估这些元素对模型预测的贡献,从而验证归因引导裁剪区域相较于随机区域具有更高的预测信息保留能力。实验表明,该方法能够将归因图分解为可解释且可感知的视觉单元,为实现地理定位模型的对象级解释提供了有效路径。
链接: https://arxiv.org/abs/2605.00912
作者: Emilie Durrieu,Christophe Hurter,Philippe Muller,Victor Boutin
机构: ENAC, University of Toulouse; IRIT, University of Toulouse, ANITI; CNRS
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:When humans play geolocation games such as GeoGuessr, they rely on concrete visual cues, such as road markings, vegetation, or architectural details, to infer where an image was captured. Whether image geolocation models rely on similar object-level evidence remains difficult to determine, as attribution methods like Grad-CAM typically highlight diffuse regions rather than coherent visual entities, making it difficult to link model predictions to specific objects or perceptible patterns. In this work, we propose an object-centric analysis pipeline to investigate the visual evidence used by geolocation models. Starting from attribution maps, we extract salient regions and segment them into object-like elements. We evaluate their predictive relevance through deletion and insertion tests, comparing attributionguided crops to randomly selected regions with similar coverage. Experiments on a three-country benchmark show that attribution-guided crops consistently retain more information for the model’s prediction than random crops. These results suggest that attribution maps can be decomposed into interpretable, perceptible elements, providing a step toward object-level analysis of geolocation models.
[CV-200] When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation
【速读】:该论文旨在解决工业级检索增强生成(Retrieval-Augmented Generation, RAG)系统中光学字符识别(Optical Character Recognition, OCR)模块的评估瓶颈问题,即现有基于字符级错误率(如词错误率 Word Error Rate, WER 或字符错误率 Character Error Rate, CER)的OCR基准无法有效反映真实场景下下游RAG任务的性能表现。其解决方案的关键在于构建了一个覆盖11类工业复杂文档类型的OCR基准测试集,涵盖极端版式、高分辨率页面、复杂背景、历史文档非标准阅读顺序、装饰性文本及含表格和数学公式等内容;并通过受控的OCR-first RAG流水线验证发现,即使OCR模型在传统指标上表现优异,仍可能因结构或语义层面的错误导致显著的检索失败,揭示了OCR准确率与下游RAG有效性之间存在不一致性,且该现象具有类别依赖性和稳定性。
链接: https://arxiv.org/abs/2605.00911
作者: Lin Sun,Wang Dexian,Jingang Huang,Linglin Zhang,Change Jia,Zhengwei Cheng,Xiangzheng Zhang
机构: Beijing Qiyuan Technology(北京奇元科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at this https URL.
[CV-201] Comparative Evaluation of Convolutional and Transformer-Based Detectors for Automated Weed Detection in Precision Agriculture
【速读】:该论文旨在解决在真实场景下早期杂草检测的准确性与计算效率之间的权衡问题。其解决方案的关键在于对比卷积神经网络(Convolutional Neural Networks, CNN)与基于Transformer的物体检测架构(如YOLOv26-nano、RTDETR和RF-DETR)在GROUNDBASED_WEED数据集上的性能表现,通过精度(precision)、召回率(recall)、平均精度(average precision)及推理速度等指标进行系统评估,从而为精准农业应用中模型选择提供量化依据。
链接: https://arxiv.org/abs/2605.00908
作者: Alcides Toledo Espinosa,Gerardo Antonio Álvarez Hernández,Ángel Eduardo Zamora-Suárez,Miguel Bolaños,Juan Irving Vásquez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures, and 1 table
Abstract:This paper presents a comparative evaluation of convolutional and transformer-based object detection architectures for early weed detection in realistic scenarios. Representative models from each paradigm are considered, including YOLOv26-nano, a recent variant of the YOLO family, and transformer-based approaches such as RTDETR and RF-DETR. Experiments were conducted on the GROUNDBASED_ WEED dataset, allowing performance to be evaluated in terms of detection accuracy and computational efficiency using metrics such as precision, recall, average precision, and inference speed. The results highlight a clear trade-off between efficiency and contextual modeling: CNN-based detectors achieve high performance at a lower computational cost, while transformer-based approaches offer better global context capture at the expense of higher resource demands. These results provide practical criteria for model selection in precision agriculture applications.
[CV-202] RIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)和多模态大模型(Multimodal Large Models, MLLMs)在交通领域应用中缺乏系统性、细粒度且工程对齐的评估基准问题。现有通用基准难以验证模型在规则遵循、可验证工程计算、交通场景可靠理解等方面的能力,而现有的交通专用基准则覆盖范围窄、不支持跨模态(文本、图像、点云)的精细化诊断。解决方案的关键在于提出TRIP-Evaluate——一个开放的多模态交通任务评估基准,其核心创新包括:基于角色-任务-知识的分类体系组织837个测试项,每项标注能力类别、模态类型与难度标签,从而实现从整体准确率到具体失败模式的逐层诊断;同时标准化测试项构建、质量控制、提示设计、解码策略与评分机制,提升不同模型间的可比性。该基准为模型选型、回归测试及更安全的交通场景部署提供了可复现、可诊断、工程导向的评估基线。
链接: https://arxiv.org/abs/2605.00907
作者: Han Gong,Zhen Zhou,Yunyang Shi,Yan Tan,Jinbiao Huo,Qi Hong,Zhiyuan Liu
机构: Southeast University (东南大学); Jiangnan University (江南大学); Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 12 figures
Abstract:Large language models (LLMs) and multimodal large models (MLLMs) are increasingly used for transportation tasks such as regulation question answering, traffic management support, engineering review, and autonomous-driving scene reasoning. Yet transportation workflows are rule-intensive, computation-intensive, safety-critical, and inherently multimodal. Existing general benchmarks provide limited evidence of whether a model can apply regulations correctly, perform verifiable engineering calculations, or interpret traffic scenes reliably, while the small number of public transportation benchmarks remain narrow in scope and rarely support fine-grained diagnosis across text, images, and point-cloud data. To address this gap, we present TRIP-Evaluate, an open multimodal benchmark for large models in transportation. The benchmark organizes 837 items using a role-task-knowledge taxonomy that covers vehicle, traffic-management, traveler, and planning-and-design functions. Each item is annotated with capability, modality, and difficulty labels, enabling diagnosis from overall accuracy down to specific failure modes. The current release includes 596 text items, 198 image items, and 43 point-cloud items. TRIP-Evaluate also standardizes item construction, quality control, prompting, decoding, and scoring to improve cross-model comparability. Results on a diverse panel of models show that text-based performance is improving, but substantial weaknesses remain in multi-step engineering calculation, rule-constrained reasoning, multimodal scene understanding, and point-cloud understanding. Overall, TRIP-Evaluate provides a reproducible, diagnosable, and engineering-aligned evaluation baseline for model selection, regression testing, and safer deployment in transportation applications.
[CV-203] Generalized Category Discovery under Domain Shifts: From Vision to Vision-Language Models
【速读】:该论文旨在解决**广义类别发现(Generalized Category Discovery, GCD)**在存在领域偏移(domain shifts)情况下的性能下降问题。传统GCD方法假设所有数据来自单一域,但在真实场景中,未标注数据常同时面临语义变化与领域差异,导致模型泛化能力受限。解决方案的关键在于提出三种基于不同基础模型(从自监督视觉模型到视觉-语言模型)的框架:(i) HiLo通过多级特征提取与互信息最小化实现领域与语义特征解耦,并结合PatchMix增强和课程采样策略;(ii) HLPrompt在此基础上引入语义感知的空间提示调优(spatial prompt tuning),以抑制背景和领域噪声;(iii) VLPrompt则利用视觉-语言模型,通过因子化文本提示(factorized textual prompts)和跨模态一致性正则化进一步提升鲁棒性。三者共享核心设计原则——显式分离领域与语义信息,但适配不同部署场景,实验表明其在合成扰动和真实多域场景下均显著优于强基线。
链接: https://arxiv.org/abs/2605.00906
作者: Hongjun Wang,Po Hu,Kai Han
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submission to TPAMI
Abstract:Generalized Category Discovery (GCD) aims to categorize unlabelled instances from both known and unknown classes by transferring knowledge from labelled data of known classes. Existing methods assume all data comes from a single domain, yet real-world unlabelled data often exhibits domain shifts alongside semantic shifts. We study GCD under domain shifts and propose three frameworks that adapt foundation models, ranging from self-supervised vision models to vision-language models. (i) HiLo disentangles domain and semantic features through multi-level feature extraction and mutual information minimization, combined with PatchMix augmentation and curriculum sampling. (ii) HLPrompt extends HiLo with semantic-aware spatial prompt tuning to suppress background and domain noise. (iii) VLPrompt leverages vision-language models via factorized textual prompts and cross-modal consistency regularization. The three methods share core design principles while operating on different foundation backbones, making them suitable for different deployment scenarios. Extensive experiments on synthetic corruptions and real-world multi-domain shifts demonstrate consistent improvements over strong baselines. Project page: this https URL
[CV-204] Robustness of Transformer-Based Fluence Map Prediction Under Clinically Realistic Perturbations
【速读】:该论文旨在解决生成式 AI (Generative AI) 在调强放射治疗(Intensity-Modulated Radiation Therapy, IMRT)中用于快速预测剂量分布和射野强度图(beamlet fluence maps)时,在面对真实世界分布偏移(distribution shifts)下的鲁棒性不足问题。其解决方案的关键在于构建一个两阶段的 Transformer 架构:第一阶段从解剖结构(CT 和轮廓)预测剂量分布,第二阶段进一步预测射野强度图;同时采用物理信息损失(physics-informed loss)约束能量守恒以提升模型对几何扰动、辐射噪声、训练数据减少及域偏移等场景的适应能力。实验表明,具有层次化注意力机制的 Transformer(如 SwinUNETR)在极端扰动下表现出更稳定的性能,且仅依赖结构相似性指数(SSIM)无法有效捕捉临床相关误差,强调了引入物理一致性评估的重要性。
链接: https://arxiv.org/abs/2605.00904
作者: Ujunwa Mgboh,Rafi Ibn Sultan,Joshua Kim,Kundan Thind,Dongxiao Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by The Artificial Intelligence in Medicine (AIME) 2026 Conference
Abstract:Learning-based fluence map prediction offers a fast alternative to iterative inverse planning in intensity-modulated radiation therapy (IMRT), but its robustness under realistic distribution shifts remains unclear. We study a two-stage transformer pipeline that maps anatomy (CT and contours) to dose and then to beamlet fluence maps. We compare fluence-stage transformer backbones with hierarchical, global, and hybrid attention, trained with a physics-informed loss enforcing energy consistency. Robustness is evaluated under geometric perturbations, radiometric noise, reduced training data, and domain shifts using a prostate IMRT dataset, with additional evaluation of the dose stage on public datasets. Results show smooth degradation under moderate perturbations but sharp failures under severe rotations and noise. Hierarchical transformers (e.g., SwinUNETR) exhibit slower growth in upper-quartile energy error, indicating improved robustness. We further show that SSIM alone fails to capture clinically relevant errors, highlighting the need for physics-informed evaluation.
[CV-205] A Light Weight Multi-Features-View Convolution Neural Network For Plant Disease Identification
【速读】:该论文旨在解决资源受限环境下(尤其是农村地区)植物病害识别效率低下的问题,传统人工检测方法耗时费力,而现有基于深度卷积神经网络(Deep Convolutional Neural Networks, CNNs)的模型虽然准确率高,但计算复杂度大、参数冗余,难以在边缘设备部署。解决方案的关键在于提出一种轻量级多视角卷积神经网络(Multi-View Convolutional Neural Network),通过引入额外特征增强模型判别能力,在显著减少参数数量的同时提升了分类准确性——在PlantVillage数据集上相较仅使用RGB图像训练的基础CNN模型提升2.9%的准确率,并且相比当前最优深度CNN模型具有更低的计算开销,实现了高效与高精度的平衡。
链接: https://arxiv.org/abs/2605.00903
作者: Muhammad Kaleem Ullah Khan
机构: COMSATS University Islamabad (COMSATS大学伊斯兰堡分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Agriculture is a key sector of the economies of developing countries. It serves as a primary source of income and employment for rural populations. However, each year, a large portion of crops is wasted because of pests and diseases. Well-timed prediction of plant diseases is crucial to sustainable, high-quality agricultural production. Detection of plant diseases through conventional methods is both labour-intensive and time-consuming. Researchers have developed image classification based automated techniques for this purpose. Most accurate methods are based on deep convolutional neural networks, which are computationally intensive, with many layers and millions of trainable parameters. In resource-constrained settings, especially in rural areas, it is difficult to deploy deep convolutional neural network models for efficient plant disease identification. To address these issues, an efficient and light-weight Multi-View Convolutional Neural Network is proposed. These additional features aid the proposed model to identify the plant diseases accurately and efficiently with less number of parameters. The proposed model is tested on a benchmark Plantvillage dataset and achieves an improvement of 2.9% in classification accuracy compared to the baseline convolutional neural network model, which was trained only on Red, Green, and Blue (RGB) plant images. Compared with state-of-the-art deep convolutional neural network models, the proposed model is less computationally expensive and achieves comparable accuracy for plant disease identification on the PlantVillage dataset.
[CV-206] RA-CMF: Region-Adaptive Conditional MeanFlow for CT Image Reconstruction
【速读】:该论文旨在解决不同成像协议和扫描设备导致的CT图像在噪声统计、对比度和纹理上存在显著差异的问题,从而影响肺癌筛查、诊断、治疗规划及预后评估的准确性。解决方案的关键在于提出一种基于条件均值流(conditional MeanFlow)的新型图像重建管道,其核心创新包括:1)设计一个条件MeanFlow网络,通过预测图像状态条件下的流场来建模增强轨迹,并结合均值流一致性损失与重建损失进行训练;2)引入区域强化学习驱动的策略网络,根据MeanFlow演进信息动态分配局部增强预算、停止条件和总预算,实现空间自适应的精细化增强控制。该方法通过强化学习优化奖励函数(最大化增强效果同时最小化冗余计算与不稳定性),使模型能够聚焦于难处理区域的增强,同时稳定已具高质量区域,最终在肿瘤ROI区域实现高精度重建(平均放射组学特征CCC达0.96),整体图像质量亦显著提升(平均PSNR 34.23 ± 1.71,SSIM 0.95 ± 0.01)。
链接: https://arxiv.org/abs/2605.00901
作者: Md Shifatul Ahsan Apurba,Md Selim,Jin Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The use of CT imaging is important for screening, diagnosis, therapy planning, and prognosis of lung cancers. Unfortunately, due to differences in imaging protocols and scanner models, CT images acquired by different means may show large differences in noise statistics, contrast, and texture. In this study, we develop a novel conditional MeanFlow pipeline for CT image reconstruction. We introduce a conditional MeanFlow network that models the enhancement trajectory by predicting image-conditioned flow fields given intermediate image states. The image enhancement network is trained with a MeanFlow consistency loss along with the image reconstruction loss. In order to provide an adaptive refinement process in terms of spatial location of enhancements, we integrate a regional reinforcement learning-driven policy network into our approach. The policy network receives information about the MeanFlow rollouts and provides predictions in terms of tile-wise refinement budgets, stopping criteria, and total budget allocation of enhancement processes. Our policy network is trained through reinforcement learning in a policy gradient framework, where the goal of the training reward is to maximize improvement of enhancements while minimizing unnecessary computations and avoiding instabilities. In this way, our approach combines conditional flow-based enhancement with reinforcement learning-based spatial enhancement control. This allows our approach to focus more attention on enhancing difficult areas while stabilizing areas already showing sufficient quality. Our results show high accuracy in the tumor ROI, with the average radiomic feature CCC being 0.96, an average PSNR of 31.30 \pm 4.16, and average SSIM of 0.94 \pm 0.07. Moreover, there is an improvement in the overall quality of images, with an average PSNR of 34.23 \pm 1.71 and average SSIM of 0.95 \pm 0.01.
[CV-207] LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images
【速读】:该论文旨在解决现有方法在进行语义数据集比较时计算成本高、鲁棒性不足的问题,尤其是当数据集中仅有极小比例(如5%至1%)图像发生语义差异时,传统基于图像描述(caption-based)的方法难以有效识别。其解决方案的关键在于提出LatentDiff框架,该框架直接在预训练视觉编码器的潜在空间(latent space)中操作,结合稀疏自编码器(sparse autoencoder)驱动的分布差异检验与密度比估计(density ratio estimation),从而以极低的计算开销实现对语义差异的可解释定位,同时在真实场景下的稀疏分布偏移(sparse distribution shifts)下保持稳定性能。
链接: https://arxiv.org/abs/2605.00899
作者: James Flora,Kowshik Thopalli,Akshay R. Kulkarni,Weng-Keen Wong,Shusen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 6 figures
Abstract:We present LatentDiff, a scalable framework for semantic dataset comparison that operates directly in the latent space of pretrained vision encoders. By combining sparse autoencoder-based divergence testing with density ratio estimation, LatentDiff identifies interpretable semantic differences between datasets at a fraction of the computational cost of caption-based alternatives. We also introduce Noisy-Diff, a benchmark capturing realistic sparse distribution shifts that cause existing methods to struggle. Experiments demonstrate that LatentDiff achieves superior accuracy while remaining robust to settings where an extremely small fraction of images (from 5% to 1% ) differ semantically.
[CV-208] When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping ICLR2026
【速读】:该论文旨在解决合成孔径雷达干涉测量(InSAR)中相位解缠(phase unwrapping)作为计算瓶颈的问题,特别是在火山与地震监测场景下,如何在保证精度的同时满足实时性要求。其核心挑战在于当前行业倾向于采用高复杂度的计算机视觉架构(如注意力机制),但未验证这些模型是否适用于具有物理约束的地球物理回归任务。解决方案的关键在于通过大规模架构消融实验(基于全球LiCSAR基准数据集)证明:简单的卷积神经网络结构——即基础U-Net(7.76M参数)——在性能上显著优于复杂的注意力模型(11.37M参数),其R²提升34%、RMSE降低51%,且推理延迟仅为2.92ms(比注意力模型快2.5倍),能够满足操作级预警系统对<100ms响应时间的要求。物理层面的功率谱密度(PSD)分析进一步揭示,注意力机制虽擅长捕捉自然图像中的边缘特征,却会引入违反弹性表面形变平滑性约束的高频伪影(0.3 cycles/pixel),从而破坏物理一致性;因此,该研究主张在机器学习用于遥感(ML4RS)中应优先采用“物理信息驱动的简约设计”。
链接: https://arxiv.org/abs/2605.00896
作者: Prabhjot Singh,Manmeet Singh
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校); RediMinds Inc. (RediMinds公司); Western Kentucky University (西肯塔基大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 5 figures, 2 tables. Oral presentation, ML4RS Workshop @ ICLR 2026
Abstract:Operational phase unwrapping is the primary computational bottleneck in InSAR-based volcanic and seismic monitoring. We challenge the industry trend of adopting high-complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics-constrained geophysical regression. We present the first large-scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant “complexity penalty”: a vanilla U-Net (7.76M parameters) achieves R^2=0.834 and RMSE = 1.01 cm, outperforming 11.37M-parameter attention-based models by 34% in R^2 and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high-frequency artifacts ( 0.3 cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92ms inference latency (a 2.5\times speedup), the vanilla U-Net is the only candidate to comfortably meet the sub-100ms requirement for operational early-warning systems. This work bridges the “publication-to-practice” gap by proving that convolutional locality outperforms modern complexity for smooth-field regression, advocating for physics-informed simplicity in ML4RS. Code available at this https URL
[CV-209] Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding
【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models, VFM)在计算病理学中进行浸润性肿瘤团块分割时,因冻结编码器与轻量解码器之间容量不匹配而导致边界保真度不足的问题。解决方案的关键在于提出Dino-NestedUNet框架,其核心创新是将预训练的DINOv3编码器与一种嵌套密集解码器(Nested Dense Decoder)相结合,通过构建密集的中间路径网格实现特征的连续复用与多尺度再校准,从而在重建过程中有效对齐高层语义信息与低层形态纹理,显著提升边界敏感任务的分割性能。
链接: https://arxiv.org/abs/2605.00894
作者: Tianyang Wang,Ziyu Su,Abdul Rehman Akbar,Usama Sajjad,Usman Afzaal,Lina Gokhale,Charles Rabolli,Wei Chen,Anil Parwani,Muhammad Khalid Khan Niazi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision foundation models (VFMs), such as DINOv3, provide rich semantic representations that are promising for computational pathology. However, many current adaptations pair frozen VFMs with lightweight decoders, creating a capacity mismatch that often limits boundary fidelity for infiltrative tumor bulk segmentation. This paper presents Dino-NestedUNet, a framework that couples a pre-trained DINOv3 encoder with a Nested Dense Decoder. Instead of sparse skip connections and linear upsampling, the proposed decoder forms a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures during reconstruction. We evaluate Dino-NestedUNet on three histopathology cohorts (multi-center CHTN, institutional OSU, and CAMELYON16) and observe consistent improvements over UNet++ and standard Dino-UNet variants, particularly under cross-domain shift. To further assess external generalization, we perform zero-shot evaluation by training on CHTN and directly testing on unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning. These results suggest that dense decoding is a key ingredient for unlocking foundation encoders in boundary-sensitive pathology segmentation.
[CV-210] When To Adapt? Adapting the Model or Data in Federated Medical Imaging
【速读】:该论文旨在解决联邦学习在医疗影像领域中因客户端间域偏移(domain heterogeneity)导致的性能下降问题。现有解决方案主要分为两类:模型侧个性化(model-side personalization),即通过调整模型参数适配每个客户端;以及数据侧协整(data-side harmonization),即在输入层面减少客户端间的差异。论文的关键发现是,这两种策略的有效性取决于域偏移的性质:当偏移以外观差异为主(如胸部X光片分类)时,协整方法更有效;而当偏移表现为结构差异(如结肠息肉分割)时,个性化策略表现更优;若客户端间差异较小,则两者效果相当。因此,选择适应策略的核心依据是域偏移的类型与强度,而非策略本身。
链接: https://arxiv.org/abs/2605.00892
作者: Chamani Shiranthika,Parvaneh Saeedi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, Accepted for oral presentation and proceedings of 24th International Conference on Artificial Intelligence in Medicine, Ottawa, Canada, July 7-10, 2026
Abstract:Federated learning enables collaborative model training across medical institutions without sharing raw data, but its performance is often limited by domain heterogeneity across clients. Existing approaches to address this challenge fall into two main paradigms: model-side personalization, which adapts model parameters to each client, and data-side harmonization, which reduces inter-client variation at the input level. Despite their widespread use, these strategies have not been systematically compared. In this work, we conduct a comprehensive study across six medical imaging settings-colon polyp, skin lesion, and breast tumor segmentation, and tuberculosis CXR, brain tumor, and breast tumor classification-covering diverse types of domain shift. We evaluate a broad set of state-of-the-art harmonization and personalization methods under a unified framework. Our results reveal a conditional trade-off driven by the nature of heterogeneity: harmonization is more effective when variation is primarily appearance-based (e.g., CXR classification), while personalization performs better when differences are structural (e.g., colon polyp segmentation). When inter-client variation is limited, both strategies perform similarly. These findings demonstrate that the effectiveness of adaptation in federated medical imaging depends on the type and magnitude of domain shift rather than the strategy alone. We provide practical guidelines for selecting between harmonization and personalization and highlight directions for future hybrid approaches that combine both paradigms. Code is available at this https URL.
[CV-211] X2SAM: Any Segmentation in Images and Videos
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像和视频中缺乏像素级感知能力的问题,尤其是现有分割模型难以同时支持文本与视觉提示的交互式、跨模态语义理解。其关键解决方案是提出X2SAM,一个统一的分割型MLLM架构,通过将大语言模型(LLM)与掩码记忆模块(Mask Memory)相结合,存储受引导的视觉特征以实现时序一致的视频掩码生成;该设计不仅支持通用、开放词汇、指代表达、推理、视觉定位等多样化任务,还首次在单一接口中实现对图像和视频输入的联合处理,从而显著提升模型在复杂对话指令下的像素级分割能力。
链接: https://arxiv.org/abs/2605.00891
作者: Hao Wang,Limeng Qiao,Chi Zhang,Lin Ma,Guanglu Wan,Xiangyuan Lan,Xiaodan Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
[CV-212] Skeleton-Based Posture Classification to Promote Safer Walker-Assisted Gait in Older Adults
【速读】:该论文旨在解决老年人跌倒这一重大公共卫生问题,通过提升智能助行器(smart walker)中人体行为识别的准确性来增强其在跌倒预防中的作用。解决方案的关键在于利用多种机器学习模型对助行器使用、站立与坐姿以及姿势进行分类,并验证了XGBoost和几何方法(Geometric approach)在不同任务中的优异表现:XGBoost在二分类任务中达到接近完美的训练准确率(如助行器选择为99.84%,站立vs坐姿为99.69%),而几何方法在8类姿势识别中达到89.9%的准确率,XGBoost在17类姿势识别中训练准确率达99.24%。这表明机器学习技术可显著提升人-机器人交互能力,从而优化智能助行器的实时行为监测与干预效果。
链接: https://arxiv.org/abs/2605.00890
作者: Sergio D. Sierra M.,Monica Sinha,Marcela Múnera,Carlos A. Cifuentes
机构: Bristol Robotics Laboratory (布里斯托机器人实验室); University of the West of England (西英格兰大学); Faculty of Engineering, University of Bristol (布里斯托大学工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Falls among older adults are a significant public health concern, leading to severe injuries, loss of independence, and increased healthcare costs. This study evaluates the effectiveness of various models, including a Geometric approach, XGBoost, SVM, and several deep learning architectures, in classifying walker usage, standing vs. sitting, and posture for smart walkers used. Geometric and XGBoost were the top performers. XGBoost achieved near-perfect training accuracy in binary classification tasks, with 99.84% for walker choice and 99.69% for standing vs. sitting. For posture classification, Geometric approach attained 89.9% accuracy for 8 postures, and XGBoost obtained 99.24% during training for 17 postures. Deep learning models such as the 4-layer CNN and Encoder-Decoder CNN also demonstrated strong performance in binary classification, with accuracies above 98%. This study underscores the potential of machine learning to enhance human-robot interaction in smart walkers, particularly for fall prevention.
[CV-213] On the explainability of max-plus neural networks
【速读】:该论文旨在解决神经网络决策过程缺乏可解释性的问题,尤其是如何量化输入特征(如图像像素)对模型输出的影响。其解决方案的关键在于利用线性最小-最大神经网络(linear-min-max neural networks)在初始化时等价于使用无穷范数(∞-norm)的k-medoids聚类结构,并通过次梯度下降进行训练;更重要的是,该模型具有单一激活神经元决定输出值的特性,从而能够精确追踪决策路径。基于此性质,作者设计了一种像素脆弱性度量(pixel fragility measure),用于判断单个像素的变化是否可能导致分类结果改变,实验表明该方法在PneumoniaMnist数据集上优于SHAP和积分梯度(Integrated Gradient)等现有解释方法。
链接: https://arxiv.org/abs/2605.00889
作者: Ikhlas Enaieh(S2A, LTCI),Olivier Fercoq(S2A, LTCI),García Ángel(DATSI, UPM)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: IEEE International Symposium on Computer-Based Medical Systems (CBMS 2026), Jun 2026, Limassol, Cyprus, Cyprus
Abstract:We investigate the explanability properties of the recently proposed linear-min-max neural networks. At initialization, they can be interpreted as k-medoids with the infinity norm as a distance. Then, they are trained using subgradient descent to better fit the data. The model has been shown to be a universal approximator. Yet, we can trace the decision process because a single most activated neuron is responsible for the value of the output. Using this property, we designed a pixel fragility measure that determines whether changes to a single pixel may be responsible to a change in the classification output. Experiments on the PneumoniaMnist dataset show that this explanation for the output of the neural network compares favorably to SHAP and Integrated Gradient.
[CV-214] Selective Correlation Based Knowledge Distillation for Ground Reaction Force Estimation
【速读】:该论文旨在解决可穿戴足底传感器在人体步态分析中估计地面反作用力(Ground Reaction Force, GRF)时面临的噪声干扰与高计算资源需求问题。现有方法虽利用可穿戴设备实现便携性,但易受外部干扰影响精度,而深度学习模型虽能提升准确性却因计算复杂度高难以部署于资源受限的移动终端。其解决方案的关键在于提出一种基于选择性相关性的知识蒸馏方法(Selective Correlation Based Knowledge Distillation, SCKD),通过在提取相关性图的过程中引入时间特征选择机制,优化知识迁移过程,在保持高精度的同时显著降低模型复杂度,从而实现高效、可靠的GRF估计,适用于实时可穿戴设备上的步态分析应用。
链接: https://arxiv.org/abs/2605.00888
作者: Eun Som Jeon,Jisoo Lee,Huisu Lim,Omik M. Save,Hyunglae Lee,Pavan Turaga
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:
Abstract:Wearable sensor-based human gait analysis holds great promise in healthcare, rehabilitation, clinical diagnosis and monitoring, and sports activities. Specifically, ground reaction force (GRF) provides essential insights into the body’s interaction with the ground during movement and is typically measured using instrumented treadmills equipped with force plates. However, such equipment is expensive and restricted to laboratory environments. To enable a more portable solution, wearable insole sensors have been used to measure GRF. These sensors, however, are prone to noise and external interference, which reduces measurement accuracy. Deep learning methodologies could be adopted to address these issues, but they often require significant computing resources to achieve high accuracy, limiting their applicability for real-time analysis on portable devices. To overcome these limitations, we propose Selective Correlation Based Knowledge Distillation (SCKD) for estimating GRF from data collected by insole sensors. Our proposed method utilizes selected features considering temporal characteristics in the process of extracting correlation maps for knowledge transfer, enhancing interpretability and mitigating issues in high dimensional data processing. We demonstrate the effectiveness of the compact models generated by our distillation framework through comparison with existing methods. Various configurations of teacher-student architectures and training approaches are examined based on multiple evaluation criteria, utilizing data collected at different walking speeds and with different window sizes. Experimental results confirm that our approach outperforms existing methods in estimating GRF from wearable insole sensor data. Therefore, our approach offers a reliable and resource-efficient solution for human gait analysis.
[CV-215] SparseContrast: Dynamic Sparse Attention for Efficient and Accurate Contrastive Learning in Medical Imaging
【速读】:该论文旨在解决医学影像中对比学习(contrastive learning)在低数据场景下计算效率低、冗余信息处理严重的问题。传统方法依赖密集注意力机制(dense attention),不仅计算开销大,且常关注非诊断相关区域,影响模型效率与准确性。其解决方案的关键在于提出SparseContrast框架,通过引入动态稀疏注意力机制(dynamic sparse attention),自适应地聚焦于具有临床意义的诊断区域,同时结合一个紧凑的显著性预测器(saliency predictor)在训练过程中优化注意力图的稀疏性和特征质量。此设计在保持甚至提升诊断准确率的同时,使训练和推理速度相比密集注意力基准提升最高达40%,且对骨干网络架构不敏感,适用于卷积神经网络与Transformer模型。
链接: https://arxiv.org/abs/2605.00887
作者: Paarth Prasad,Ruchika Malhotra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose SparseContrast, a new framework that merges dynamic sparse attention with contrastive learning for medical imaging, with a focus on chest X-ray disease detection in low-data settings. Traditional contrastive learning methods rely on dense attention mechanisms, which are computationally expensive and often process redundant regions in medical images. To resolve this, SparseContrast introduces a sparse attention mechanism that selectively concentrates on diagnostically pertinent areas, markedly decreasing computational burden without compromising accuracy. The framework adaptively trims attention maps in the training phase, directed by a compact saliency predictor which concurrently optimizes sparsity and feature quality. This method not only speeds up training and inference by as much as 40% relative to dense attention benchmarks but also boosts diagnostic accuracy by focusing on areas of clinical importance. Moreover, the approach remains indifferent to the selection of backbone architecture, which permits its application to both convolutional and transformer-based models. Experiments show SparseContrast attains comparable or better performance in disease identification tasks with greater efficiency relative to current approaches. The proposed framework delivers a practical approach for implementing contrastive learning in medical imaging settings with limited resources, where computational efficiency and diagnostic accuracy are paramount.
[CV-216] Selective Attention-Based Network for Robust Infrared Small Target Detection
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因目标尺寸极小、信杂比低以及背景结构复杂导致的误报率高和检测精度不足的问题。现有基于编码器-解码器架构的方法受限于早期卷积阶段的信息瓶颈及静态跳跃连接无法动态区分真实目标与伪目标区域。解决方案的关键在于提出SANet网络,其核心创新为两个模块:(1) 双路径语义感知模块(Dual-path Semantic-aware Module, DSM),融合标准卷积与螺旋形卷积以增强局部细节保留和方向敏感的感受野,并通过卷积块注意力模块(Convolutional Block Attention Module, CBAM)实现空间-通道特征精细化重校准;(2) 选择性注意力融合模块(Selective Attention Fusion Module, SAFM),以可学习的空间自适应加权机制替代传统静态跳跃连接,实现跨尺度特征的上下文感知融合,从而提升对微弱小目标的识别能力与抗干扰性能。
链接: https://arxiv.org/abs/2605.00886
作者: Yingming Zhang,Wuqi Su,Qing Xiao,Yonggang Yang
机构: Zhejiang Gongshang University (浙江工商大学); Tiangong University (天津工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared small target detection (IRSTD) plays a pivotal role in a broad spectrum of mission-critical applications, including maritime surveillance, military search and rescue, early warning systems, and precision-guided strikes, all of which demand the precise identification of dim, sub-pixel targets amid highly cluttered infrared backgrounds. Despite significant progress driven by deep learning methods, fundamental challenges persist: infrared small targets occupy extremely limited spatial extents (often only a few pixels), exhibit low signal-to-clutter ratios, and are easily confused with structurally complex backgrounds that frequently induce false alarms. Existing encoder-decoder architectures suffer from two key limitations - an information bottleneck in early convolutional stages that undermines fine-grained target perception, and static skip connections that lack the dynamic adaptability required to discriminate between genuine targets and pseudo-target regions. To address these challenges, we propose SANet, a Selective Attention-based Network built upon the classical U-Net framework and augmented with two novel components: (1) a \emphDual-path Semantic-aware Module (DSM) that integrates standard convolutions for local spatial detail preservation with pinwheel-shaped convolutions for expanded, direction-sensitive receptive fields, followed by a Convolutional Block Attention Module (CBAM) for fine-grained spatial-channel feature recalibration; and (2) a \emphSelective Attention Fusion Module (SAFM) that replaces conventional static skip connections with a spatially adaptive, learnable weighting mechanism to perform context-aware, cross-scale feature fusion.
[CV-217] Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion
【速读】:该论文旨在解决非均匀雾霾图像(non-homogeneous hazy images)去雾难题,这类图像具有空间变化的雾霾浓度和区域间突变的密度过渡,传统单幅图像去雾方法在此类场景下性能显著下降。解决方案的关键在于提出一种多分支深度神经网络框架——浓度分区与图像融合网络(Concentration Partitioning and Image Fusion Network, CPIFNet),其核心思想是将复杂的非均匀去雾问题分解为若干可处理的局部同质子问题。CPIFNet采用两阶段架构:第一阶段通过多个独立训练的图像增强网络(IENet)分支分别针对不同雾霾浓度水平的同质数据集进行优化,从而获得对特定密度区域具有良好恢复能力的模型;第二阶段利用图像融合网络(IFNet)通过深层特征堆叠与融合策略,智能整合各分支输出的优势区域,最终生成统一高质量的去雾结果。
链接: https://arxiv.org/abs/2605.00885
作者: Yingming Zhang,Wuqi Su,Qing Xiao,Yonggang Yang
机构: Zhejiang Gongshang University (浙江工商大学); Tiangong University (天津工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing single image dehazing methods have demonstrated satisfactory performance on homogeneous thin-haze images; however, they often struggle with non-homogeneous hazy images that exhibit spatially varying haze concentrations and abrupt density transitions across different regions. To address this fundamental limitation, we propose a novel multi-branch deep neural network framework, termed Concentration Partitioning and Image Fusion Network (CPIFNet), which decomposes the challenging non-homogeneous dehazing problem into a set of tractable homogeneous sub-problems. Our key insight is that a single non-homogeneous hazy image can be viewed as a composite of multiple local regions, each exhibiting approximately homogeneous haze characteristics. CPIFNet employs a two-stage architecture consisting of an Image Enhancement Network (IENet) stage and an Image Fusion Network (IFNet) stage. In the first stage, multiple IENet branches are independently trained on homogeneous haze datasets of different concentration levels, producing enhancement models that excel at restoring regions matching their respective haze densities. In the second stage, the IFNet intelligently aggregates the advantageous regions from all enhancement outputs through deep feature stacking and merging, yielding a unified high-quality dehazed result. Furthermore, we introduce a comprehensive loss function incorporating reconstruction, perceptual, structural, and color losses to jointly supervise both stages.
[CV-218] LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception
【速读】:该论文旨在解决将视觉-语言-动作(Vision-Language-Action, VLA)模型部署于无人机等资源受限边缘设备时面临的低延迟闭环引导难题。现有VLA模型虽在操作任务中表现出良好的语义接地和泛化能力,但在空中场景下因严格的计算与通信约束难以实时运行。解决方案的关键在于设计了一个轻量级、参数量为256M的VLA系统LiteVLA-H,采用双速率工作机制:快节奏外环引导模式以约50.65ms(19.74Hz)输出动作令牌,同时慢速语义模式维持场景理解与风险描述能力(149.90–164.57ms,6.08–6.67Hz)。核心发现是,在紧凑边缘环境下,端到端延迟主要由多模态预填充(multimodal pre-fill)主导,而非解码少量额外token的边际成本,因此提出一种调度策略实现反应式动作输出与周期性语义感知并存,并通过知识保留微调方法融合飞行数据、空中语义数据及通用图像描述/视觉问答监督,从而在不牺牲描述能力的前提下优化边缘推理效率。
链接: https://arxiv.org/abs/2605.00884
作者: Justn williams,Kishor Datta Gupta,Roy George,Mrinmoy Sarkar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65,ms (19.74,Hz) while still supporting sentence-level semantic outputs at 149.90–164.57\ms (6.08–6.67,Hz) on the same embedded platform. To specialize the model without collapsing its descriptive competence, we use a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Beyond reporting current latency measurements, we position the system against recent state-of-the-art architectures, including AnywhereVLA, FutureVLA, and ReMem-VLA, showing that the measured action branch reaches a higher edge inference rate under our deployment conditions while retaining periodic semantic awareness.
[CV-219] owards High Fidelity Face Swapping: A Comprehensive Survey and New Benchmark
【速读】:该论文旨在解决当前人脸交换(Face Swapping)技术研究中存在的方法碎片化、评估标准不统一以及缺乏系统性分析的问题。现有方法分散在不同范式中,且由于缺少标准化数据集和评测协议,导致性能比较难以公平开展;同时,以往综述多聚焦于更广泛的深度伪造(Deepfake)生成或检测,未能将人脸交换作为一个独立问题进行深入探讨。解决方案的关键在于:首先,对现有方法按设计原理归纳为五大范式并系统分析其优劣;其次,构建高质量基准数据集CASIA FaceSwapping,该数据集具有平衡的人口统计分布和明确的属性变化,支持可控实验;最后,建立标准化评测协议以客观衡量各类方法的鲁棒性。这一框架为推动更鲁棒、可控制的人脸交换技术发展提供了统一视角与科学评估基础。
链接: https://arxiv.org/abs/2605.00883
作者: Qi Li,Weining Wang,Shuangjun Du,Bo Peng,Jing Dong,Kun Wang,Zhenan Sun,Ming-Hsuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Face swapping has witnessed significant progress in recent years, largely driven by advances in deep generative models such as GANs and diffusion this http URL these advances, existing methods remain fragmented across different paradigms, and their evaluation is highly inconsistent due to the lack of standardized datasets and protocols. Moreover, prior surveys primarily focus on broader deepfake generation or detection, leaving face swapping insufficiently studied as a standalone problem. In this paper, we present a comprehensive survey and benchmark for face swapping. We provide a structured review of existing methods, organizing them into five major paradigms and systematically analyzing their design principles, strengths, and limitations. To enable fair and controlled evaluation, we introduce CASIA FaceSwapping, a high-quality benchmark with balanced demographic distributions and explicit attribute variations, and establish standardized protocols to assess the robustness of different face swapping methods. Extensive experiments on representative approaches yield new insights into the performance characteristics and limitations of current techniques. Overall, our work provides a unified perspective and a principled evaluation framework to facilitate the development of more robust and controllable face swapping methods. More results can be found at this https URL.
[CV-220] Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography
【速读】:该论文旨在解决远程光电容积脉搏波描记法(remote Photoplethysmography, rPPG)中自监督学习(Self-Supervised Learning, SSL)方法普遍存在的“相关性陷阱”问题:即模型倾向于学习数据中能量最强的伪信号(如运动噪声或光照变化),而非微弱的真实rPPG生理信号,导致泛化性能差。其解决方案的关键在于提出一种新的自监督范式——生理因果探测(Physiological Causal Probing, PCP),将潜在的rPPG信号视为物理源,像素色度变化视为其视觉表现,并通过主动干预机制验证假设的物理合理性。具体实现为Interv-rPPG框架,其中PhysMambaFormer提取rPPG假设,可控生理信号编辑器(Controllable Physiological Signal Editor)基于该假设在视频低频色度域进行精准干预,利用“归零可证伪性”(Falsifiability via Nulling)和“公理等变性”(Axiomatic Equivariance)双重验证机制,显著提升模型对运动与光照干扰的鲁棒性,同时在跨数据集场景下优于监督基线。
链接: https://arxiv.org/abs/2605.00882
作者: Zhiyi Niu,Xiaoguang Tu,Bo Zhao,Junzhe Cao,Dan Guo,Zitong Yu
机构: Hong Kong University of Science and Technology (香港科技大学); Harbin Institute of Technology (哈尔滨工业大学); Hefei University of Technology (合肥工业大学); Guangdong University (广东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote Photoplethysmography (rPPG) enables convenient non-contact physiological measurement. Existing Self-Supervised Learning (SSL) methods commonly fall into a correlation trap: they tend to learn the most dominant periodic signals in the data, such as high-energy motion or illumination noise, rather than the faint, true rPPG signal, leading to poor model generalization. To address this, we propose a new SSL paradigm, Physiological Causal Probing (PCP), which treats the latent rPPG signal as the underlying physical source and the resulting pixel chrominance variations as its visual manifestation. Its core idea is to shift from passive correlation learning to active, precise intervention: it intervenes on the video based on a proposed rPPG hypothesis, and verifies whether the post-intervention changes match physical expectations. We propose the Interv-rPPG framework to implement PCP: an rPPG extractor named PhysMambaFormer hypothesizes the rPPG signal, while a Controllable Physiological Signal Editor conducts precise chrominance-domain interventions on videos based on this hypothesis. Interv-rPPG validates the physical realism of the hypothesis through Falsifiability via Nulling' and Axiomatic Equivariance’. Our editor achieves precise editing of the rPPG signal by intervening in the low-frequency chrominance components of the video. Our method improves both in-domain and cross-domain performance on challenging datasets such as VIPL-HR and MMPD. Furthermore, it surpasses the supervised baseline in complex cross-dataset settings, while remaining competitive on clean datasets where the intervention mechanism may introduce slight residual chrominance noise. Extensive experiments, including diagnostic analysis of nuisance sensitivity, demonstrate that the PCP paradigm effectively resists motion and illumination artifacts.
[CV-221] Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶(End-to-End Autonomous Driving, E2E AD)模型在面对视觉上不可察觉的对抗扰动时所表现出的共性脆弱性问题,尤其是基于Transformer架构的模型容易被操纵导致危险驾驶行为。现有对抗攻击方法多依赖白盒或黑盒假设,存在模型透明度要求高、查询延迟大或迁移能力弱等局限。解决方案的关键在于提出一种新型灰盒攻击框架——对抗流匹配(Adversarial Flow Matching, AFM),其核心创新是通过神经平均速度场(neural average velocity field)实现单步高效生成对抗样本,并协同扰动生成潜在空间与神经平均速度场,从而在保证高度视觉隐蔽性的前提下显著降低VLA和模块化AD代理的性能,同时展现出优异的跨模型迁移能力,逼近黑盒攻击效果但仅需目标模型包含Transformer模块这一先验知识。
链接: https://arxiv.org/abs/2605.00880
作者: Xinyu Zeng,Xiangkun He,Lei Tao,Chen Lv,Hong Cheng
机构: Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China (深圳先进研究院,电子科技大学); Nanyang Technological University (南洋理工大学); School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China (机械与电气工程学院,电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 11 figures
Abstract:Autonomous driving (AD) is evolving towards end-to-end (E2E) frameworks through two primary paradigms: monolithic models exemplified by Vision-Language-Action (VLA), and specialized modular architectures. Despite their divergent designs, both paradigms increasingly rely on Transformer backbones for complex reasoning, potentially causing a shared vulnerability: visually imperceptible perturbations can manipulate E2E AD models into hazardous maneuvers by targeting the Transformer module. Most existing adversarial attack approaches against AD systems operate under white-box or black-box settings; yet, they typically necessitate full model transparency, or suffer from either prohibitive query latency or limited attack transferability. In this paper, we propose Adversarial Flow Matching (AFM), a novel gray-box attack framework that exploits Transformer structural vulnerabilities in E2E AD models. AFM enables efficient one-step generation of adversarial examples via a neural average velocity field. Additionally, the proposed technique yields effective and visually imperceptible attacks by synergistically perturbing the generative latent space and the neural average velocity field. Extensive experiments demonstrate that AFM achieves a superior trade-off between attack effectiveness and imperceptibility: it substantially degrades the performance of both VLA and modular AD agents across various scenarios compared to baselines, while maintaining state-of-the-art visual imperceptibility. Furthermore, adversarial examples generated by AFM exhibit robust cross-model transferability, indicating that AFM closely approximates a black-box attack setting while requiring only the prior knowledge that the target AD model incorporates a Transformer-based module.
[CV-222] Single Image Defogging Using a Fourth-Order Telegraph PDE Guided by Physical Haze Modeling
【速读】:该论文旨在解决真实场景下图像去雾(image defogging)这一逆问题,其核心挑战在于场景深度未知、大气散射效应复杂以及缺乏真实标签(ground truth)。为应对这些问题,作者提出了一种融合四阶非线性偏微分方程(PDE)与物理雾霾形成模型的混合去雾方法。解决方案的关键在于:首先利用暗通道先验(Dark Channel Prior, DCP)估计大气光和透射率,并生成引导图像;随后通过一个基于四阶 Telegraph 型 PDE 的演化过程进行图像恢复,其中引入边缘自适应扩散系数和由透射图加权的保真项,从而在有效抑制雾霾的同时保留结构细节。该方法借助相对误差范数判断 PDE 收敛性,实验表明其在视觉质量与结构保持方面优于传统 DCP、改进 DCP 及变分法等单幅图像去雾技术。
链接: https://arxiv.org/abs/2605.00878
作者: Manish Kumar,Rajendra K. Ray
机构: Indian Institute of Technology Mandi (印度理工学院曼迪分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In real-world scenarios, image defogging is an inverse problem due to unknown scene depth, atmospheric scattering, and the common absence of ground truth . To resolve the issue, we propose a hybrid defogging model that integrates a fourth-order nonlinear PDE with a physical haze formation model. We used Dark Channel Prior to estimate atmospheric parameters and to generate a guidance image, while the final restoration is performed via a fourth-order PDE-based evolution. A fourth-order PDE of the type telegraph is then evolved, incorporating an edge-adaptive diffusion coefficient and a fidelity term weighted by the transmission map. Fourth-order diffusion effectively suppresses haze while preserving structural details, and the hyperbolic formulation improves numerical stability and convergence behavior. We use relative error norm criteria for the convergence of our PDE. The proposed method is compared with Dark Channel prior, modified Dark Channel prior, and variational-based single-image defogging techniques. When we have ground truth available, we use MSE and SSIM for quantitative evaluation, whereas no-reference metrics, including FADE, Contrast Restoration Index, Average Gradient, and Entropy, are applied to real-world foggy images. Experimental results demonstrate that the proposed hybrid PDE-based method provides comparable visual quality and maintains structural details.
[CV-223] GAZE: Grounded Agent ic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在医学影像诊断中缺乏迭代推理能力的问题,即现有模型通常仅通过单次前向传播完成图像理解与文本生成,而临床实践中放射科医生会多次观察图像并结合文献进行诊断。解决方案的关键在于提出GAZE(Grounded Agentic Zero-shot Evaluation)框架,该框架使医学VLM能够调用多类工具(如缩放、窗宽调整、边缘检测等 viewer-level 工具以及基于美国国家医学图书馆的PubMed和Open-i的检索工具),并通过结构化输出验证和完整工具调用日志实现可审计性,从而模拟人类专家的渐进式诊断过程。此设计显著提升了罕见病的定位和诊断性能,尤其在无任务特定微调条件下实现了58.2% mAP@0.3的病变定位准确率和34.9% Top-1诊断准确率,证明了工具调用机制与结构化推理流程对提升医学VLM泛化能力的重要性。
链接: https://arxiv.org/abs/2605.00876
作者: Duaa Alim,Mogtaba Alim,Liam Chalcroft
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) read an image and produce text in a single forward pass, whereas radiologists typically inspect an image several times and consult the literature before writing a report. We introduce GAZE (Grounded Agentic Zero-shot Evaluation), a framework that lets a medical VLM work in this iterative way by calling viewer-level tools (zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine (PubMed for medical literature, Open-i for radiological images), with structured outputs validated against a schema and full tool-call traces recorded for auditability. On NOVA, a benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE reaches 58.2 mean average precision (mAP) at intersection-over-union (IoU) 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol that scores captioning, diagnosis, and localisation from the image alone, without task-specific fine-tuning. Before any tool is used, structured prompting and schema-validated outputs already improve over the published Gemini 2.0 Flash baseline (20.2 to 29.4 mAP@0.3), so framework design is itself an experimental variable. Tool use helps rare pathologies disproportionately: the fraction of cases with IoU 0.3 rises from 17% to 58% for diagnoses with three or fewer examples versus 25% to 68% for common conditions ( \geq 10 cases), with gains tracking engagement (Gemini 3 Flash: Cohen’s d = 0.79, 11.8 tool calls per case; Gemini 2.0 Flash: tools used in 8.2% of cases, no significant benefit). Retrieval ablations additionally reveal a model-dependent trade-off in which gains in diagnosis can coincide with losses in localisation, reinforcing the case for joint evaluation of diagnosis, localisation, and captioning in medical VLMs.
[CV-224] Visual Chart Representations for Cryptocurrency Regime Prediction: A Systematic Deep Learning Study
【速读】:该论文旨在解决如何有效利用视觉表示方法对加密货币市场状态(regime)进行分类的问题,以提升技术分析中基于图表的预测准确性。其关键解决方案在于系统性地比较多种图像编码方式(原始蜡烛图、Gramian Angular Fields 和多通道 GAF)、不同图表组件配置及神经网络架构,并验证了在金融图表场景下,简单的 4 层卷积神经网络(CNN)直接处理原始蜡烛图(仅价格信息,分辨率 128×128)即可达到最优性能(AUC-ROC=0.892),显著优于更复杂的预训练模型(如 ResNet18、EfficientNet-B0 和 Vision Transformer)。此外,尽管存在自然图像与金融图表之间的领域差异,ImageNet 预训练仍能带来 4–16% 的性能提升,表明迁移学习在该任务中具有重要价值。
链接: https://arxiv.org/abs/2605.00875
作者: Dustin M. Haggett
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures, 9 tables. Stevens Institute of Technology course project, Fall 2025
Abstract:Technical traders have long relied on visual analysis of candlestick charts to identify market patterns and predict price movements. While deep learning has achieved remarkable success in image classification, its application to financial chart images remains underexplored. This paper presents a systematic study comparing different visual representations for cryptocurrency regime prediction. We evaluate three image encoding methods (raw candlestick charts, Gramian Angular Fields, and multi-channel GAF), five chart component configurations, four neural network architectures (CNN, ResNet18, EfficientNet-B0, and Vision Transformer), and the impact of ImageNet transfer learning. Through eight controlled experiments on Bitcoin, Ethereum, and SP 500 data spanning 2018-2024, we identify optimal configurations for visual regime classification. Our results show that a simple 4-layer CNN on raw candlestick charts achieves 0.892 AUC-ROC, outperforming larger pretrained models. Surprisingly, simpler representations (price-only charts, 128x128 resolution) consistently outperform more complex alternatives. We provide interpretability analysis using GradCAM and demonstrate that transfer learning improves performance by 4-16% despite the domain gap between natural images and financial charts.
[CV-225] Latent Space Probing for Adult Content Detection in Video Generative Models DSN
【速读】:该论文旨在解决生成式 AI(Generative AI)视频生成系统中成人及性暴露内容的实时检测难题,现有方法仅依赖提示(prompt)或解码后的像素空间输出,无法利用生成过程中形成的丰富内部表征。其解决方案的关键在于提出一种新颖的潜在空间探测(latent space probing)框架,通过拦截CogVideoX视频扩散模型在推理阶段产生的去噪潜在表示,并附加轻量级分类器实现对成人内容的实时检测。实验表明,潜在空间信号包含强判别性特征,在自建的大规模二分类数据集上达到97.29%的F1分数,且延迟仅增加4–6ms,显著提升了检测性能与计算效率。
链接: https://arxiv.org/abs/2605.00874
作者: Alizishaan Khatri,Chiquita Prabhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: To be published in 2026 56th Annual IEEE International Conference on Dependable Systems and Networks Workshops (DSN-W)
Abstract:The rapid proliferation of AI-powered video generation systems has introduced significant challenges in content moderation, particularly with respect to adult and sexually explicit material. Existing detection methods operate on either prompts or decoded pixel-space outputs. Therefore, both approaches are blind to the rich internal representations formed during generation. In this paper, we propose a novel latent space probing framework that intercepts the denoised latent representations produced by the CogVideoX video diffusion model during inference and attaches lightweight classifiers to perform real-time adult content detection. To support this work, we construct a large-scale binary dataset of 11039 ten-second video clips (5086 violating, 5953 non-violating) sourced from adult websites and YouTube respectively. We introduce two lightweight probing classifier architectures. We train and evaluate it on the dataset. Our work demonstrates that latent-space signals encode strong discriminative features for harmful content detection, achieving 97.29% F1 on our held-out test set with an overhead in the 4-6ms range. Our results suggest that probing the latent space results in improvements in both detection performance as well as cost.
[CV-226] BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios
【速读】:该论文旨在解决当前文本到视频(Text-to-Video, T2V)生成模型缺乏可靠、全面评估方法的问题,尤其针对现有基准测试忽视不合理提示(implausible prompting)和音频-视觉一致性(audio-visual alignment)的缺陷。解决方案的关键在于提出首个统一框架BRITE,其核心创新包括:(1)引入不合理提示以检验模型在分布外(off-manifold)场景下的鲁棒性;(2)提供细粒度的音频-视觉一致性评估;(3)采用基于问答(QA-based)的可解释性评价机制。与依赖多模态大语言模型(Multimodal LLM)的全自动流水线不同,BRITE通过严格的人工参与(human-in-the-loop)协议确保评估结果的可靠性,从而精准识别下一代T2V模型在物体-动作绑定和音画同步等方面的性能瓶颈。
链接: https://arxiv.org/abs/2605.00873
作者: Advait Tilak,Jiwon Choi,Nazifa Mouli,Wei Le
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of photorealistic Text-to-Video (T2V) generation brings in an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlooked implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. Our framework offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts
[CV-227] Synthetic Designed Experiments for Diagnosing Vision Model Failure CVPR
【速读】:该论文旨在解决当前计算机视觉合成数据(synthetic data)生成流程中存在的根本性问题:现有方法将合成数据视为廉价的真实数据,随机采样生成器输出空间,未能针对性地诊断下游模型的实际需求,导致无法有效覆盖模型的失败模式。其解决方案的关键在于引入基于实验设计(Design of Experiments, DoE)理论的“合成设计实验”(Synthetic Designed Experiments for Representational Sufficiency, SDRS),将下游模型视为黑箱系统,合成生成器作为实验装置,利用分数因子设计(fractional factorial designs)通过方差分析(ANOVA)分解高效审计模型对各因素的敏感性分布,并据此分类两种可操作的失败类型:I型缺口(因子水平覆盖不足)和II型缺口(依赖伪相关干扰因素)。该审计机制进一步指导生成针对性合成数据以修复对应缺口,从而提升模型鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2605.00832
作者: Krisanu Sarkar
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review at CVPR SynData4CV 2026
Abstract:Current synthetic data pipelines for computer vision generate images without diagnosing what the downstream model actually needs. This open-loop paradigm treats synthetic data as cheap real data, randomly sampling the generator’s output space and hoping to cover the model’s failure modes. We argue this fundamentally misuses synthetic data’s unique property: the controllable, independent variation of scene this http URL on the statistical theory of Design of Experiments (DoE), we propose Synthetic Designed Experiments for Representational Sufficiency (SDRS). SDRS treats the downstream model as a black-box system and the synthetic generator as an experimental apparatus. Using fractional factorial designs, SDRS efficiently audits a model’s factor-sensitivity profile via ANOVA decomposition. It classifies failures into two actionable types: Type I gaps (coverage failures on underrepresented factor levels) and Type II gaps (reliance on spurious nuisance dependencies). The audit then prescribes targeted synthetic data to address each gap type. We validate SDRS on three experiments: (1) a controlled diagnostic on dSprites with planted biases, where the audit correctly identifies both gap types and targeted data improves accuracy from 49.9% to 79.0%; (2) a dense segmentation task on procedural scenes, where detecting background-complexity shortcuts and applying targeted data improves mIoU from 0.948 to 0.998; and (3) an entanglement detection experiment showing that the ANOVA audit identifies cross-factor contamination in imperfect generators. Finally, we show that per-factor invariance penalties can transfer sensitivity between factors, identifying an open problem for representation-level correction.
[CV-228] Biological Spatial Priors Regularize Foundation Model Representations for Cross-Site MSI Generalization in Colorectal Cancer
【速读】:该论文旨在解决基于常规苏木精-伊红染色(Hematoxylin and Eosin, HE)全切片图像(Whole Slide Images, WSI)预测微卫星不稳定性(Microsatellite Instability, MSI)状态时,模型在不同医疗机构间泛化能力差的问题。其关键解决方案是引入基于生物学机制的局部空间先验(spatial priors),具体包括两种编码方式:一是基于肿瘤侵袭边缘周围淋巴细胞反应的“外围距离编码”(peripheral distance encoding),反映Crohn样周边淋巴细胞浸润;二是局部免疫微环境编码(local immune neighborhood encoding),量化每个图块(tile)邻域内淋巴细胞与肿瘤细胞的比例。这两种先验被注入TransMIL聚合器中,在自注意力机制前整合到UNI2-h或Virchow2特征中,从而引导模型学习更具跨机构一致性的生物特征表示,减少对站点特异性成像模式的依赖,显著提升外部验证集上的AUC和特异性。
链接: https://arxiv.org/abs/2605.02660
作者: Dasari Naga Raju
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Predicting microsatellite instability (MSI) status from routine hematoxylin and eosin (HE) whole slide images (WSIs) offers a practical alternative to molecular testing, but models trained at one institution tend to generalize poorly to slides acquired at a different site. Foundation model representations, despite their generality, still encode site-specific texture alongside the conserved biological morphology underlying MSI. We investigate whether tile-level spatial priors derived from known MSI histology can guide these representations toward more site-invariant features. We introduce a biologically motivated spatial prior based on peripheral distance encoding, reflecting the Crohn’s-like peripheral lymphocytic reaction at the tumor invasive margin, and evaluate a secondary local immune neighborhood encoding reflecting the lymphocyte-to-tumor ratio in each tile’s immediate spatial neighborhood. Both priors are injected into a TransMIL aggregator before self-attention, allowing the transformer to integrate spatial biological context with UNI2-h or Virchow2 features across all attention layers. We evaluate six foundation model and MIL aggregator combinations as a reference, then assess the effect of each spatial prior. Training on TCGA-COAD (137 slides) and evaluating externally on TCGA-READ (50 slides) without retraining, peripheral distance encoding achieves MSI AUC 0.959 +/- 0.012 on COAD and MSS specificity 1.000 on READ, compared to 0.957 and 0.939 for the strongest reference configuration. Local immune neighborhood encoding achieves comparable internal AUC but lower cross-site specificity, suggesting margin proximity encodes a more site-invariant biological signal than local immune density. Results suggest biologically grounded spatial priors act as regularizers that reduce reliance on site-specific imaging patterns.
[CV-229] Continuous quantification of viral plaque dynamics using ultra-large-area label-free imaging enables rapid antiviral susceptibility testing
【速读】:该论文旨在解决传统病毒斑减少试验(plaque reduction assay, PRA)在抗病毒药物敏感性检测中存在的时间耗时长、人工操作繁琐、易出错且无法动态观测病毒抑制动力学的问题。其核心解决方案是开发了一种无标记、时间分辨的PRA平台,关键在于集成紧凑型无透镜成像系统与定制超大面积(100 cm²)薄膜晶体管(thin-film transistor, TFT)图像传感器,并结合深度学习算法,在培养箱内实现对病毒斑形成单位(plaque-forming units, PFUs)动态变化的自动量化。该方案不仅将读取时间缩短约26小时,还揭示了药物浓度依赖的时序延迟和新病毒斑生成抑制特征,从而在约60小时感染后即可得出明确的药物疗效评估,为病毒学研究、高通量药物筛选及临床诊断提供了可扩展、信息丰富的新型测量框架。
链接: https://arxiv.org/abs/2605.01738
作者: Merve Eryilmaz,Yuzhu Li,Xiao Wang,Max Zhang,Alp Inegol,Zixiang Ji,Lucas Thai,Guangdong Ma,Akihiko Fujisawa,Kazunori Yamaguchi,Aydogan Ozcan
机构: 未知
类目: Applied Physics (physics.app-ph); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph); Instrumentation and Detectors (physics.ins-det)
备注: 42 Pages, 7 Figures
Abstract:The plaque reduction assay (PRA) remains the gold standard for antiviral susceptibility testing, evaluating drug potency by measuring reductions in plaque-forming units (PFUs). However, the traditional PRA is time-consuming, labor-intensive, prone to manual counting errors, and offers limited scalability. Moreover, its reliance on destructive fixation and chemical staining reduces the assay to a static, endpoint observation, obscuring the dynamic, time-resolved kinetics of dose-dependent viral inhibition. Here, we introduce a label-free, time-resolved PRA platform that transforms the conventional assay into a continuous, high-dimensional measurement of viral infection dynamics. Our system integrates a compact lens-free imaging setup with a custom-designed ultra-large-area (100 cm^2) thin-film transistor (TFT) image sensor and deep learning-based algorithms to autonomously quantify PFU dynamics within an incubator. Validated using herpes simplex virus type-1 (HSV-1) treated with acyclovir, the platform matched chemically-stained ground truth measurements with zero false positives while accelerating readout by ~26 hours. Crucially, our system revealed that increasing drug concentrations induce temporally distinct delays and suppress new PFU formation, enabling conclusive drug efficacy evaluations within ~60 hours post-infection. This scalable, label-free framework redefines antiviral susceptibility testing as a rapid, time-resolved and information-rich measurement framework, providing a generalizable platform for virology research, high-throughput drug screening, and clinical diagnostics.
[CV-230] Quaternion Nonlinear Transform-Induced Nuclear Norm for Low-Rank Tensor Completion
【速读】:该论文旨在解决现有低秩张量补全方法在处理四元数张量(quaternion tensor)时的局限性,特别是针对彩色图像和视频中通道间依赖关系无法有效建模的问题。传统基于线性变换的张量核范数(TNN)方法虽能利用低秩结构恢复缺失数据,但其对非线性结构的捕捉能力有限;而现有的非线性变换张量核范数(NTTNN)方法仅适用于实值张量,难以扩展至四元数域,主要受限于四元数乘法的不可交换性及四元数奇异值分解(quaternion singular value decomposition, QSVD)的复杂性。解决方案的关键在于提出一种基于实嵌入(real embedding)的四元数非线性变换诱导张量核范数(QNTTNN),通过将四元数映射到实空间来定义可计算的核范数并实现高效优化,进而构建出适用于四元数张量补全的模型,并设计了具有严格收敛保证的近端交替最小化算法。
链接: https://arxiv.org/abs/2605.01467
作者: Biswarup Karmakar,Ratikanta Behera
机构: Indian Institute of Science (印度科学研究所)
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 25 pages
Abstract:Tensor completion has emerged as a powerful framework for recovering missing data in multidimensional signals by exploiting low-rank tensor structures. Among existing approaches, linear transform-based tensor nuclear norm (TNN) methods have achieved considerable success by enforcing low-rankness on transformed frontal slices. However, the low-rank structure revealed by linear transforms remains inherently limited. To better capture intrinsic correlations, nonlinear transform-based TNN (NTTNN) models have been proposed, significantly enhancing low-rank representation through composite transforms. Despite their effectiveness, existing NTTNN methods are restricted to real-valued tensors and fail to model quaternion-valued data, which are essential for preserving inter-channel dependencies in color images and videos. Extending nonlinear TNN models to the quaternion domain is challenging due to the non-commutativity of quaternion multiplication and the complexity of quaternion singular value decomposition. To address the limitations encountered in prior works, we propose a quaternion nonlinear transform-induced tensor nuclear norm (QNTTNN) via a real embedding of quaternions, enabling tractable nuclear norm definitions and efficient optimization. Building upon QNTTNN, we formulate a quaternion tensor completion model and develop a proximal alternating minimization algorithm with rigorous convergence guarantees. Extensive experiments on benchmark color video inpainting datasets validate the superior performance of the proposed method over existing approaches.
[CV-231] Reconstruction Interval Z-Phase Dependence of AI Detection Sensitivity in CT Lung Nodule Screening
【速读】:该论文旨在解决深度学习驱动的肺结节检测系统在不同CT扫描参数下检测敏感性差异的问题,特别是尚未被充分研究的结节位置(z相位)对检测概率的影响。其关键发现是:AI检测敏感性显著依赖于重建间隔与结节直径之比(d/D),当该比值接近或超过1.0时(如3–6mm结节在5mm重建间隔下),z相位成为单次扫描中检测变异的主要来源,且这种随机效应无法通过协议级质量指标捕捉,也无法体现在AI置信度评分中。
链接: https://arxiv.org/abs/2605.00971
作者: Dan Soliman
机构: GammaMetric Medical Physics, Independent Medical Physics Consultancy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Background: Sensitivity of AI-assisted lung nodule detection systems is known to vary with CT acquisition parameters including radiation dose, reconstruction kernel, and slice thickness. However, the dependence of detection probability on nodule position within the reconstruction cycle – the z-phase – has not, to the author’s knowledge, been characterized for deep learning-based detection systems. Methods: A retrospective analysis was performed using the LIDC-IDRI dataset. Detection results from a previously validated 154-case perturbation study were re-analyzed. For each consensus nodule (=4-reader agreement), z-phase was defined as the fractional position of the nodule center within the reconstruction cycle, folded to [0, 0.5]. Detection sensitivity was stratified by z-phase bin, reconstruction interval (1mm, 3mm, 5mm), and by the ratio of reconstruction interval to nodule diameter (d/D). Results: At 5mm reconstruction interval, sensitivity was 71.6% vs 84.8% at 1mm baseline. Within the 5mm condition, sensitivity varied by 17.6 percentage points across z-phase bins. Stratified by d/D ratio, sensitivity was 92.4% for d/D 0.5, 78.0% for 0.5 = d/D 1.0, and 61.4% for d/D = 1.0, with a systematic z-phase effect present only in the d/D = 1.0 stratum. Conclusions: AI detection sensitivity depends on the ratio of reconstruction interval to nodule diameter. When this ratio approaches or exceeds 1.0 – as occurs for 3-6mm nodules at 5mm reconstruction – z-phase becomes the dominant source of per-study detection variance. This stochastic effect is invisible to protocol-level quality metrics and not reflected in AI confidence scores.
[CV-232] A Proof-of-Concept Study of Multitask Learning for Cranial Synthetic CT Generation Across Heterogeneous MRI Field Strengths
【速读】:该论文旨在解决跨场强(field strength)和成像协议(acquisition protocol)条件下磁共振成像(MRI)到计算机断层扫描(CT)图像合成的泛化能力不足问题,这在颅脑应用中尤为关键,如衰减校正、放疗计划制定和图像引导干预。其解决方案的核心在于将CT合成建模为一个模块化且结构耦合的问题,并提出一种深度学习框架,通过设计可适应不同MRI条件的机制来保持解剖一致性,从而显著提升模型在多中心异构MRI数据上的性能与鲁棒性。
链接: https://arxiv.org/abs/2605.00923
作者: Zhuoyao Xin,Yiren Zhang,Christopher Wu,Dong Liu,Chunming Gu,Elena Greco,Erik H. Middlebrooks,Jun Hua,Jia Guo
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in Medical Physics (2026). DOI: https://doi.org/10.1002/mp.70429
Abstract:Accurate synthesis of computed tomography (CT) images from magnetic resonance imaging (MRI) is clinically valuable for cranial applications such as attenuation correction, radiotherapy planning, and image-guided interventions. However, heterogeneity across MRI field strengths and acquisition protocols limits the generalizability of existing methods. In this study, we formulate cranial CT synthesis as a modular, structurally coupled problem and propose a deep learning framework to improve robustness across heterogeneous MRI conditions. The model is designed to adapt to variations in field strength and imaging protocols while preserving anatomical consistency. Experiments on multi-site datasets demonstrate improved performance and generalization compared with conventional approaches. The proposed method enables reliable CT synthesis across heterogeneous MRI settings, supporting broader clinical translation.
[CV-233] SPAT: A Semantic Port-Aware Adaptive-Rate Transmission Protocol for Semantic Communication
【速读】:该论文旨在解决6G语义通信中传统传输机制因依赖显式端口头部信息而易受头部损坏导致分组丢失的问题。解决方案的关键在于提出一种语义端口感知自适应速率传输协议(Semantic Port-Aware Adaptive-Rate Transmission Protocol, SPAT),其核心创新包括:将源端口与目的端口信息联合嵌入语义表征中,从而降低对显式端口头部的依赖;设计上行和下行差异化语义处理机制,其中上行引入端口识别用于服务识别,下行采用目标感知条件门控实现选择性解码;同时集成自适应速率控制器,根据信道状态和特征重要性动态调整传输语义通道数量,以提升传输鲁棒性与效率。
链接: https://arxiv.org/abs/2605.00897
作者: Yunhao Wang,Shuai Ma,Bin Shen,Shouhan Shi,Youlong Wu,Guangming Shi,Xiang Cheng
机构: Peking University, Shenzhen (北京大学深圳研究生院); Peng Cheng Laboratory (鹏城实验室); China University of Mining and Technology (中国矿业大学); ShanghaiTech University (上海科技大学); Xidian University (西安电子科技大学)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:With the evolution of 6G, semantic communication has emerged as a promising paradigm by prioritizing the delivery of task-relevant meaning over strict bit-level correctness. However, existing transport mechanisms still rely on explicit port headers and bit-level validation, making them vulnerable to header corruption and the resulting packet loss. To address this issue, this paper proposes a Semantic Port-Aware Adaptive-Rate Transmission Protocol (SPAT) for semantic communication. The proposed framework jointly embeds source and destination port information into semantic representations, thereby reducing dependence on explicit port headers while enabling robust port-aware transmission. Furthermore, a differentiated semantic processing mechanism is developed for uplink and downlink scenarios, where port identification is introduced for uplink service recognition and destination-aware conditional gating is designed for downlink selective decoding. In addition, an adaptive-rate controller is incorporated to dynamically adjust the number of transmitted semantic channels according to channel conditions and feature importance, thereby improving both robustness and transmission efficiency. Experimental results on the AFHQ and ImageNet-10 datasets, together with real-world experimental measurements, demonstrate that SPAT consistently outperforms TCP, UDP, and SITP in reconstruction quality across different SNRs while maintaining low-latency transmission.
[CV-234] A Coupled Fourth Order Telegraph Diffusion Framework Using Grayscale Indicators for Image Despeckling
【速读】:该论文旨在解决相干成像系统(如合成孔径雷达(SAR)和医学超声)中斑点噪声(Speckle Noise)对图像质量的严重影响问题。传统基于二阶偏微分方程(PDE)的去斑方法虽广泛应用,但常引入阶梯状伪影并模糊细节。解决方案的关键在于提出一种非线性、四阶耦合双曲-抛物型PDE模型:通过一个四阶扩散项实现高效去噪与平滑强度过渡,同时引入另一个演化方程用于细化边缘指示函数以保护纹理和结构特征;扩散系数自适应地结合图像灰度强度变量 $ u $ 与灰度基指示函数,从而实现结构感知去噪,避免块状伪影并保留细粒度信息。该模型在理论层面通过Schauder不动点定理证明了弱解的存在性,并采用高斯-赛德尔迭代的有限差分法实现高效计算,实验表明其在PSNR、MSSIM和斑点指数上显著优于现有耦合二阶PDE模型(HPCPDE)和四阶电报扩散模型(TDFM)。
链接: https://arxiv.org/abs/2605.00881
作者: Manish Kumar,Rajendra K. Ray
机构: Indian Institute of Technology Mandi(印度理工学院曼迪分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
Abstract:Speckle noise severely limits the quality of images acquired from coherent imaging systems such as Synthetic Aperture Radar (SAR) and medical ultrasound. Traditional second-order PDE-based despeckling approaches, although popular, often introduce staircase artifacts and blur fine details. To overcome these limitations, we present a nonlinear, fourth-order coupled hyperbolic-parabolic PDE model that effectively reduces noise while preserving the structure. The framework consists of two evolution equations: one governing fourth-order diffusion for effective speckle reduction and smooth intensity transitions, and another refining an edge indicator to protect textures and structural features. The diffusion coefficient is adaptively constructed using both the image intensity variable u and a grayscale-based indicator function, ensuring structure-aware denoising while avoiding blocky artifacts and preserving fine details. We also prove the existence of a weak solution to the proposed model by applying Schauder fixed-point theorem. A finite-difference scheme with Gauss Seidel iteration is employed for efficient implementation. We compare the proposed model with the existing coupled second-order PDE model (HPCPDE) and the fourth-order telegraph diffusion model (TDFM). The results show that our model consistently outperforms these approaches. Experiments on standard grayscale images, real SAR and ultrasound data, as well as speckle-corrupted color images, demonstrate that the proposed method achieves superior performance over conventional PDE-based techniques in terms of PSNR, MSSIM, and Speckle Index.
[CV-235] Multi-View Hierarchical Representation Learning of Fetal Hemodynamics for Maternal Hypertension Detection at the Edge
【速读】:该论文旨在解决妊娠期高血压疾病(Hypertensive Disorders of Pregnancy, HDP)诊断依赖间歇性袖带式血压测量所导致的偏差大、无法捕捉连续生理动态的问题。其核心解决方案是提出AutoHyPE——一种分层注意力网络,通过引入基于原型的对比学习和多视角策略,增强在长尾类别分布与生物变异性下的表征鲁棒性;该方法利用胎儿一维多普勒超声信号中蕴含的血流动力学特征,在无需额外设备的情况下实现了对母体高血压状态的准确检测(AUROC=0.80),并验证了其在边缘部署场景下的稳定性,为基于现有低成本超声技术实现连续、客观的产前健康监测提供了新范式。
链接: https://arxiv.org/abs/2605.00872
作者: Alireza Rafiei,Anahí Venzor Strader,Esteban Castro Aragón,Victoriana Rosibely Sut Serech,Enma Carolina Coyote Ixen,Reza Sameni,Peter Rohloff,Gari D. Clifford,Nasim Katebi
机构: Emory University(埃默里大学); Georgia Institute of Technology(佐治亚理工学院); Center for Indigenous Health Research, Wuqu’ Kawoq | Maya Health Alliance(土著健康研究中心,Wuqu’ Kawoq | 玛雅健康联盟)
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hypertensive disorders of pregnancy remain a leading cause of maternal and fetal morbidity worldwide, yet diagnosis relies on intermittent cuff-based blood pressure measurements that are prone to bias and fail to capture continuous physiological dynamics. Growing evidence suggests that fetal cardiovascular activity is associated with maternal-placental hemodynamics and may encode markers of maternal hypertension. To analyze this, we collected a large-scale dataset of fetal one-dimensional Doppler ultrasound recordings paired with maternal blood pressure from 3,255 pregnant women across 8,170 antenatal visits in rural Guatemala. We developed AutoHyPE, a hierarchical attention network that models short- and long-term signal structure, incorporating a novel prototype-based contrastive learning and multi-view strategy to enhance representation robustness under long-tailed class distribution and biological variability. AutoHyPE achieved an AUROC of 0.80 for maternal hypertension detection, outperforming baseline approaches while maintaining balanced performance across classes, with no performance degradation in an edge deployment scenario. Our findings demonstrated that fetal cardiac mechanical activity contains hemodynamic features indicative of maternal hypertension status. This supports a promising paradigm shift toward continuous, objective monitoring of maternal health using existing, low-cost ultrasound technology and introduces a complementary approach to traditional methods based on blood pressure measurements, advancing scalable prenatal care.
[CV-236] NAKUL-Med: Spectral-Graph State Space Models with Dynamics Kernels for Medical Signals CVPR
【速读】:该论文旨在解决状态空间模型(State Space Models, SSMs)在处理多通道生理信号时面临的三大局限:固定核无法捕捉多尺度时间动态、马尔可夫状态更新限制全局上下文对周期性振荡的建模能力,以及通道独立处理忽略电极拓扑的空间结构信息。其解决方案的关键在于提出NAKUL架构,通过三个核心创新实现突破:(1) 动态核生成机制——利用元网络根据输入统计特征自适应加权不同时间尺度(3、5、7、11个时间步)的并行SSM分支,实现多尺度时间动态的灵活建模;(2) 频谱上下文建模——基于快速傅里叶变换(FFT)与可学习高斯频率带通滤波器,在O(N log N)复杂度下捕获全局周期模式;(3) 图引导空间注意力机制——利用固定电极拓扑提供空间偏置,指导多头注意力进行有原则的跨通道交互。该方法在脑机接口竞赛IV-2a运动想象任务中达到91.7%准确率,优于现有主流模型且参数更少、推理更快,同时在多种医疗信号任务中展现出良好泛化能力。
链接: https://arxiv.org/abs/2605.00871
作者: Badri N. Patro,Vijay S. Agneeswaran
机构: Microsoft(微软)
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted CVPR Finding Track
Abstract:State space models (SSMs) achieve linear-time complexity but struggle with multi-channel physiological signals due to three limitations: fixed kernels cannot capture multi-scale temporal dynamics (motor preparation over hundreds of milliseconds vs. execution transients in tens of milliseconds), Markovian state updates restrict global context for periodic oscillations, and channel-independent processing ignores spatial electrode topology. We introduce NAKUL, extending SSMs for medical signal analysis through three contributions: (1) Dynamic Kernel Generation-parallel SSM branches with varying kernel sizes (3, 5, 7, 11 timesteps) are weighted by a meta-network that analyzes input statistics, enabling adaptive temporal scale selection; (2) Spectral Context Modeling-FFT-based operations with learnable Gaussian frequency band filters capture global periodic patterns in O(N \log N) complexity; (3) Graph-Guided Spatial Attention-fixed electrode topology provides spatial biases to multi-head attention for principled cross-channel interaction. On BCI Competition IV-2a motor imagery (our primary benchmark), NAKUL achieves 91.7 \pm 0.6% accuracy, matching EEG-Conformer (92.1 \pm 0.7%) while using 28% fewer parameters (2.5M vs 3.5M) and 2.0 \times faster inference (4.3ms vs 8.7ms). The model generalizes to EEG emotion recognition (83.6%), multimodal EEG-fMRI (91.4%), and medical imaging (92.8% on ultrasound), demonstrating architectural versatility. Ablations show dynamic kernels contribute +2.6% and exhibit interpretable scale selection patterns correlated with known neural dynamics.
[CV-237] Robust Cross-Domain WiFi Fall Detection via Physics-Driven Attention-Enhanced Transformers
【速读】:该论文旨在解决基于WiFi信道状态信息(CSI)的无设备跌倒检测系统在未见环境中性能严重下降的问题,其核心瓶颈在于静态背景过拟合和非直视路径(NLoS)信号衰减导致的域适应能力差。解决方案的关键在于提出一种具有物理感知特性的鲁棒且可泛化的框架,其核心创新包括:1)设计了一个物理驱动的动态方差门(Dynamic Variance Gate, DVG),通过计算局部时序方差作为软注意力掩码,有效抑制静态环境直流分量并增强人体运动动态特征;2)引入物理感知数据增强策略,促使网络学习对环境无关的形态学特征而非噪声;3)集成卷积块注意力模块(CBAM)以优化时空特征表示,再结合Transformer进行序列建模,从而实现跨域高精度跌倒检测,在完全未见过的环境中准确率达98.8%,且无需目标域微调。
链接: https://arxiv.org/abs/2605.00869
作者: Yingzhe Wang,Cunhua Pan,Ruijing Liu,Shaokai Li,Hong Ren,Kezhi Wang,Jiangzhou Wang
机构: National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (东南大学移动通信国家重点实验室); Department of Computer Science, Brunel University London, UB8 3PH Uxbridge, U.K. (布鲁内尔大学计算机科学系); Pervasive Communication Research Center, Purple Mountain Laboratories, Nanjing 211111, China (紫金山实验室)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Device-free fall detection utilizing WiFi Channel State Information (CSI) has emerged as a promising, privacy-preserving solution for elderly health monitoring in the Internet of Things (IoT) era. However, existing deep learning approaches suffer from severe performance degradation when deployed in unseen environments due to static background overfitting and Non-Line-of-Sight (NLoS) signal attenuation. To address these critical bottlenecks, we propose a robust, domain-generalizable framework featuring a novel Attention-Enhanced CNN-Transformer hybrid architecture. First, we design a physics-driven \textbfDynamic Variance Gate (DVG) to dynamically calculate local temporal variance, acting as a soft-attention mask that eliminates static environmental DC components while amplifying dynamic human motion. Second, we introduce a Physics-Aware Data Augmentation strategy to force the network to learn invariant morphological signatures rather than environment-specific noise. Furthermore, a Convolutional Block Attention Module (CBAM) is integrated to refine spatiotemporal features prior to Transformer-based sequence modeling. Extensive cross-domain evaluations across four distinct indoor environments demonstrate that our method achieves 97.6% accuracy in NLoS scenarios and 98.8% in completely unseen environments without target-domain fine-tuning. Finally, we deploy the proposed framework on an edge computing system equipped with commercial WiFi NICs. Real-world live inference field tests confirm the system’s robustness against unseen environmental layouts and its capability for continuous, low-latency whole-home safety monitoring.
人工智能
[AI-0] Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters ICPR2026
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)模型在不同环境中的泛化能力不足问题,特别是由于算法和超参数配置对性能的敏感性所导致的泛化差距。现有研究虽已关注RL泛化性,但缺乏对具体配置因素如何量化贡献的系统分析与利用。解决方案的关键在于提出一种可解释框架,基于SHapley Additive exPlanations (SHAP) 方法量化各配置项对性能的影响,并建立Shapley值与泛化能力之间的理论联系;进而通过SHAP引导的配置选择策略,实现跨任务与环境的稳定性能提升,从而增强RL模型的泛化能力并为实践提供可操作指导。
链接: https://arxiv.org/abs/2605.02867
作者: Lingxiao Kong,Cong Yang,Oya Deniz Beyan,Zeyd Boukhers
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 15 pages, 7 figures, accepted by ICPR 2026
Abstract:Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection. To address this limitation, we propose an explainable framework that evaluates RL performance across robotic environments using SHapley Additive exPlanations (SHAP) to quantify configuration impacts. We establish a theoretical foundation connecting Shapley values to generalizability, empirically analyze configuration impact patterns, and introduce SHAP-guided configuration selection to enhance generalization. Our results reveal distinct patterns across algorithms and hyperparameters, with consistent configuration impacts across diverse tasks and environments. By applying these insights to configuration selection, we achieve improved RL generalizability and provide actionable guidance for practitioners.
[AI-1] Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross–Language Code Clone Detection
【速读】:该论文旨在解决跨语言代码克隆检测(Cross-language Code Clone Detection, X-CCD)中因不同编程语言间表面语法差异大而导致语义相似性难以识别的问题,同时应对大型语言模型(Large Language Models, LLMs)作为黑箱系统在成本、可复现性、隐私保护及输出格式不可靠等方面的局限。其核心解决方案是提出一种知识蒸馏框架,将DeepSeek-R1模型的推理能力迁移至轻量级开源学生模型(如Phi3和Qwen-Coder),并通过LoRA适配器进行微调;关键创新在于构建面向推理的合成训练数据,并引入响应稳定化方法(包括强制结论提示、二分类头与对比分类头),显著提升了紧凑模型在X-CCD任务中的可靠性与效率,尤其在分布外场景下表现更优,且分类头变体大幅降低推理时间。
链接: https://arxiv.org/abs/2605.02860
作者: Mohamad Khajezade,Fatemeh H. Fard,Mohamed Sami Shehata
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 38 pages
Abstract:Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels. To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD. Using cross-language code pairs derived from Project CodeNet, we construct reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA adapters. We further introduce response stabilization methods, including forced conclusion prompting, a binary classification head, and a contrastive classification head, and evaluate model behavior using both predictive metrics and response rate. Experiments on Python–Java, Rust–Java, Rust–Python, and Rust–Ruby show that knowledge distillation consistently improves the reliability of compact models and often improves predictive performance, especially under distribution shift. In addition, classification-head variants substantially reduce inference time compared to generation-based inference. Overall, our results show that reasoning-oriented distillation combined with response stabilization makes compact open-source models more practical and reliable for X-CCD detection. Comments: 38 pages Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE) Cite as: arXiv:2605.02860 [cs.AI] (or arXiv:2605.02860v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.02860 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-2] From Sensors to Insight: Rapid Edge-to-Core Application Development for Sensor-Driven Applications
【速读】:该论文旨在解决传感器数据在边缘到云连续体中从原始流转化为洞察的难题,尤其是在异构基础设施部署与执行管理方面对跨领域专业知识的高要求,从而阻碍了快速原型开发。其解决方案的关键在于提出一种以经验驱动的方法论,结合基于模式的工作流工程(pattern-based workflow engineering)与AI辅助开发(AI-assisted development),通过Pegasus在FABRIC测试床实现,并复用已有的Orcasound水听器工作流作为模板,构建适用于空气质量、地震和土壤湿度监测等场景的可重用抽象工作流结构,再通过模块化配置和放置机制将其扩展至边缘资源,显著降低非专家用户的入门门槛并支持分布式环境中迭代探索传感器驱动应用。
链接: https://arxiv.org/abs/2605.02859
作者: Komal Thareja,Anirban Mandal,Ewa Deelman
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Scientists increasingly rely on sensor-based data, yet transforming raw streams into insights across the edge-to-cloud continuum remains difficult. Provisioning heterogeneous infrastructure and managing execution on emerging platforms like Data Processing Units typically requires cross-domain expertise, creating significant barriers to rapid prototyping. This paper introduces an experience-driven methodology for the rapid development of sensor-driven applications. By combining pattern-based workflow engineering with AI-assisted development-implemented via Pegasus on the FABRIC testbed - we utilize an existing Orcasound hydrophone workflow as a reusable template. We introduce a pattern-based engineering methodology to generate and refine workflows for air quality, earthquake, and soil moisture monitoring. Furthermore, we show how these abstract structures are extended to edge resources through modular configuration and placement. Our evaluation focuses on user productivity and practical lessons rather than peak performance. Through these case studies, we illustrate how AI-assisted, pattern-based development lowers the entry barrier for non-experts and enables iterative exploration of sensor-driven applications across distributed infrastructures. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE) Cite as: arXiv:2605.02859 [cs.DC] (or arXiv:2605.02859v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.02859 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-3] (POSTER) From Sensors to Insight: Rapid Edge-to-Core Application Development for Sensor-Driven Applications
【速读】:该论文旨在解决传感器数据从边缘到云端的处理流程中,因跨领域专业知识要求高而导致的开发效率低下问题(即如何高效地将原始传感器流转化为可操作的洞察)。其解决方案的关键在于提出一种基于模式的、AI辅助的工作流开发方法,通过Pegasus工作流引擎在FABRIC测试平台上实现“意图优先”(intent-first)的设计范式,将原本以代码为中心的开发流程转变为以任务目标为导向的五步迭代开发循环。该方法利用可复用的模板(如Orcasound水听器工作流)快速生成并优化空气品质、地震和土壤湿度监测等应用工作流,并通过配置而非重写代码的方式适配边缘资源(如BlueField-3 DPU和Raspberry Pi),从而显著缩短开发周期(1–1.5天/工作流)且保持执行的严谨性和可移植性。
链接: https://arxiv.org/abs/2605.02844
作者: Komal Thareja,Anirban Mandal,Ewa Deelman
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Scientists increasingly rely on sensor-based data; however transforming raw streams into insights across the edge-to-cloud continuum remains difficult due to the breadth of expertise required to coordinate the necessary data and computation flow. This paper introduces a pattern-based, AI-assisted methodology for rapid development of sensor-driven applications. Using Pegasus workflows executing on the FABRIC testbed, we demonstrate a 5-step development loop that shifts workflow construction and deployment from code-first to intent-first design. Starting from an existing Orcasound hydrophone workflow as a reusable template, we generate and refine workflows for air quality, earthquake, and soil moisture monitoring applications. We further show how these workflows extend to edge resources-including BlueField-3 DPUs and Raspberry Pis-through configuration and placement rather than workflow redesign. Our evaluation, from the perspective of a novice Pegasus user, shows that AI-assisted pattern reuse compresses multi-stage workflow development to 1-1.5 days per workflow while preserving the rigor and portability of workflow-based execution.
[AI-4] Compress Then Adapt? No Do It Together via Task-aware Union of Subspaces
【速读】:该论文旨在解决当前参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)与低秩压缩(Low-Rank Compression)通常采用串行流程所带来的子空间错位问题:即先压缩后微调的做法可能导致压缩后的低维子空间与下游任务目标不匹配,从而浪费全局参数预算并降低性能。其解决方案的关键在于提出JACTUS(Joint Adaptation and Compression with a Task-aware Union of Subspaces),通过联合优化压缩与适配过程,利用小规模校准集估计输入和预激活梯度协方差,构建与预训练权重子空间正交的联合子空间,并在此子空间内进行投影低秩近似;同时基于每参数边际增益全局分配秩资源,仅训练一个紧凑的核心矩阵,从而显式地将压缩保留的方向与适配所需方向耦合,实现高精度、低参数量且无需保留完整冻结权重的部署模型。
链接: https://arxiv.org/abs/2605.02829
作者: Jingze Ge,Yun Liu,Xue Geng,Wanqi Dong,Wang Zhe Mark,Min Wu,Xulei Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, supplementary material included
Abstract:Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence. This decoupled practice first compresses and then fine-tunes adapters, potentially misaligning the compressed subspace with downstream objectives and squandering a global parameter budget. To overcome this limitation, we introduce JACTUS (Joint Adaptation and Compression with a Task-aware Union of Subspaces), a single framework that unifies compression and adaptation. From a small calibration set, JACTUS estimates input and pre-activation gradient covariances, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates rank globally by marginal gain per parameter, and trains only a compact core matrix. This explicitly mitigates the potential misalignment between the compressed subspace and downstream objectives by coupling the directions preserved for compression with those required for adaptation, yielding a deployable low-rank model that avoids retaining full frozen weights while enabling fast and robust tuning. On vision, JACTUS attains an average 89.2% accuracy on ViT-Base across eight datasets at 80% retained parameters, surpassing strong 100% PEFT baselines (e.g., DoRA 87.9%). On language, JACTUS achieves an 80.9% average on Llama2-7B commonsense QA at the same 80% retained-parameter budget, outperforming 100% PEFT (e.g., DoRA 79.7%) and exceeding prior compress-then-finetune pipelines under the same ratained-parameter budget. We will release code.
[AI-5] First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint
【速读】:该论文旨在解决概率性赋值(如Shapley值和半值)在现代机器学习应用中因精确计算需对指数级数量的联盟进行效用评估而导致的高计算成本问题,从而依赖蒙特卡洛近似方法。现有估计器虽采用不同识别策略(如加权平均、自归一化加权、回归调整和加权最小二乘),但其核心问题在于缺乏统一的误差结构分析以指导高效估计。论文的关键创新在于发现这些不同构造共享一个一致的一阶误差结构,其中主导项为由抽样分布和工作代理函数决定的增广逆概率加权影响项;基于此结构,作者提出效率感知的代理调整估计器(EASE),通过直接优化抽样分布与代理函数以最小化一阶均方误差(MSE),从而实现比现有最优估计器更优的统计效率。
链接: https://arxiv.org/abs/2605.02827
作者: Ziqi Liu,Kiljae Lee,Yuan Zhang,Weijing Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Probabilistic values, including Shapley values and semivalues, provide a model-agnostic framework to attribute the behavior of a black-box model to data points or features, with a wide range of applications including explainable artificial intelligence and data valuation. However, their exact computation requires utility evaluations over exponentially many coalitions, making Monte Carlo approximation essential in modern machine learning applications. Existing estimators are often developed through different identification strategies, including weighted averages, self-normalized weighting, regression adjustment, and weighted least squares. Our key observation is that these seemingly distinct constructions share a common first-order error structure, in which the leading term is an augmented inverse-probability weighted influence term determined by the sampling law and a working surrogate function. This first-order representation yields an explicit expression for the leading mean squared error (MSE), which characterizes how the sampling law and the surrogate jointly determine statistical efficiency. Guided by this criterion, we propose an Efficiency-Aware Surrogate-adjusted Estimator (EASE) that directly chooses the sampling law and surrogate to minimize the first-order MSE. We demonstrate that EASE consistently outperforms state-of-the-art estimators for various probabilistic values.
[AI-6] SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering
【速读】:该论文旨在解决大语言模型在知识图谱(Knowledge Graph, KG)推理过程中,因过程奖励模型(Process Reward Model)存在风险补偿效应而导致的错误步骤被后续正确步骤掩盖的问题,尤其在医疗和法律等高风险场景下更为显著。其解决方案的关键在于提出一种基于模式感知的累积过程奖励模型(Schema-aware Cumulative Process Reward Model, SCPRM),该模型通过条件化当前推理前缀(reasoning prefix),并引入当前推理步骤与查询中隐式目标之间的模式距离(schema distance),实现对推理路径的累积性、前瞻性奖励评估,从而引导更准确且风险敏感的多跳推理路径探索。进一步地,将SCPRM集成至蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)框架中形成SCPRM-MCTS,在多个KG问答(KGQA)任务上显著提升了Hits@k指标。
链接: https://arxiv.org/abs/2605.02819
作者: Jiujiu Chen,Yazheng Liu,Sihong Xie,Hui Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models excel at complex reasoning, yet evaluating their intermediate steps remains challenging. Although process reward models provide step-wise supervision, they often suffer from a risk compensation effect, where incorrect steps are offset by later correct ones, assigning high rewards to flawed reasoning paths. This issue is further exacerbated in knowledge graph (KG) reasoning, as there may exist multiple paths between the start and end entities in the KGs, and a risky step can make the reasoning path flawed. Those limitations are problematic in risk-sensitive tasks such as medical and legal KG reasoning. To address the issues, we propose a Schema-aware Cumulative Process Reward Model (SCPRM) that evaluates reasoning paths by conditioning on the reasoning prefix , and incorporating schema distance between current reasoning step and the implicit target parsed from the query, which provides cumulative and future rewards to guide the path explorations. We further integrate SCPRM into Monte Carlo Tree Search (MCTS) as SCPRM-MCTS to conduct multi-hop reasoning on KGs for question answering (QA) tasks. Across medical and legal KGQA and CWQ, SCPRM-MCTS improves the performance of Hits@k by an average of 1.18% over strong baselines, demonstrating more accurate and risk-sensitive reasoning evaluation.
[AI-7] AIs and Humans with Agency
【速读】:该论文试图解决的问题是:如何在人工智能(AI)系统中实现与人类相当的代理能力(agency),即让AI具备自主决策和行动的能力。当前大型语言模型(LLM)在模拟人类行为方面取得进展,但其代理能力仍受限于缺乏与现实世界情境深度融合的架构设计。论文指出,人类的代理能力需依赖前额叶皮层(frontal lobe)的长期发育,而现有AI系统难以复制这一过程。解决方案的关键在于构建一种新型架构,使AI在每个真实应用场景中能与人类参与者协同制定行动方案和计划,从而实现人机共治的动态代理机制。
链接: https://arxiv.org/abs/2605.02810
作者: David Mumford
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper compares agency in humans with potential agency in AI programs. Human agency takes many years to develop, as the frontal lobe is activated. Early attempts to endow LLMs agency have met serious obstacles. Progress requires a new architecture where actions and plans are formulated jointly with the human actors in each real world setting.
[AI-8] Static Analysis of Recursive SHACL KR2026
【速读】:该论文旨在解决SHACL文档之间的蕴含(implication)判定问题,即判断所有满足一个SHACL文档的RDF数据图是否也必然满足另一个SHACL文档。不同于以往仅关注形状表达式(shape expressions)蕴含的研究,本文首次系统地考虑了包含递归形状定义和目标(targets)的完整SHACL文档。研究发现,在支持模型语义(supported model semantics)和稳定模型语义(stable model semantics)下,该问题是不可判定的,即使在使用描述逻辑ALCIO作为形状表达式的片段中也是如此;然而,在最小模型语义(well-founded semantics)下,问题变为可判定,并且具有单指数时间复杂度。其关键解决方案在于提出了一种将SHACL在最小模型语义下的语义翻译为全混合mu-演算(full hybrid mu-calculus)的方法,揭示了最小模型与固定点模态逻辑之间新的联系,并设计了一个基于自动机的最优决策过程,从而实现了该问题的高效判定。
链接: https://arxiv.org/abs/2605.02787
作者: Anouk Oudshoorn,Magdalena Ortiz,Mantas Simkus
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures, long version of work to be published in the proceedings of KR 2026
Abstract:SHACL (Shapes Constraint Language) expresses constraints on RDF data by means of so-called shapes. Its central service is validation: verifying whether a data graph complies with a SHACL document. But so far, there are no static analysis services to compare documents. In this paper, we study the following problem: decide whether all graphs that validate one SHACL document also validate another. Unlike previous works that have considered the implication of shape expressions only, we consider documents comprising (recursive) shape definitions and targets. We show that implication (a.k.a. containment) is undecidable under the supported and the stable model semantics, even for the fragment that uses the description logic ALCIO for shape expressions. Under the well-founded semantics, in surprising contrast, it is decidable in single exponential time. Our key technical contribution is a translation of SHACL under the well-founded semantics into the full hybrid mu-calculus, revealing a novel link between well-founded models and a fixed point modal logic, and a worst-case optimal automata-based decision procedure.
[AI-9] Fine-Grained Graph Generation through Latent Mixture Scheduling
【速读】:该论文旨在解决结构感知图生成(structure aware graph generation)中缺乏细粒度结构控制的问题,即现有方法仅能对图的拓扑属性提供粗粒度调控,难以在保证图生成质量的同时实现精确的结构约束满足。其解决方案的关键在于提出一种新型的条件变分自编码器(conditional variational autoencoder),通过动态对齐图驱动与属性驱动的表示来优化解码器的潜在空间,从而提升图保真度与控制满足度;具体而言,该方法引入了一个混合调度器(mixture scheduler),逐步融合图先验与控制先验,实现了从粗到细的结构控制能力。
链接: https://arxiv.org/abs/2605.02780
作者: Nidhi Vakil,Hadi Amiri
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Structure aware graph generation aims to generate graphs that satisfy given topological properties. It has applications in domains such as drug discovery, social network modeling, and knowledge graph construction. Unlike existing methods that only provide coarse control over graph properties, we introduce a novel conditional variational autoencoder for fine-grained structural control in graph generation. The approach refines the decoder’s latent space by dynamically aligning graph- and property-driven representations to improve both graph fidelity and control satisfaction. Specifically, the approach implements a mixture scheduler that progressively integrates graph and control priors. Experiments on five real-world datasets show the efficacy of the proposed model compared to recent baselines, achieving high generation quality while maintaining high controllability.
[AI-10] A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance
【速读】:该论文旨在解决离线安全强化学习中策略在部署时需适应动态变化的安全预算(cost limit)的问题,尤其针对现有基于扩散模型的引导方法在处理奖励优化与约束满足之间的冲突时导致的安全合规性不可靠问题。解决方案的关键在于提出Safe Decoupled Guidance Diffusion (SDGD),其核心思想是将自适应安全轨迹生成重新建模为从受限轨迹分布中采样,其中安全预算定义了轨迹区域,而奖励则在此区域内引导偏好选择;通过分类器无关引导(classifier-free guidance)对成本限制进行条件化以偏向满足预算的轨迹,同时使用奖励梯度引导提升回报。此外,引入可行轨迹重标注(Feasible Trajectory Relabeling, FTR)机制来重塑奖励目标,防止奖励引导导致累积成本漂移,并提供了首个一阶采样时间分析证明FTR在前缀恢复对齐条件下可抑制奖励诱导的成本漂移。
链接: https://arxiv.org/abs/2605.02777
作者: Rufeng Chen,Zhaofan Zhang,Zhejiang Yang,Hechang Chen,Sihong Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, where the budget restricts the trajectory region, and reward shapes preferences within that region. This perspective motivates Safe Decoupled Guidance Diffusion (SDGD), which conditions classifier-free guidance on the cost limit to bias sampling toward trajectories satisfying the specified limit, while using reward-gradient guidance to refine trajectories for higher return. Because direct reward guidance can increase return while also steering samples toward trajectories with higher cumulative cost, we introduce Feasible Trajectory Relabeling (FTR) to reshape reward targets and discourage such directions. We further provide a first-order sampling-time analysis showing that FTR suppresses reward-induced cost drift under a prefix-restorative alignment condition. Extensive evaluations on the DSRL benchmark show that SDGD achieves the strongest safety compliance among baselines, satisfying the constraint on 94.7% of tasks (36/38), while obtaining the highest reward among safe methods on 21 tasks.
[AI-11] Bolek: A Multimodal Language Model for Molecular Reasoning
【速读】:该论文旨在解决分子属性预测模型在药物发现中缺乏可审计性的问题,即传统预测模型仅输出分数而无推理依据,而语言模型虽能生成自然语言解释但常与输入分子结构脱节。其解决方案的关键在于提出一种名为Bolek的紧凑多模态语言模型,通过将摩根指纹(Morgan fingerprint)嵌入注入到指令微调的文本解码器中,使自然语言推理严格锚定于分子结构特征;同时,在分子对齐任务(如分子描述、RDKit描述符预测和子结构检测)及下游15个TDC二分类任务上引入基于具体分子特征的合成思维链(chain-of-thought)监督信号,从而显著提升模型性能与解释的可验证性。
链接: https://arxiv.org/abs/2605.02745
作者: Frederic Grabowski,Jacek Szczerbiński,Maciej Jaśkowski,Kalina Jasińska-Kobus,Paweł Dąbrowski-Tumański,Tomasz Jetka,Bartosz Topolski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:
Abstract:Molecular property models increasingly support high-stakes drug-discovery decisions, but their outputs are often difficult to audit: classical predictors return scores without rationale, while language models can produce fluent explanations weakly grounded in the input molecule. We introduce Bolek, a compact multimodal language model that grounds natural-language reasoning in molecular structure by injecting a Morgan fingerprint embedding into an instruction-tuned text decoder. Bolek is fine-tuned on molecular alignment tasks, including molecule description, RDKit descriptor prediction, and substructure detection, and on downstream reasoning over 15 TDC binary classification tasks using synthetic chains-of-thought anchored in concrete molecular features. Across these tasks, Bolek outperforms its Qwen3-4B-Instruct base on all endpoints in yes/no mode and on 13 of 15 in chain-of-thought mode, raising mean ROC/PR AUC from 0.55 to 0.76. It also outperforms TxGemma-9B-Chat on 13 of 15 binary classification tasks despite being less than half its size. Bolek’s explanations are more grounded than those of the baseline LLMs: it cites numerical descriptors 10-100x more often per chain-of-thought, and the cited values agree strongly with RDKit for key descriptors such as TPSA, MolLogP, and MolWt (Spearman rho = 0.87-0.91). Generalisation extends beyond the training panel: on 15 unseen TDC classification endpoints, Bolek matches TxGemma on five, and it produces non-trivial rank correlations on three held-out regression endpoints despite never seeing downstream regression during training. These results suggest that targeted modality injection and reasoning supervision tied to verifiable molecular features can yield compact, auditable molecular reasoning models. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM) Cite as: arXiv:2605.02745 [cs.LG] (or arXiv:2605.02745v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.02745 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Bartosz Topolski [view email] [v1] Mon, 4 May 2026 15:46:39 UTC (213 KB)
[AI-12] AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent -Driven Development
【速读】:该论文旨在解决生成式 AI 在自动化软件工程中忽视长期可维护性的问题,特别是AI生成代码中存在的技术债(Technical Debt)及其随模型能力提升而加剧的架构退化现象。其核心发现是存在一个“推理-复杂度权衡”(Reasoning-Complexity Trade-off),即随着大语言模型能力增强,生成代码趋于臃肿和耦合,进而导致结构质量下降;并提出“体积-质量反比定律”(Volume-Quality Inverse Law),指出代码体量是结构劣化的强预测因子。解决方案的关键在于:从单纯追求功能正确性的提示驱动生成范式,转向赋予AI代理显式的架构预见能力(Architectural Foresight),以实现既功能性又可维护性的软件构建。
链接: https://arxiv.org/abs/2605.02741
作者: Yuecai Zhu,Nikolaos Tsantalis,Peter C. Rigby
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long term maintainability. This paper presents a systematic audit of technical debt in AI-generated software, revealing that AI does not eliminate flaws but rather introduces a distinct machine signature of defects. Our multi-scale analysis, spanning single-file algorithmic tasks and complex, agent generated systems, identifies a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code. This architectural decay is so pronounced that we establish a Volume-Quality Inverse Law, where code volume is a near perfect predictor of structural degradation. Crucially, we demonstrate that neither functional correctness nor detailed prompting mitigates this decay. These findings challenge the current paradigm of prompt-driven generation, reframing the central problem of AI-based software engineering from one of code generation to one of architectural complexity management. We conclude that future progress depends on equipping agents with explicit architectural foresight to ensure the software they build is not just functional, but also maintainable.
[AI-13] AI and Open-data Driven Scalable Solar Power Profiling
【速读】:该论文旨在解决当前屋顶太阳能光伏(Rooftop Photovoltaic, PV)部署空间分布与容量信息缺乏详细、实时数据支持的问题。为实现城市尺度的太阳能发电量动态建模,研究提出了一种基于基础视觉AI模型(foundation vision AI models)的开放、可扩展检测框架:其关键在于利用开源卫星影像自动识别屋顶光伏面板几何形态,无需人工标注或针对特定场景训练模型,同时保持对异构图像数据的鲁棒性;检测结果生成地理参考多边形,结合开放气象数据转化为区域级太阳能功率曲线,从而构建可增量扩展的空间显式太阳能资源数据库。该方法显著降低了对专有影像和封闭模型的依赖,提升了太阳能规划与分析的透明度与可扩展性。
链接: https://arxiv.org/abs/2605.02738
作者: Shiliang Zhang,Sabita Maharjan,Damla Turgut
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Solar photovoltaic (PV) deployment is expanding rapidly, yet detailed, up-to-date information on the spatial distribution and capacity of rooftop PV remains limited. This paper presents an open, scalable framework for detecting solar panels from open data and generating city-level solar power profiles. We leverage foundation vision AI models to detect solar panel geometries from open-source satellite imagery. This avoids manual data labeling and case-specific model training while maintaining robustness across heterogeneous imagery. Detected solar panels are converted into georeferenced polygons, yielding spatially explicit and incrementally extensible inventories. By integrating open weather data, we translate panel footprints into regional solar power profiles. The framework reduces dependency on proprietary imagery, manual labeling, and closed-source models, and offers a transparent and scalable approach for solar planning and analysis. We released the data and an API resulted from this work. For any user-specified building location, our API retrieves aerial imagery, detects rooftop solar panels, and returns georeferenced polygons. This empowers researchers and developers to scan user-defined areas to build solar panel maps and associated solar production profiles, thus facilitating advanced analysis like distributed solar production integration, local power flow optimization, energy tariff design, and infrastructure planning.
[AI-14] Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging
【速读】:该论文旨在解决生成式 AI (Generative AI) 在具有层次化多标签决策场景下的“可协同推理”问题,尤其在医学影像分析中,当模型需要决定是否将诊断任务委托给专家时,若沿用传统的扁平标签空间下学习去延迟(Learning to Defer, L2D)方法,会导致决策不一致,如违反临床分类体系、重复委托或遗漏隐含标签等现象。其核心挑战在于:如何在保持模型自主判断能力的同时,确保委托行为符合层级结构的语义约束。解决方案的关键是提出两种机制——一是精确一致投影(exact coherent projection),通过动态规划在一致动作集合上进行解码,实现对委托决策的严格约束;二是分类信念传播与递归策略优化(Taxonomic Belief Propagation with Recursive Policy Optimisation, TBP+RPO),构建一种基于分类契约感知的联合动作模型,并利用与推理阶段相同的递归结构进行训练,从而在保持高判别性能的同时显著降低委托不一致性。
链接: https://arxiv.org/abs/2605.02734
作者: Joshua Strong,Pramit Saha,Emma Sun,Helen Higham,Alison Noble
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Learning to Defer (L2D) enables a model to predict autonomously or defer to an expert, but prior work largely assumes flat label spaces. We study the first L2D setting with hierarchical multi-label decisions, motivated by medical-imaging workflows in which findings are organised by clinical taxonomies. In this setting, deferral is a delegation action rather than a label assignment, so treating it as an independent per-label decision can produce deferral incoherence, including taxonomic contradictions, delegation violations, and deferrals of labels already implied by the model’s own assertions. We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule, and show that even nodewise Bayes L2D can be action-incoherent. We then propose two remedies: exact coherent projection, a dynamic-programming decoder over the coherent action set, and Taxonomic Belief Propagation (TBP) with Recursive Policy Optimisation (RPO), a contract-aware joint action model trained through the same recursion used at inference. Across real-reader and controlled-expert medical-imaging benchmarks, naive binary-relevance L2D exhibits non-trivial incoherence. Projection removes it exactly, and fast TBP+RPO drives incoherence near zero while retaining strong utility.
[AI-15] ORPilot: A Production-Oriented Agent ic LLM -for-OR Tool for Optimization Modeling
【速读】:该论文旨在解决将现实业务问题转化为可求解优化模型的难题,尤其针对生产环境中存在的模糊描述、大规模原始操作数据及跨求解器后端可移植性需求。传统基于大语言模型(Large Language Model, LLM)的运筹学(Operations Research, OR)工具通常假设问题规格清晰且数据已预格式化,难以适配真实场景。解决方案的关键在于提出ORPilot系统,其核心创新包括:(1)对话式访谈代理以完整提取问题规范,(2)独立于提示的数据采集代理,(3)参数计算代理将原始表格数据映射为模型可用参数,以及(4)与求解器无关的中间表示(Intermediate Representation, IR),支持零LLM调用下的确定性重编译至Gurobi、CPLEX、PuLP、Pyomo或OR-Tools等求解器,并引入自纠正重试循环利用求解器回溯信息进行定向修复。此设计使ORPilot成为首个面向生产级业务问题而非教科书案例的智能优化建模系统。
链接: https://arxiv.org/abs/2605.02728
作者: Guangrui Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents ORPilot, an open-source agentic AI system that translates real-world business problems into solver-ready optimization models. Unlike academic LLM-for-OR tools that assume clean problem specifications with preformatted inline data, ORPilot is designed for production conditions: ambiguous descriptions, large-scale raw operational data, and the need for portability across solver backends. The system introduces four novel components: (1) a conversational interview agent to elicit complete problem specifications, (2) a data collection agent that retrieves data independently of prompts, (3) a parameter computation agent to bridge raw tabular data and model-ready parameters, and (4) a solver-agnostic Intermediate Representation (IR) for deterministic, zero-LLM-call recompilation to Gurobi, CPLEX, PuLP, Pyomo, or OR-Tools solvers. Additionally, self-correcting retry loops utilize solver tracebacks for targeted repairs. ORPilot represents the first attempt to target production-level business problems rather than textbook operations research (OR) cases. Evaluation on real-world problems demonstrates promising results. When tested against traditional academic benchmarks: IndustryOR, NL4OPT and NLP4LP, ORPilot outperformed state-of-the-art tools in accuracy on the IndustryOR benchmark and delivered comparable performance on NL4OPT and NLP4LP.
[AI-16] An Empirical Study of Agent Skills for Healthcare: Practice Gaps and Governance
【速读】:该论文旨在解决当前医疗AI代理(Healthcare AI Agents)在跨机构部署时因本地流程和组织约束导致的能力迁移困难问题。其解决方案的关键在于提出并实证分析“医疗代理技能”(healthcare agent skills)作为可复用的程序层,通过系统性地筛选与标注557个公共医疗技能,揭示其在功能、部署场景、自主性和安全性等方面的分布特征,从而为构建更适配多场景的医疗代理提供结构化基础,并指出现有基准测试和风险框架尚未充分覆盖此类技能层面的挑战。
链接: https://arxiv.org/abs/2605.02709
作者: Gelei Xu,Ningzhi Tang,Xueyang Li,Toby Jia-Jun Li,Zhi Zheng,Wei Jin,Yiyu Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Healthcare automation is shaped by local procedures and organizational constraints, so agent capabilities rarely transfer unchanged across settings. Agent skills, self-contained directories that package reusable procedures for AI agents, are emerging as a procedural layer for adapting healthcare agents across diverse healthcare settings. We present the first empirical analysis of healthcare agent skills, drawing on 557 healthcare-related skills filtered from 58,159 public skills on ClawHub and annotated along ten dimensions covering function, deployment context, autonomy, and safety. We find that public healthcare skills emphasize patient-facing workflow automation and monitoring rather than the diagnostic and treatment-oriented tasks foregrounded in healthcare-agent research; coverage of the healthcare lifecycle and specialized clinical inputs remains uneven; and general technical risk does not reliably capture clinical risk. These findings position healthcare skills as a procedural layer not yet addressed by current benchmarks and risk frameworks.
[AI-17] Caliper-in-the-Loop: Black-Box Optimization for Hyperledger Fabric Performance Tuning
【速读】:该论文旨在解决Hyperledger Fabric性能调优中因配置参数众多且相互耦合导致的手动调优困难问题。其核心解决方案是将基准测试视为一个带有噪声的黑箱优化问题,并引入贝叶斯优化(Bayesian Optimization, BO)结合降维(Dimensionality Reduction, DR)技术,构建了一个端到端的Caliper-in-the-loop自动化调优管道,通过部署候选配置、执行基准测试并基于观测到的吞吐量更新优化器来实现高效搜索。关键创新在于利用DYCORS-PCA等BO+DR组合方法在317维高维空间中有效探索最优配置,实验表明该方法相比初始配置可提升12%的每秒事务处理数(Transactions Per Second, TPS),验证了其在实际场景中的有效性与实用性。
链接: https://arxiv.org/abs/2605.02690
作者: Yash Madhwal,Arseny Bolotnikov,Mark Prikhno,Irina Lebedeva,Ivan Laishevskiy,Vladimir Gorgadze,Artem Barger,Yury Yanovich
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Hyperledger Fabric performance depends on many interacting configuration parameters, making manual tuning difficult. We study automated throughput tuning by treating benchmarking as a noisy black-box optimization problem and applying Bayesian optimization (BO) with dimensionality reduction (DR). We implement an end-to-end Caliper-in-the-loop pipeline that deploys candidate configurations, benchmarks them, and updates the optimizer from observed throughput. The search space, derived from Fabric configuration files, has 317 dimensions. In a cloud testbed, we evaluate 16 BO+DR variants and a random-search baseline. The best method, DYCORS-PCA, achieves a 12% TPS improvement relative to the first evaluated configuration, while MPI-REMBO achieves 9%. These results suggest that BO with DR is a practical approach for high-dimensional Hyperledger Fabric tuning, while also highlighting the role of measurement noise in interpreting gains.
[AI-18] Hybrid Inspection and Task-Based Access Control in Zero-Trust Agent ic AI
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体(agent)在动态调用工具和访问受保护资源时所面临的严重安全风险问题,尤其关注多轮对话和分布式协作场景下传统委托授权机制因缺乏对原始用户意图的可见性而导致的安全漏洞。解决方案的关键在于提出一种持续的智能体语义授权框架(Continuous Agent Semantic Authorization, CASA),其核心创新包括:一是设计了一个结合确定性控制与语义检查的混合运行时强制模型,通过零信任拦截层实现结构完整性保障与任务意图一致性验证;二是将任务导向的访问控制(Task-Based Access Control, TBAC)扩展至多轮交互场景,分两阶段进行语义分析——在拦截层提取多轮对话中的任务目标,在授权服务器端进行任务与工具的语义匹配;三是构建了包含相关与无关工具调用的新颖多轮对话-工具数据集,首次提供了TBAC在多轮交互下的实验评估结果。
链接: https://arxiv.org/abs/2605.02682
作者: Majed El Helou,Benjamin Ryder,Chiara Troiani,Jean Diaconu,Hervé Muyal,Marcelo Yannuzzi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Paper page at this https URL
Abstract:Authorizing Large Language Model (LLM)-driven agents to dynamically invoke tools and access protected resources introduces significant security risks, and the risks grow dramatically as agents engage in multi-turn conversations and scale toward distributed collaboration. A compromised or malicious agentic application can tamper with tool calls, falsify results, or request permissions beyond the scope of the subject’s intended tasks, which could go unnoticed with current delegated authorization flows given their lack of visibility into the original subject’s intent. In light of this, we make the following contributions towards Continuous Agent Semantic Authorization (CASA). First, we propose a hybrid runtime enforcement model that combines deterministic and semantic controls enabled by a zero-trust interception layer. Five deterministic controls enforce structural and data-integrity guarantees over the message flow, while a semantic inspection layer evaluates whether tool call choices align with the intended tasks commissioned to the agent. Second, differently from prior Task-Based Access Control (TBAC) techniques that operate on single-turn interactions, we decompose the semantic layer into two stages: i) a task-extraction step that distills the subject’s objectives from multi-turn conversations at the interception layer, and ii) a task-tool semantic matching step at the authorization server that evaluates whether the requested tools are appropriate for the extracted tasks. Third, we extend the ASTRA dataset that we introduced in a prior work, by generating novel conversation-tool datasets with multi-turn interactions containing relevant and irrelevant tool calls for a given task. Lastly, we provide the first experimental results for TBAC under multi-turn conversations.
[AI-19] he Design and Composition of Structural Causal Decision Processes
【速读】:该论文旨在解决计算系统经济建模中决策代理的因果机制难以准确刻画的问题,特别是现有模型在处理认知资源内生限制、价值贴现以及信念形成非理性情形时的不足。其解决方案的关键在于提出两类新的因果决策模型:结构因果决策模型(Structural Causal Decision Models, SCDMs)和结构因果决策过程(Structural Causal Decision Process, SCDP)。SCDMs 扩展了结构因果影响模型(Structural Causal Influence Models, SCIMs),显式建模变量间的因果关系与决策收益,并允许决策受因果前因约束,且支持未指定概率分布或结构方程的开放根变量;而 SCDP 作为具有折扣变量的重复 SCDM,具备良好的组合性,且能内生建模记忆形成过程与可变贴现,从而超越部分可观测马尔可夫决策过程(POMDPs)的假设局限,适用于数字经济发展中的政策仿真、信息系统机制设计及网络基础设施数字孪生建模等场景。
链接: https://arxiv.org/abs/2605.02681
作者: Sebastian Benthall,Alan Lujan
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
备注:
Abstract:We present two new classes of causal models of decision-making agents. Our approach is motivated by the needs of modeling the economics of computing systems. These systems are composed of subsystems and can exhibit endogenous limits on cognitive resources and value discounting. Structural Causal Decision Models (SCDMs) expand on Structural Causal Influence Models. Like SCIMs, they explicitly represent the causal relationships between model variables and the payoffs of agent decisions. Additionally, agent decisions can be constrained by their causal antecedents, and SCDMs can have open root variables for which no probability distribution or structural equation is given. We show that SCDMs have a well-defined and computationally useful property of composability. Building on SCDMs, we then define a Structural Causal Decision Process (SCDP) as a recurring SCDM with a discount variable. SCDPs benefit from the useful composition properties of SCDMs. Moreover, SCDPs are strictly more expressive than POMDPs because they do not assume rational belief formation. Indeed, an SCDP can endogenously model the memory-formation process, and is thus useful for modeling resource rational agents in dynamic settings. SCDPs are also capable of modeling variable discounting, a tool used widely in social scientific modeling. We pose that SCDPs are a useful framework for policy simulation for the digital economy, mechanism design for information systems, and digital twin modeling of cyberinfrastructure.
[AI-20] An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES
【速读】:该论文旨在解决药物诱导肝损伤(Drug-induced liver injury, DILI)预测中现有计算模型依赖二分类标签导致泛化能力弱且缺乏机制解释的问题。传统方法无法为转化决策提供可理解的毒理学依据,限制了其在药物研发中的应用价值。解决方案的关键在于提出DILER基准数据集和HADES智能体系统:DILER通过整合来自生物医学文献的肝毒性机制假设扩展了原始分子数据的标注维度;HADES则基于分子级预测、代谢物分解、结构理解与毒性通路证据,生成透明且可审计的推理链条,从而实现从“黑箱预测”向“可解释假说生成”的范式转变。实验表明,HADES在二分类任务上优于现有模型,并首次建立了机制假说生成的基线指标(Hypothesis Alignment Fuzzy Jaccard Index = 0.16),验证了其在提升预测可解释性和指导药物开发方面的潜力。
链接: https://arxiv.org/abs/2605.02669
作者: Maciej Wisniewski,Bartosz Topolski,Pawel Dabrowski-Tumanski,Dariusz Plewczynski,Tomasz Jetka
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Drug-induced liver injury (DILI) remains a leading cause of late-stage clinical trial attrition. However, existing computational predictors primarily rely on binary classification, a framing that limits generalization and yields no mechanistic insight to guide translational decisions. We argue that DILI prediction is better posed as an explainable hypothesis-generation problem. To support this shift, we introduce the DILER Benchmark, a dataset that extends beyond binary labels by augmenting a curated set of molecules with mechanistic hepatotoxicity hypotheses derived from biomedical literature. We further present HADES, an agentic system designed to generate transparent and auditable reasoning traces. By combining molecular-level predictions, metabolite decomposition, structural understanding, and toxicity pathway evidence, HADES mechanistically assesses DILI risk. Evaluated on the DILER Benchmark, HADES outperforms existing models in binary classification, achieving a ROC-AUC of 0.68 on the Test Set and 0.59 on the challenging Post-2021 Set, compared with 0.63 and 0.50 for DILI-Predictor, respectively. More importantly, we establish a baseline for mechanistic hypothesis generation, where HADES achieves a Hypothesis Alignment Fuzzy Jaccard Index of 0.16. This result underscores the inherent complexity of the task while highlighting the need for advanced explainable approaches in predictive toxicology. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.02669 [cs.AI] (or arXiv:2605.02669v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.02669 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-21] AcademiClaw: When Students Set Challenges for AI Agents
【速读】:该论文旨在解决当前OpenClaw生态系统中缺乏对AI代理在学术层级任务上能力评估的问题,即现有基准测试仅聚焦于助理级任务,未能充分检验模型处理复杂、长期规划的学术场景的能力。其解决方案的关键在于提出AcademiClaw——一个包含80个源自大学生真实学术工作流(如作业、科研项目、竞赛和自主项目)的双语基准,覆盖25个以上专业领域,涵盖从奥林匹克数学到GPU密集型强化学习等多样化任务,并通过多维评分体系与独立安全审计实现精细化评估。实验表明,即使是最先进的模型在该基准上的通过率也仅为55%,揭示了模型在不同任务域中的能力边界、行为策略差异及token消耗与输出质量之间的脱节,从而为提升AI代理在真实学术场景下的泛化能力和可靠性提供了细粒度诊断工具。
链接: https://arxiv.org/abs/2605.02661
作者: Junjie Yu,Pengrui Lu,Weiye Si,Hongliang Lu,Jiabao Wu,Kaiwen Tao,Kun Wang,Lingyu Yang,Qiran Zhang,Xiuting Guo,Xuanyu Wang,Yang Wang,Yanjie Wang,Yi Yang,Zijian Hu,Ziyi Yang,Zonghan Zhou,Binghao Qiang,Borui Zhang,Chenning Li,Enchang Zhang,Feifan Chen,Feng Jian,Fengyin Sun,Hao Qiu,Hao Zheng,Haoran Zhu,Hongyu Liu,Jianbin Deng,Jiaxin Song,Jiaying Chi,Jiayou Shi,Jie Fang,Jinghui Zhong,Jingyu Zhou,Jinze Li,Junfeng Yi,Junyan Yu,Junzhi Xue,Ni Song,Pengyi Chen,Qi Chen,Quansheng Li,Rui Tao,Shenghai Gong,Shenhang Lu,Tianqi Shen,Tianxiang Zhu,Tiehan Kang,Tingyu Li,Wendi Wu,Xiao Shen,Xiao Zhou,Xiaotao Zhang,Xinrong Li,Xuankun Yang,Xun Zhang,Yan Li,Ye Lu,Yi Wang,Yibo Zhou,Yichi Zhang,Yihao Sun,Yijun Huang,Yixin Zhu,Yixuan Wu,Yuchen Sun,Yue Wu,Yuheng Sun,Yukun Li,Yutian Tu,Yuxuan Qin,Yuzhuo Wu,Zeyu Li,Zhengyu Lou,Zhenning Ran,Zizhu He,Pengfei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students’ real academic workflows – homework, research projects, competitions, and personal projects – that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at this https URL.
[AI-22] Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
【速读】:该论文旨在解决深度学习模型中“捷径学习”(shortcut learning)问题,即模型过度依赖数据中的非本质特征(shortcut features)而非核心特征(core features),导致泛化能力下降的问题。其解决方案的关键在于引入演化博弈论(evolutionary game theory),将数据样本视为玩家、神经切线特征(neural tangent features)视为策略,并假设存在核心子网络和捷径子网络,从而形式化地刻画梯度下降(GD)与随机梯度下降(SGD)在训练过程中分别趋向于两种不同的稳态策略:前者主要优化捷径子网络,后者则主导核心子网络的优化。通过构建连续随机微分方程分析这些策略对捷径偏置形成的影响,揭示了数据噪声和优化噪声的作用机制,为从理论上理解并缓解捷径偏置提供了新视角。
链接: https://arxiv.org/abs/2605.02658
作者: Xiayang Li,Kuo Gai,Shihua Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 47pages, 5 figures
Abstract:Shortcut learning causes deep learning models to rely on non-essential features within the data. However, its formation in deep neural network training still lacks theoretical understanding. In this paper, we provide a formal definition of core and shortcut features and employ evolutionary game theory to analyze the origins of shortcut bias by modeling data samples as players and their corresponding neural tangent features as strategies, assuming the existence of core and shortcut subnetworks. We find that gradient descent (GD) and stochastic gradient descent (SGD) lead to two distinct stochastically stable states, each corresponding to a different strategy. The former primarily optimizes the shortcut subnetwork, while the latter primarily optimizes the core subnetwork. We investigate the influence of these strategies on shortcut bias through a continuous stochastic differential equation, and reveal the impact of data noise and optimization noise on the formation of shortcut bias. In brief, our work employs evolutionary game theory to characterize the dynamics of shortcut bias formation and provides a theoretical view on its mitigation.
[AI-23] rustworthy AI Suffers from Invariance Conflicts and Causality is The Solution ICML’2026
【速读】:该论文旨在解决可信人工智能(Trustworthy AI)在多个核心目标(如公平性、鲁棒性、隐私性和可解释性)之间难以同时实现的问题,尤其是在保持模型效用的前提下。其解决方案的关键在于引入因果推理(causality),将这些目标之间的权衡重新诠释为数据生成过程发生不同变化时产生的不兼容不变性(invariance)要求。通过这一视角,因果框架能够统一理解可信AI权衡的来源,并借助选择性不变性(selective invariance)软化或化解这些冲突,从而为传统机器学习模型和大规模基础模型(Foundation Models, FMs)提供更具可行性的优化路径。
链接: https://arxiv.org/abs/2605.02640
作者: Ruta Binkyte,Ivaxi Sheth,Zhijing Jin,Mohammad Havaei,Bernhard Schölkopf,Mario Fritz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICML’2026
Abstract:As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stakes domains, ensuring their trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs in performance and multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise, and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Our paper discusses how causal assumptions may be applied explicitly or implicitly in modern large-scale systems. Finally, we outline open challenges and opportunities for using causality to build more trustworthy AI.
[AI-24] SCGNN: Semantic Consistency enhanced Graph Neural Network Guided by Granular-ball Computing
【速读】:该论文旨在解决图表示学习中节点间语义一致性捕捉效率低、噪声敏感及可扩展性差的问题。现有方法依赖k近邻(k-Nearest Neighbors, kNN)或节点级全搜索算法(Full Search Algorithm, FSA)进行成对相似性计算,导致计算复杂度高且邻居选择僵化,难以适应大规模图结构并引入冗余连接。其解决方案的关键在于提出一种可插拔的语义一致性增强图神经网络(Semantic Consistency enhanced Graph Neural Network, SCGNN),通过粒度球计算(Granular-Ball Computing, GBC)自适应地将节点划分为粒度球,从而在组级别建模语义结构,显著降低计算开销并提升对噪声的鲁棒性;同时设计双重增强策略:一是结构增强模块构建基于锚点的图结构,将粒度球所承载的组级语义注入图结构;二是监督增强模块通过标签一致性检查(Label Consistency Checking, LCC)融合GBC预测与模型伪标签,生成更可靠的监督信号,从而有效指导参数更新。
链接: https://arxiv.org/abs/2605.02617
作者: Genhao Tian,Taihua Xu,Shuyin Xia,Qinghua Zhang,Jie Yang,Jianjun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Capturing semantic consistency among nodes is crucial for effective graph representation learning. Existing approaches typically rely on k -nearest neighbors ( k NN) or other node-level full search algorithms (FSA) to mine semantic relationships via exhaustive pairwise similarity computation, which suffer from high computational complexity and rigid neighbor selection, limiting scalability and introducing noisy connections. In this paper, we propose the Semantic Consistency enhanced Graph Neural Network (SCGNN), a novel plug-and-play framework that leverages granular-ball computing (GBC) to efficiently capture semantic consistency in a scalable manner. Unlike node-level FSA methods, SCGNN models group-level semantic structure by adaptively partitioning nodes into granular balls, significantly reducing computational cost while improving robustness to noise. To effectively utilize the discovered group-level semantic consistency, we design a dual enhancement strategy. Specifically, (1) a structure enhancement module constructs an anchor-based graph structure, where each anchor is a virtual node representing the group-level semantic carried by a granular ball, then injecting group-level semantic information into the graph structure; and (2) a supervision enhancement module performs label consistency checking (LCC) by combining GBC predictions with model-generated pseudo-labels, thereby producing more reliable supervision signals. SCGNN is compatible with various GNN backbones. During the forward propagation of SCGNN, the vanilla graph and the augment graph are jointly encoded, and their predictions are fused; during the backpropagation, the supervision enhancement module provides enhanced supervision signals to guide parameter updates.
[AI-25] Counterfactual Reasoning in Automated Planning
【速读】:该论文旨在解决自动化规划(automated planning)中固有假设过于理想化的问题,即传统方法假设任务的初始状态、目标和可用动作在规划前已完全确定,这在规则固定且执行确定性的领域中有效,但在现实世界中往往缺乏灵活性,难以应对突发情况或优化结果。解决方案的关键在于引入反事实推理(counterfactual reasoning),通过系统性地分析在何种元素(如初始状态、目标或动作)被改变、何时触发推理以及为何及如何进行这些变更,来增强规划系统的适应性和鲁棒性。该研究对现有工作进行了分类梳理,并指出了未来研究方向,为实现更具情境感知与动态调整能力的智能规划提供了理论框架。
链接: https://arxiv.org/abs/2605.02603
作者: Alberto Pozanco,Daniel Borrajo,Manuela Veloso
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Automated planning traditionally assumes that all aspects of a planning task (initial state, goals, and available actions) are fully specified in advance, an approach well-suited to domains with fixed rules and deterministic execution. However, real-world planning often requires flexibility, allowing for deviations from the original task parameters in response to unforeseen circumstances or to improve outcomes. This paper surveys existing works on counterfactual reasoning in automated planning, categorizing them by what elements are changed, when the reasoning is triggered, and why and how these changes are made. We conclude by discussing key findings and outlining open research questions to guide future work in this area.
[AI-26] CoRAL: Contact-Rich Adaptive LLM -based Control for Robotic Manipulation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)在接触丰富的操作任务中应用受限的问题,主要挑战在于其缺乏显式的物理接地能力以及无法实现自适应控制。解决方案的关键在于提出一种模块化框架 CoRAL(Contact-Rich Adaptive LLM-based control),通过将高层语义推理与底层控制解耦,使 LLM 不作为直接控制器,而是作为成本函数设计者,为基于采样的运动规划器(MPPI)生成情境感知的目标函数;同时引入神经符号适应环路,利用 VLM 提供的语义先验(如质量与摩擦估计)并通过在线系统辨识实时优化物理参数,结合交互反馈迭代调整成本结构以修正策略错误,并借助检索式记忆单元复用成功策略,从而在保证实时控制稳定性的同时显著提升未见场景下的接触丰富任务成功率。
链接: https://arxiv.org/abs/2605.02600
作者: Berk Çiçek,Mert K. Er,Özgür S. Öğüz
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 21 pages, 9 figures, 3 tables. Accepted to Robotics: Science and Systems (RSS) 2026
Abstract:While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.
[AI-27] Foundation-Model-Based Agents in Industrial Automation: Purposes Capabilities and Open Challenges
【速读】:该论文旨在解决工业场景中基于基础模型(Foundation Models)的智能体系统(agent systems)当前发展状态不清晰、功能特性差异不明以及持续存在的局限性缺乏系统归纳的问题。其解决方案的关键在于通过遵循PRISMA 2020指南的系统性文献综述方法,对2341篇文献进行筛选并结构化编码,最终提炼出88篇高质量研究,从而量化分析这些系统在技术就绪度(TRL)、应用场景、能力特征(如人机交互、不确定性处理等)及主要限制因素(如泛化能力不足、幻觉和推理延迟)方面的现状,并提出一个融合传统代理理论、自动化工程标准与基础模型范式的工业级智能体工作定义,为后续研究与应用提供明确方向。
链接: https://arxiv.org/abs/2605.02592
作者: Vincent Henkel,Felix Gehlhoff,David Kube,Asaad Almutareb,Luis Cruz,Bernd Hellingrath,Philip Koch,Christoph Legat,Florian Mohr,Michael Oberle,Felix Ocker,Thorsten Schoeler,Mario Thron,Nico Andre Töpfer,Lucas Vogt,Yuchen Xia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 35 pages, 8 figures, 1 table. Submitted to Journal of Intelligent Manufacturing for peer review. A comparison of classical agent applications and foundation-model based agents is presented
Abstract:Foundation models, particularly large language models, are increasingly integrated into agent architectures for industrial tasks such as decision support, process monitoring, and engineering automation. Yet evidence on their purposes, capabilities, and limitations remains fragmented across domains. This work examines how mature foundation-model-based agent systems are in industrial contexts, how their functional profile differs from conventional agent systems, and which limitations persist. A systematic literature survey following the PRISMA 2020 guideline is presented, screening 2,341 publications and synthesising a corpus of 88 publications through a structured coding scheme. The results show that reported systems are predominantly at prototype and early validation stages (75.0% at TRL 4-6), with deployment-oriented evidence remaining rare (9.1%). Operational goals are most frequently positioned in user assistance, monitoring, and process optimisation, while conventional production-control purposes such as planning and scheduling are less prominent. Compared with an established baseline for industrial agent systems, the capability profile reveals substantial gains in human interaction (+37%) and dealing with uncertainty (+35%), but a pronounced deficit in negotiation (-39%). The most widely reported limitations concern lack of generalization, hallucination and output instability, data scarcity, and inference latency. A working definition of foundation-model-based industrial agents is also proposed, bridging conventional agent theory, automation-engineering standards, and the foundation-model paradigm.
[AI-28] Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
【速读】:该论文旨在解决深度神经网络中非线性激活函数设计面临的优化稳定性与计算效率难以兼顾的问题。现有方法中,分段线性函数虽具备推理速度快的优势,但因在原点处不可微导致优化不稳定;而平滑函数通常依赖超越运算,带来显著的计算开销。解决方案的关键在于提出一种基于构造逼近理论的通用平滑框架,并引入伯恩斯坦线性单元(Bernstein Linear Unit, BerLU),其利用伯恩斯坦多项式构建一个可微的二次过渡区域,在消除奇异性的同时保持分段线性结构,理论上保证了严格连续可微性和Lipschitz常数为1的非扩张性,从而确保梯度传播稳定并避免深层网络中的梯度爆炸问题。
链接: https://arxiv.org/abs/2605.02591
作者: Wentao Zhang,Yutong Zhang,Yifan Zhu,Wentao Mo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to 2026 International Conference On Intelligent Computing (ICIC 2026)
Abstract:The efficacy of deep neural networks is heavily reliant on the design of non-linear activation functions, yet existing approaches often struggle to balance optimization stability with computational efficiency. While piecewise linear functions offer inference speed, they suffer from optimization instability due to non-differentiability at the origin, whereas smooth counterparts typically incur significant computational overhead through their reliance on transcendental operations. To address these limitations, this paper proposes a general smoothing framework based on constructive approximation theory and introduces the Bernstein Linear Unit (BerLU). This novel activation function utilizes Bernstein polynomials to construct a differentiable quadratic transition region that effectively eliminates singularities while maintaining a piecewise linear structure. Theoretical analysis demonstrates that the proposed method guarantees strictly continuous differentiability and a non-expansive Lipschitz constant of one, which ensures stable gradient propagation and prevents the gradient explosion problems common in deep architectures. Comprehensive empirical evaluations across representative Vision Transformer and Convolutional Neural Network architectures confirm that this approach consistently outperforms state-of-the-art baselines on standard image classification benchmarks while delivering superior computational and memory efficiency.
[AI-29] Beyond State Machines: Executing Network Procedures with Agent ic Tool-Calling Sequences
【速读】:该论文旨在解决如何利用基于大语言模型(Large Language Model, LLM)的网络智能体(network AI agent)高效、可靠地执行移动通信系统中复杂的多步骤网络操作流程的问题。其核心挑战在于平衡执行延迟与正确性,尤其是在涉及多个工具调用(tool invocation)的序列化任务中。解决方案的关键在于设计不同的代理-工具协作范式:相较于依赖代理端迭代推理的方法(导致高延迟和易出错),将完整流程封装于单一工具内部并由该工具协调子步骤调用的方式显著降低延迟并提升可靠性;同时,研究引入了一种面向具体流程的错误分类体系(procedure-specific error taxonomy),以系统化分析多步执行中的失败模式,从而揭示LLM在复杂工具调用工作流中的实际执行边界。
链接: https://arxiv.org/abs/2605.02584
作者: Purna Sai Garigipati,Onur Ayan,Kishor Chandra Joshi,Xueli An
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic AI will be an essential enabling technology for designing future mobile communication systems, which could provide flexible and customized services, automate complex network operations, and drive autonomous decision-making across the network. This work studies how Large Language Model (LLM)-based network AI agents can be utilized to execute network procedures expressed as sequences of tool invocations. We investigate four approaches, which differ in how the agent obtains the procedure and in how execution is distributed between the agent and the underlying tools. We evaluated the latency and execution correctness across these approaches using a User Equipment (UE) IP allocation procedure as a case study. Furthermore, we conduct a stress test to examine how many sequential procedural steps an LLM agent can reliably execute before failure. Our results show that approaches relying on iterative agent-side reasoning incur higher latency and are more prone to execution errors, while approaches where the procedure is encapsulated within a single tool, which internally orchestrates the required steps by invoking other tools, reduce latency by limiting repeated reasoning. The stress-test results further show that the model with advanced tool-calling capability maintains reliable execution over longer procedures than the other evaluated models; however, all models exhibit reliability degradation as procedure length increases, revealing clear execution limits in multi-step tool-based workflows. To systematically analyze failures in procedure execution, we introduce a procedure-specific error taxonomy that categorizes deviations in multi-step procedural execution.
[AI-30] On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length ICML2026
【速读】:该论文旨在解决长任务序列(即高任务时域长度,horizon length)对大型语言模型(Large Language Models, LLMs)训练稳定性与性能的负面影响问题。研究发现,单纯增加任务时域长度会引发严重的训练不稳定性,其根源在于探索困难和信用分配挑战。论文提出的关键解决方案是“时域缩减”(horizon reduction),即在训练阶段通过缩短任务动作序列长度来稳定训练过程,并提升模型在长时域任务中的表现。进一步研究表明,这种时域缩减策略还能促进模型在不同任务时域长度之间的泛化能力,即“时域泛化”(horizon generalization),使得模型在推理阶段能够更有效地适应比训练时域更长的任务。
链接: https://arxiv.org/abs/2605.02572
作者: Sunghwan Kim,Junhee Cho,Beong-woo Kwak,Taeyoon Kwon,Liang Wang,Nan Yang,Xingxing Zhang,Furu Wei,Jinyoung Yeo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML 2026
Abstract:Large language models (LLMs) have shown promise as interactive agents that solve tasks through extended sequences of environment interactions. While prior work has primarily focused on system-level optimizations or algorithmic improvements, the role of task horizon length in shaping training dynamics remains poorly understood. In this work, we present a systematic empirical study that examines horizon length through controlled task constructions. Specifically, we construct controlled tasks in which agents face identical decision rules and reasoning structures, but differ only in the length of action sequences required for successful completion. Our results reveal that increasing horizon length alone constitutes a training bottleneck, inducing severe training instability driven by exploration difficulties and credit assignment challenges. We demonstrate that horizon reduction is a key principle to address this limitation, stabilizing training and achieving better performance in long-horizon tasks. Moreover, we find that horizon reduction is related to stronger generalization across horizon lengths: models trained under reduced horizons generalize more effectively to longer-horizon variants at inference time, a phenomenon we refer to as horizon generalization.
[AI-31] Recurrent Deep Reinforcement Learning for Chemotherapy Control under Partial Observability CEC
【速读】:该论文旨在解决化疗剂量优化中因患者状态部分可观测(partial observability)而导致的决策难题,即在临床实践中难以获取完整的生理状态信息时,如何通过强化学习实现更有效的治疗策略。其关键解决方案是引入基于循环神经网络(Recurrent Neural Network, RNN)的记忆增强型策略,具体采用带有独立LSTM(Long Short-Term Memory)结构的Actor-Critic架构的TD3算法,在AhmChemoEnv基准环境下验证了该方法相较于传统前馈策略在部分可观测条件下的显著优势:不仅提升了肿瘤抑制的一致性,还增强了对正常细胞的保护能力,证明记忆机制能有效缓解状态不完整或观测噪声带来的不确定性影响。
链接: https://arxiv.org/abs/2605.02552
作者: Firas Mohamed Elamine Kiram,Imane Youkana,Rachida Saouli,Gian Antonio Susto,Laid Kahloul
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the VI. International Conference on Electrical, Computer and Energy Technologies (ICECET 2026)
Abstract:Chemotherapy dose optimization can be formulated as a dynamic treatment regime, requiring sequential decisions under uncertainty that must balance tumor suppression against toxicity. However, most reinforcement learning approaches assume full observability of the patient state, a condition rarely met in clinical practice. We investigate whether memory-augmented policies can improve chemotherapy control under partial observability. To this end, we employ a recurrent TD3-based approach with separate LSTM actor-critic networks and evaluate it on the AhnChemoEnv benchmark from DTR-Bench, considering both off-policy and on-policy recurrent architectures against feed-forward TD3 and Soft Actor-Critic. Pharmacokinetic and pharmacodynamic variability are held fixed to isolate hidden-state uncertainty and observation noise and to avoid confounding effects from inter-patient variability. Across ten random seeds, recurrence yields modest benefit under full observability but substantially stronger and more stable performance under partial observability, with more consistent tumor suppression and improved normal-cell preservation. These findings indicate that memory-based policies are particularly beneficial when clinically relevant state information is incomplete or noisy.
[AI-32] Double Rectified Linear Unit-based Modular Semantics for Quantitative Bipolar Argumentation Framework
【速读】:该论文旨在解决定量双极论证框架(Quantitative Bipolar Argumentation Frameworks, QBAFs)中论证可接受性计算存在的分歧与非直观结果问题,尤其是在简单无环情形下不同语义仍可能产生不一致的结论。其解决方案的关键在于提出一种新的渐进语义(gradual semantics),该语义不仅在理论上满足文献中已有的理性后设条件(rationality postulates),而且能更贴近人类直觉地评估论证强度;此外,作者进一步证明了该语义在无环QBAFs以及更广泛的循环框架中均具有收敛性,从而显著提升了QBAFs在实际应用中的稳定性和可靠性。
链接: https://arxiv.org/abs/2605.02551
作者: Gianvincenzo Alfano,Sergio Greco,Lucio La Cava,Francesco Parisi,Irina Trubitsyna
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Quantitative Bipolar Argumentation Frameworks (QBAFs) provide an alternative approach to computing argument acceptability in Bipolar Argumentation Frameworks (BAFs). Each argument is assigned an initial strength, which is then updated to a final strength by considering the influence of both its attackers and supporters. Over the years, several semantics have been proposed to compute argument acceptability in QBAFs, yet they often yield divergent or counterintuitive results, even for simple acyclic cases. We introduce novel gradual semantics for QBAFs that address these limitations, producing results that align more closely with intuitive expectations, while satisfying established rationality postulates from the literature. Furthermore, we study its convergence behavior, proving that it converges not only for acyclic QBAFs but also for broader classes of cyclic frameworks.
[AI-33] Strategy-Aware Optimization Modeling with Reasoning LLM s
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动优化建模中因缺乏对建模策略(Modeling Strategy)显式认知而导致的公式错误和求解效率低下的问题。其解决方案的关键在于提出SAGE框架,通过构建一个经求解器验证的多策略数据集,并采用监督微调与分段加权GRPO(Segment-Weighted GRPO)相结合的训练方法,利用格式合规性、正确性和求解效率的复合奖励信号来指导学生模型学习有效的建模策略。该设计使模型不仅生成语法正确的优化程序,还能发现更高效、结构更紧凑的约束系统,从而显著提升自动化优化建模的质量和多样性。
链接: https://arxiv.org/abs/2605.02545
作者: Ruiqing Zhao,Fengzhi Li,Yuan Zuo,Rui Liu,Yansong Liu,Yunfei Ma,Fanyu Meng,Junlan Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) can generate syntactically valid optimization programs, yet often struggle to reliably choose an effective modeling strategy, leading to incorrect formulations and inefficient solver behavior. We propose SAGE, a strategy-aware framework that makes Modeling Strategy explicit in both data construction and post-training. SAGE builds a solver-verified multi-strategy dataset and trains a student model with supervised fine-tuning followed by Segment-Weighted GRPO using a composite reward over format compliance, correctness, and solver efficiency. Across eight benchmarks spanning synthetic and real-world settings, SAGE improves average pass@1 from 72.7 to 80.3 over the strongest open-source baseline. With multiple generations, SAGE discovers more distinct correct formulations and improves component-level diversity at pass@16 by 19-29%. At the largest scale, SAGE produces more compact constraint systems with 14.2% fewer constraints than the baseline, consistent with solver-efficient modeling. Overall, these results show that making Modeling Strategy explicit improves automated optimization modeling. Code is available at this https URL.
[AI-34] Orchestrating Spatial Semantics via a Zone-Graph Paradigm for Intricate Indoor Scene Generation
【速读】:该论文旨在解决非凸室内场景中3D自主生成失效的问题,尤其针对数据驱动的生成模型缺乏拓扑先验、迭代式代理导致语义碎片化和几何脆弱性等挑战。其核心解决方案是提出ZoneMaestro框架,通过将传统以物体为中心的合成范式转变为基于区域(Zone)的图结构编排(Zone-Graph Orchestration),内化一种新颖的区域逻辑,将高层语义意图映射为功能区域与拓扑约束,从而实现对多样化建筑形式的鲁棒适应。关键创新在于引入交替对齐策略(Alternating Alignment Strategy),在推理内化与区域感知的组相对策略优化(Z-GRPO)之间循环迭代,有效调和语义丰富性与几何有效性之间的矛盾,且无需依赖外部物理引擎。
链接: https://arxiv.org/abs/2605.02537
作者: Meisheng Zhang,Shizhao Sun,Yang Zhao,Ziyuan Liu,Zhijun Gao,Jiang Bian
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous 3D indoor scene synthesis breaks down in non-convex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration. By internalizing a novel zone-based logic, ZoneMaestro translates high-level semantic intent into functional zones and topological constraints, enabling robust adaptation to diverse architectural forms. To support this, we construct Zone-Scene-10K, a large-scale dataset enriched with explicit Zone-Graph annotations. We further introduce an Alternating Alignment Strategy that cycles between reasoning internalization and Zone-Aware Group Relative Policy Optimization (Z-GRPO), effectively reconciling the tension between semantic richness and geometric validity without relying on external physics engines. To rigorously evaluate spatial intelligence beyond convex primitives, we formally define the task of Intricate Spatial Orchestration and release SCALE, a stress-test benchmark for irregular indoor scenarios with complex, dense spatial relations. Extensive experiments demonstrate that ZoneMaestro resolves the density-safety dichotomy, significantly outperforming state-of-the-art baselines in both structural coherence and intent adherence.
[AI-35] Set-Based Training of Neural Barrier Certificates for Safety Verification of Dynamical Systems
【速读】:该论文旨在解决动态系统安全验证中传统迭代训练与验证流程效率低下的问题。现有方法通过反复训练神经网络并进行形式化验证来合成屏障证书(barrier certificate),这一过程耗时且难以扩展。其解决方案的关键在于提出一种基于集合的训练方法,通过设计一个集成了所有屏障证书性质的集合损失函数(set-based loss function),将验证过程直接嵌入训练阶段;当损失值为零时,可形式化证明屏障证书的有效性,从而将原本分离的训练与验证步骤融合为单一训练流程,显著提升效率并更好地处理高维非线性系统。
链接: https://arxiv.org/abs/2605.02526
作者: Miriam Kranzlmüller,Lukas Koller,Tobias Ladner,Matthias Althoff
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Barrier certificates are scalar functions over the state space of dynamical systems that separate all unsafe states from all reachable states. The existence of a barrier certificate formally verifies the safety of the dynamical system. Recent approaches synthesize barrier certificates by iteratively training a neural network. In each iteration, the candidate is formally verified - if successful, the barrier certificate is found. Instead, we propose a set-based training approach that tightly integrates verification into training via a set-based loss function that soundly encodes all barrier certificate properties. A loss of zero formally proves the validity of the barrier certificate, collapsing the iterative training and verification into a single training procedure. Our experiments demonstrate that our set-based training approach scales well with the system dimension and naturally handles complex nonlinear dynamics.
[AI-36] A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory
【速读】:该论文旨在解决室内自主移动机器人在执行导航任务时缺乏对自然语言指令语义理解能力的问题,即现有系统(如ROS 2 Navigation 2)虽能可靠地定位至度量坐标,但无法解析表达意图而非具体位置的语言指令。其关键解决方案是提出“语义自主栈”(Semantic Autonomy Stack),一个六层参考框架,核心创新在于采用混合确定性-视觉语言模型(Vision-Language Model, VLM)推理机制与跨机器人自适应记忆系统:通过七步参数化解析器,在无需调用VLM、摄像头或GPU的情况下处理88%的指令(延迟<0.1毫秒);仅对真正模糊的指令触发VLM推理;同时设计五类语义记忆结构并引入显式作用域分类(全局环境知识、单操作员偏好、单机器人能力),实现跨会话学习和跨机器人知识迁移——通过共享编译摘要将VLM交互中习得的偏好转化为确定性规则,并在不同机器人间传递,从而实现103,000倍的延迟降低,且在无GPU的Raspberry Pi 5平台上完成实验验证。
链接: https://arxiv.org/abs/2605.02525
作者: Bogdan Felician Abaza,Andrei-Alexandru Staicu,Cristian Vasile Doicin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 33 pages, 11 figures, 14 tables
Abstract:Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware. A seven-step parametric resolver handles 88% of instructions in under 0.1 milliseconds without invoking a language model, camera, or GPU; only genuinely ambiguous instructions escalate to VLM reasoning. A five-category semantic memory framework with explicit scope taxonomy (global environment knowledge, per-operator preferences, per-robot capabilities) enables cross-session learning and cross-robot knowledge transfer: preferences learned through VLM interactions on one robot are promoted to deterministic resolution and transferred to a second robot via a shared compiled digest, achieving a measured latency reduction of 103,000-fold. Experimental validation on two custom-built differential-drive robots across 82 scenario-level decisions and three sessions demonstrates 100% semantic transfer accuracy (33/33, 95% CI [0.894, 1.000]), 100% semantic resolution accuracy, and concurrent multi-robot operation feasibility - all on Raspberry Pi 5 platforms with no onboard GPU, requiring zero training data.
[AI-37] A Novel Preprocessing-Driven Approach to Remaining Useful Life (RUL) Prediction Using Temporal Convolutional Networks (TCN)
【速读】:该论文旨在解决航空发动机剩余使用寿命(Remaining Useful Life, RUL)预测中因数据预处理不足而导致模型性能受限的问题。现有深度学习方法多聚焦于模型架构设计,忽视了输入特征在预处理阶段的质量与时间表征对预测精度的影响。其解决方案的关键在于提出一种新颖的数据预处理流程,通过利用完整的时序数据并生成每个时间步的RUL估计值,增强数据质量与时间动态建模能力,使神经网络能够捕捉更精细的退化过程,从而实现连续、高精度的预测性维护支持。
链接: https://arxiv.org/abs/2605.02507
作者: Florent Imbert,Tosin Adewumi,Hui Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate prediction of Remaining Useful Life (RUL) in aero-engines is vital for predictive maintenance, improved operational reliability, and reduced lifecycle costs. While deep learning approaches have demonstrated strong potential in this area, most existing methods focus primarily on model architecture design and treat input features uniformly, often neglecting the influence of data preprocessing. In this work, we propose a novel preprocessing pipeline that enhances RUL prediction by improving data quality and temporal representation before model training. Our approach leverages complete temporal sequences and generates RUL estimates at each timestep, enabling the model to capture fine-grained degradation dynamics and deliver continuous prognostic insights throughout the engine’s operational life. To validate the effectiveness of the proposed pipeline, we conduct experiments on the NASA C-MAPSS dataset. Comparative evaluations against a suite of state-of-the-art neural models including CNN, RNN, LSTM, DCNN, TCN, BiGRU-TSAM, AGCNN, and ATCN, demonstrate that our approach consistently achieves superior accuracy and robustness in aero-engine RUL prediction. These results highlight the critical role of preprocessing in maximizing the effectiveness of neural prognostic models.
[AI-38] DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis
【速读】:该论文旨在解决当前自主数据分析代理(autonomous data analysis agents)在探索性数据分析场景下评估不足的问题,尤其针对现有基准测试过于关注最终答案准确性而缺乏对推理过程进行细致评估的局限性。其解决方案的关键在于提出一个面向过程的基准测试工具 DataClaw,该工具包含约 206 万条真实世界记录(涵盖企业、工业和政策领域),并保留原始数据噪声;同时设计了 492 个跨领域的任务,源自智库咨询场景,并为每个任务标注中间里程碑,从而支持对代理推理路径的细粒度诊断。这一设计使得 DataClaw 能够量化代理在分析流程中的进展程度及推理中断位置,揭示模型虽答案错误但存在部分有效推理的现象,以及不同模型间探索策略的差异,为评估和改进自主数据分析代理的能力边界提供了更贴近实际应用的诊断平台。
链接: https://arxiv.org/abs/2605.02503
作者: Qiaohong Zhang,Weihao Ye,Jialong Chen,Yi Luo,BoYuan Li,Bowen Deng,Zibin Zheng,Jianhao Lin,Wei-Shi Zheng,Chuan Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating autonomous data analysis agents requires testing their ability to perform exploratory analysis in underexplored data environments. However, many existing benchmarks emphasize final answer accuracy in prior-guided data settings and provide limited support for reasoning process evaluation. We introduce DataClaw, a process-oriented benchmark for exploratory real-world data analysis. DataClaw contains approximately 2.06 million real-world records across enterprise, industry and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones for process-level evaluation. These annotations allow DataClaw to measure how far an agent progresses and where its reasoning breaks down. Experiments with eight advanced LLMs show that current agents remain far from reliable in this setting, with seven models achieving below 50% overall accuracy. Process analysis further reveals partial progress hidden behind wrong answers and distinct exploration strategies across models. Overall, DataClaw provides a less data constrained diagnostic testbed for probing the capability boundaries of autonomous data-analysis agents.
[AI-39] Pretraining on Sleep Data Improves non-Sleep Biosignal Tasks
【速读】:该论文旨在解决睡眠生物信号(sleep biosignals)是否能够作为有效的预训练分布,从而学习出可迁移至相邻领域(如非睡眠脑电图 EEG 和心电图 ECG)的通用表示方法这一问题。其解决方案的关键在于采用仅基于睡眠数据的多模态对比预训练策略(sleep-only multimodal contrastive pretraining),并引入留一法(leave-one-out)目标函数以增强模型对跨任务和跨数据集泛化能力;实验表明,该方法在多个非睡眠 EEG 和 ECG 的下游任务中均显著优于从零开始训练,并达到或超越现有专用模型及基础模型的性能水平。
链接: https://arxiv.org/abs/2605.02500
作者: William Lehn-Schiøler,Magnus Ruud Kjær,Phillip Hempel,Magnus Guldberg Pedersen,Rahul Thapa,Bryan He,Nicolai Spicher,Andreas Brink-Kjaer,Lars Kai Hansen,Emmanuel Mignot
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 10 tables
Abstract:Sleep foundation models have recently demonstrated strong performance on in-domain polysomnography tasks, including sleep staging, apnea detection, and disease risk prediction. In this work, we investigate whether sleep biosignals can serve as an effective pretraining distribution for learning representations that transfer beyond sleep to adjacent domains. Following sleep foundation models, we perform sleep-only multimodal contrastive pretraining (with a leave-one-out objective) and evaluate transfer to non-sleep EEG and ECG, two well-benchmarked biosignal modalities with heterogeneous datasets and clinically meaningful downstream tasks. Across eight downstream tasks spanning multiple EEG and ECG datasets, sleep pretraining consistently improves performance relative to training from scratch. Moreover, on several tasks, we achieve performance competitive with or surpassing prior specialized state-of-the-art and foundation models.
[AI-40] Efficient Preference Poisoning Attack on Offline RLHF
【速读】:该论文旨在解决基于人类反馈的离线强化学习(Offline Reinforcement Learning from Human Feedback, RLHF)中,如对数线性直接偏好优化(log-linear DPO)方法在面对偏好标签翻转攻击(preference poisoning attack)时的脆弱性问题。其核心挑战在于如何以最小的标签扰动实现对模型参数的有效误导,从而破坏DPO的学习过程。解决方案的关键在于发现:单个偏好标签的翻转会引发DPO梯度的一个与模型参数无关的偏移(parameter-independent shift),这一性质使得目标攻击问题可被转化为一个结构化的二值稀疏逼近问题。基于此洞察,作者提出了两种攻击方法——Binary-Aware Lattice Attack (BAL-A) 和 Binary Matching Pursuit Attack (BMP-A),分别利用格点优化和二值匹配追踪策略,在保证二值系数的前提下实现最小翻转目标,并提供理论保障和鲁棒性证书,验证了字典几何结构对攻击成功率的决定性影响。
链接: https://arxiv.org/abs/2605.02495
作者: Chenye Yang,Weiyu Xu,Lifeng Lai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attack. We study label flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai’s nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for K -flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
[AI-41] Efficient Temporal Datalog Materialisation for Composite Event Recognition
【速读】:该论文旨在解决多类事件规范语言(event specification languages)在流式推理框架中缺乏统一表达与比较基础的问题,从而阻碍了其在高吞吐符号事件流中对关键情境(如安全威胁和透明度问题)的及时检测。解决方案的关键在于将主流事件规范语言的实用片段映射到Temporal Datalog–(一种带有分层否定且无未来依赖的时序Datalog),并提出Streaming Trigger Graphs——一种基于先进Datalog材料化技术的扩展机制,以支持高效流式推理。该方法构建了一个统一的复合事件识别机制,具备跨多种实际事件规范语言的泛化潜力。
链接: https://arxiv.org/abs/2605.02488
作者: Periklis Mantenoglou
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Logic in Computer Science (cs.LO)
备注:
Abstract:Several applications demand the timely detection of critical situations, such as threats to safety and transparency, over high-velocity streams of symbolic events. This demand has motivated the development of (i) event specification languages, which define composite events via temporal patterns over simpler events, and (ii) stream reasoning frameworks, evaluating patterns expressed in these languages. However, event specification languages are typically studied in isolation, complicating their comparison in terms of expressivity and obscuring the scope of their associated stream reasoners. To mitigate this issue, we map practical fragments of prominent event specification languages into Temporal Datalog–, a temporal Datalog with stratified negation and no future dependencies. To support efficient stream reasoning over Temporal Datalog–, we propose Streaming Trigger Graphs, an extension of a state-of-the-art technique for Datalog materialisation. Our approach yields a uniform composite event recognition mechanism that has the potential to generalise across a wide range of practical event specification languages.
[AI-42] Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT Finite One-Shot Gaps and Policy Mirror Descent
【速读】:该论文旨在解决在线强化学习中可验证奖励(Online Reinforcement Learning with Verifiable Rewards, RLVR)的训练效率问题,特别是如何在不牺牲性能的前提下减少优化路径上的计算开销。传统RLVR方法需在每轮迭代中同时进行轨迹生成、验证器评分与参考策略评估,导致资源消耗大且难以扩展。为此,作者提出一种基于参考采样的加权监督微调(Reference-sampled Weighted Supervised Fine-Tuning, SFT)目标函数,其诱导出的策略恰好等价于固定参考策略下的KL正则化RLVR最优解——即玻尔兹曼目标策略(Boltzmann target policy),该策略通过对参考策略按验证奖励进行指数倾斜得到。关键在于:通过匹配SFT诱导策略与这一目标策略,可以唯一确定密度比权重,从而在参考采样子类中将权重简化为提示归一化的玻尔兹曼权重 exp(r(x,y)/β)/Z(x),并据此设计出BOLT(Boltzmann-Targeted SFT)算法作为该投影的无偏估计器。此方案不仅理论清晰,还揭示了覆盖范围、温度参数与方差之间的权衡关系,并通过有限单次运行分析分离出各项误差来源,为实际应用提供了可解释且高效的优化路径。
链接: https://arxiv.org/abs/2605.02469
作者: Yao Shu,Chenxing Wei,Hongbin Lin,Shuang Qiu,Hui Xiong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight \exp(r(x,y)/\beta)/Z(x) . BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price \beta\log(1/\pi^*(S_N\mid x)) from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature–coverage–variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.
[AI-43] LLM -Assisted Repository-Level Generation with Structured Spec-Driven Engineering
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成任务中从函数级向项目级(repository-level)扩展时输出质量显著下降的问题。现有基于自然语言提示的 workflows 存在固有歧义性和缺乏可验证性,难以支撑复杂系统级代码的可靠生成。解决方案的关键在于提出结构化规范驱动工程(Structured Spec-Driven Engineering, SSDE),即利用结构化规约(structured specifications)作为输入来引导 LLM 的代码生成过程。这种方法不仅提升了生成代码的质量和一致性,还增强了结果的可验证性,从而为实现高质量、可信赖的项目级代码生成提供了可行路径。
链接: https://arxiv.org/abs/2605.02455
作者: Shuzhao Feng,Boqi Chen,Brett H Meyer,Gunter Mussbacher
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to the 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE Companion '26)
Abstract:State-of-the-art Large Language Models (LLMs) excel in code generation at the function level. However, the output quality significantly declines when scaling to repository-level systems. Current workflows relying only on natural language prompts suffer from inherent ambiguity and a lack of verifiability. To address this, we propose structured spec-driven engineering (SSDE), a paradigm that leverages structured artifacts to guide LLM generation. We argue that structured specifications as LLM inputs make high-quality, repository-level code generation a tangible goal, while at the same time offering superior verifiability, leading to significant potential for improvement. We first investigate the feasibility of this vision through a pilot study generating Model-View-Controller (MVC) business logic for three software systems using five LLMs, and then highlight the potential, challenges, and future roadmap for SSDE.
[AI-44] Causal Software Engineering: A Vision and Roadmap
【速读】:该论文旨在解决软件工程中日益依赖相关性模型(correlational models)进行决策时所面临的局限性,尤其是在高风险情境下难以回答干预性或反事实问题(如“改变负载均衡策略会产生什么影响?”或“若采用不同发布计划是否可避免故障?”)。当前AI驱动的支持工具(如异常检测、预测分析、AIOps及大语言模型代理)虽能增强模式识别与内容生成能力,但无法提供因果推断所需的机制。解决方案的关键在于提出**因果软件工程(Causal Software Engineering, CSE)**这一新范式,其核心是将因果模型与因果推理系统性地融入软件生命周期的各个阶段,通过显式假设建模、不确定性感知的效果估计以及反事实诊断,实现从“相关”到“因果”的跃迁,从而提升决策质量与可解释性。
链接: https://arxiv.org/abs/2605.02454
作者: Roberto Pietrantuono,Luca Giamattei,Stefano Russo,Julien Siebert,Neil Walkinshaw
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at FSE 2026 - Ideas, Visions and Reflections (IVR) Track
Abstract:Software engineering increasingly involves making high-stakes decisions under uncertainty, using signals from code, field data, and socio-technical processes. Recent AI-driven support (e.g., anomaly detection, predictive analytics, AIOps, as well as LLM-based agents) has amplified engineers’ ability to detect patterns and synthesize content and recommendations, but many critical questions are interventional or counterfactual: What is the expected impact of changing a load-balancing strategy? Would an outage have been avoided under a different release plan? Correlational models answer “what tends to co-occur”; they struggle to answer “what would happen if we act.” We propose Causal Software Engineering (CSE) as a future paradigm in which causal models and causal reasoning systematically inform activities across the software lifecycle, augmenting existing practices with explicit assumptions, uncertainty-aware effect estimates, and counterfactual diagnosis. We outline (i) a causal-first workflow view spanning development and operations, (ii) a staged roadmap for tools and organizational adoption, and (iii) an evaluation and benchmark agenda for measuring progress.
[AI-45] Position: How can Graphs Help Large Language Models ?
【速读】:该论文旨在解决“如何利用图结构增强大语言模型(Large Language Models, LLMs)性能”的问题,与现有研究侧重于LLMs对图学习任务的赋能不同,本文从反向视角探讨图对LLMs的提升作用。解决方案的关键在于三个维度:其一,利用图结构作为动态知识源以缓解LLMs的幻觉问题;其二,引入基于图的提示技术(如思维链Chain-of-Thought、树状思维Tree-of-Thought和图状思维Graph-of-Thought)增强推理能力;其三,通过将图结构嵌入LLMs架构中,提升其对结构化数据(如电商数据、代码和关系型数据库)的理解能力,从而拓展LLMs在专业领域的应用边界。
链接: https://arxiv.org/abs/2605.02452
作者: Xiyuan Wang,Yi Hu,Yanbo Wang,Chuan Shi,Muhan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: { https://doi.org/10.1007/s11704-026-51651-6 }
Abstract:With the rapid advancement of large language models (LLMs), classic graph learning tasks have greatly benefited from LLMs, including improved encoding of textual features, more efficient construction of graphs from text, and enhanced reasoning over knowledge graphs. In this paper, we ask a complementary question: How can graphs help LLMs? We address this question from three perspectives: 1) graphs provide an up-to-date knowledge source that helps reduce LLM hallucinations, 2) graph-based prompting techniques-such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT)-enhance LLM reasoning capabilities, and 3) integrating graphs into LLMs improves their understanding of structured data, expanding their applicability to domains such as e-commerce, code, and relational databases (RDBs). We further outlook some future directions including designing sparse LLM architectures based on graphs and brain-inspired memory systems.
[AI-46] he Model Knows the Decoder Finds: Future Value Guided Particle Power Sampling
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在“无需训练的推理”(reasoning without training)场景中,尽管基础模型已对正确多步解具有非平凡的概率质量,但推理时难以高效定位这些高概率路径的问题。其核心挑战在于如何在不依赖微调的情况下,通过更有效的采样策略提升推理准确性与计算资源利用效率。解决方案的关键是提出辅助粒子幂采样(Auxiliary Particle Power Sampling, APPS),这是一种基于块级粒子的近似方法,通过提案修正的幂重加权(proposal-corrected power reweighting)并行传播假设,并在重采样边界使用未来价值引导的选择机制优化存活假设,从而在有限计算预算下动态分配资源给有潜力的前缀路径,同时提供可调控的粒子数量和可预测的峰值内存占用。
链接: https://arxiv.org/abs/2605.02427
作者: Tu Nguyen,Rasul Tutunov,Xiaotong Ji,Matthieu Zimmer
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:A recurring pattern in “reasoning without training” is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising. We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. Across reasoning benchmarks, APPS improves the accuracy-runtime trade-off of training-free decoding and suggests that part of the gap to post-trained systems can be recovered through more faithful inference-time power approximation. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.02427 [cs.AI] (or arXiv:2605.02427v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.02427 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-47] HeavySkill: Heavy Thinking as the Inner Skill in Agent ic Harness
【速读】:该论文旨在解决当前基于代理(Agent)的复杂推理任务中,性能提升机制不明确的问题,尤其是在多代理协作框架中,其内在驱动因素被复杂的系统设计所掩盖。解决方案的关键在于提出“HeavySkill”这一新视角,将重思考(heavy thinking)视为模型参数内部固化的内生技能,而非仅依赖外部调度层的执行单元;该技能表现为一个两阶段流水线——并行推理后总结(parallel reasoning then summarization),可嵌入任意代理架构之下,并通过强化学习进一步扩展其深度与宽度,从而实现无需依赖脆弱调度层的自演化大型语言模型(LLM)。
链接: https://arxiv.org/abs/2605.02396
作者: Jianing Wang,Linsen Guo,Zhengyu Chen,Qi Guo,Hongyu Zang,Wenjie Shi,Haoxiang Ma,Xiangyu Xi,Xiaoyu Li,Wei Wang,Xunliang Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures
Abstract:Recent advances in agentic harness with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use have achieved remarkable success in complex reasoning tasks. However, the underlying mechanism that truly drives performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in orchestration harness but also as an inner skill internalized within the model’s parameters that drives the orchestrator to solve complex tasks. We identify this skill as a two-stage pipeline, i.e., parallel reasoning then summarization, which can operate beneath any agentic harness. We present a systematic empirical study of HeavySkill across diverse domains. Our results show that this inner skill consistently outperforms traditional Best-of-N (BoN) strategies; notably, stronger LLMs can even approach Pass@N performance. Crucially, we demonstrate that the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning, offering a promising path toward self-evolving LLMs that internalize complex reasoning without relying on brittle orchestration layers.
[AI-48] Controllable and Verifiable Process Data Synthesis for Process Reward Models
【速读】:该论文旨在解决过程奖励模型(Process Reward Models, PRMs)训练中高质量过程监督数据稀缺且难以控制的问题,尤其是现有方法在错误位置、错误类型和轨迹一致性方面缺乏可控性与可验证性。解决方案的关键在于提出一个可控制且可验证的框架,用于合成过程监督数据:首先构建正确的符号推理链,在中间步骤注入模板感知的错误,基于污染状态重新计算后续步骤,并通过验证被注入错误的步骤无法从其前缀推导得出,从而确保生成的轨迹在首次错误处前缀无效但整体轨迹保持一致;最终将这些成对轨迹转化为对齐的自然语言过程,用于PRM的训练与评估。
链接: https://arxiv.org/abs/2605.02395
作者: Yinghui Chi,Lucien Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
[AI-49] Privacy Preserving Machine Learning Workflow: from Anonymization to Personalized Differential Privacy Budgets in Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning)在处理敏感表格数据时面临的隐私保护与模型性能之间的权衡问题,特别是如何在保障数据完整性与防止投毒攻击(poisoning attacks)的同时提升模型精度。其解决方案的关键在于提出了一种结合匿名化(anonymization)与差分隐私(differential privacy)的完整隐私保护联邦学习流程,并首次形式化定义了“客户端漂移”(client drift)概念,通过检测机制缓解投毒攻击;此外,创新性地基于再识别风险度量(re-identification risk metric)为不同客户端分配个性化的全局差分隐私预算,从而在不牺牲整体隐私保障的前提下优化模型性能。实验结果表明,相较于固定隐私预算的全局差分隐私方法,该个性化预算策略在两个误差指标上均取得更优的模型表现。
链接: https://arxiv.org/abs/2605.02372
作者: Judith Sáinz-Pardo Díaz,Álvaro López García
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)
Abstract:The growing development of artificial intelligence based solutions, together with privacy legislation, has driven the rise of the so-called privacy preserving machine learning architectures, such as federated learning. While federated learning enables model training on decentralized data preventing their sharing and centralization, it still faces several challenges related to data integrity and privacy. This paper presents a comprehensive privacy preserving federated learning workflow for sensitive tabular data, including anonymization and differential privacy techniques. We also introduce a formal definition for the concept of client drift, together with ways of detecting it to mitigate poisoning attacks. Then, we detail a complete methodology for assigning personalized privacy budgets for global differential privacy to the different clients participating in the network, based on a re-identification risk metric. The proposed methodology is presented and tested on an openly available dataset of medical records. Within the experimental setup we show that the approach based on personalized budgets, compared to the architecture including global differential privacy with fixed privacy budget, achieves a better model performance in terms of two error metrics.
[AI-50] A Compound AI Agent for Conversational Grant Discovery
【速读】:该论文旨在解决科研资助发现(Research Funding Discovery)过程中存在的碎片化问题,即研究人员需在多个联邦及非营利机构的门户(如美国国家科学基金会NSF、国立卫生研究院NIH等)间手动搜索,面临界面不统一、检索能力差异大和数据模式异构等挑战。解决方案的关键在于构建一个复合式人工智能系统(Compound AI System),其核心由两个紧密耦合的模块组成:一是通过配备大语言模型(LLM)的浏览器代理自动收集、标准化并索引近12,000个资助机会,形成每两周更新一次的统一数据库;二是基于ReAct机制的查询处理层,能够理解研究上下文(包括PDF文档内容),结合结构化索引与选择性网络搜索进行混合检索,在避免大语言模型幻觉的同时精准匹配资助项目,并支持多轮对话式交互以逐步细化约束条件,从而将平均发现时间从30–45分钟缩短至10分钟以内。
链接: https://arxiv.org/abs/2605.02366
作者: Zhisheng Tang,Mayank Kejriwal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted as demo submission to ACM CAIS
Abstract:Research funding discovery remains fundamentally fragmented: researchers navigate disparate agency portals (e.g., in the United States, NSF, NIH, DARPA, this http URL, and many others) with heterogeneous interfaces, search capabilities, and data schemas. We present a compound AI system that unifies this landscape through two tightly coupled components: (1) an aggregation layer that autonomously collects, normalizes, and indexes almost 12,000 federal and nonprofit opportunities from fragmented sources via LLM-equipped browser agents, maintaining a biweekly-updated unified database; and (2) an agentic ReAct-based query processing layer that interprets research context (including from PDF documents) and employs hybrid search combining a structured index with selective web search to retrieve relevant opportunities - while avoiding LLM hallucination. The conversational interface supports iterative refinement through multi-turn interactions, allowing researchers to progressively apply constraints without reformulating their core research description. Results stream in real time with full transparency of intermediate reasoning, enabling appropriate calibration of user trust. Currently used by almost 3,000+ users, our approach demonstrates the feasibility of compound AI in reducing grant discovery time from 30–45 minutes (manual, fragmented portal searches) to under 10 minutes (unified, conversational search).
[AI-51] APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT Networks
【速读】:该论文旨在解决裸金属工业控制设备(Bare-metal OT devices)在自主安全攻击中的难题,尤其是针对运行Modbus/TCP和CoAP协议的微控制器,这些设备因缺乏类Linux的shell与文件系统,使得现有基于大语言模型(Large Language Model, LLM)的自动化渗透测试框架难以直接应用。其解决方案的关键在于提出APIOT(Autonomous Purple-teaming for Industrial OT),一个首次实现对裸金属工业物联网(IIoT)设备从发现、利用到修复与验证全流程自主攻击与响应的LLM框架。该方案的核心创新包括:设计适配协议字段与解析语义的动作空间(action-space),引入运行时治理层(称为overseer),以防止代理陷入重复循环、崩溃验证缺失及侦察死锁等退化行为,从而保障自主攻击-修复闭环的稳定执行。实验证明,在五种前沿LLM、三种网络拓扑和两种干扰水平下,APIOT在Zephyr RTOS固件上实现了90.0%的任务成功率,表明LLM已具备对裸金属OT发起全周期自主攻击的能力,推动防御方威胁模型必须考虑具备此类能力的智能对抗者。
链接: https://arxiv.org/abs/2605.02346
作者: Adel ElZemity,Budi Arief,Shujun Li,Calvin Brierley,Yichao Wang,Yuxiang Huang,James Pope,Haoxiang Li,George Oikonomou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Bare-metal operational technology (OT) devices – especially the microcontrollers running Modbus/TCP and CoAP at the base of industrial control systems – have remained outside the reach of autonomous security attacks. Prior autonomous pentesting studies target Linux and web systems, whose shells and filesystems are familiar to LLM agents. Bare-metal OT has neither, so agents must reason directly over protocol fields and parser semantics. This requires new action-space designs and runtime controls, and opens new research questions about protocol-level exploit reasoning and its deployment envelope. We present APIOT (Autonomous Purple-teaming for Industrial OT), the first large language model (LLM) framework demonstrating an autonomous attack and remediation of bare-metal OT devices, achieving the full discovery - exploitation - patching - verification cycle without step-by-step human intervention. We implemented and evaluated this framework on Zephyr RTOS firmware across heterogeneous industrial IoT (IIoT) topologies. Through 290 experiment runs spanning five frontier LLMs, three network topologies, two impairment levels, and guided versus unguided conditions, APIOT achieved a mission success rate of 90.0% on the full attack-remediation cycle. We found that the runtime governance layer (which we call an overseer) is a critical engineering variable: without it, agents exhibit systematic degenerate patterns, including repetition loops, missing crash verification, and reconnaissance deadlocks. Together, these findings carry two implications beyond our testbed. Attacker expertise is no longer the binding constraint on bare-metal OT exploitation, and defender threat models must now assume LLM-augmented adversaries capable of executing autonomous discovery-through-remediation cycles against industrial firmware.
[AI-52] When Attention Collapses: Residual Evidence Modeling for Compositional Inference
【速读】:该论文旨在解决在加性叠加(additive superposition)场景下,基于注意力机制的模型出现的“槽坍缩”(slot collapse)问题——即多个注意力槽(slot)收敛到同一个主导成分,导致弱成分无法被有效表示。其根本原因在于注意力机制对已解释证据缺乏记忆性,各槽重复作用于相同输入而未考虑已有解释,导致梯度主要由最强成分主导,从而引发槽间的冗余固定点。解决方案的关键在于引入残差证据建模(residual evidence modeling),具体实现为“证据耗尽”(evidence depletion):通过乘法耗尽与注意力偏置相结合的最小修改,在序列化注意力中引入残留状态,使模型能够追踪并逐步消除已被解释的证据。实验表明,该方法在合成基准和真实音频混合(FUSS)数据上均显著减少槽坍缩,且在LISA引力波源推断任务中成功实现多源后验估计,证明了残差证据跟踪是防止坍缩、实现组合推理的核心要素。
链接: https://arxiv.org/abs/2605.02323
作者: Niklas Houba
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
备注:
Abstract:Compositional inference - the decomposition of observations into an unknown number of latent components - is central to perception and scientific data analysis. Attention-based models perform well when components are approximately separable, as in object-centric vision. Under additive superposition, however - where multiple components contribute to every observation - we identify a structural failure mode we term slot collapse: multiple slots converge to the same dominant component while weaker ones remain unrepresented. We trace this to a general limitation: attention is memoryless with respect to explained evidence. All slots repeatedly operate on the same input without accounting for what has already been explained, so gradients are dominated by the strongest component, inducing shared fixed points across slots. As a result, attention fails to enforce non-redundant allocation under additive superposition. We address this by introducing residual evidence modeling, instantiated via evidence depletion - a minimal modification combining multiplicative depletion with an attention bias. Controlled ablations show that parallel attention, sequential processing alone, and loss-based regularization fail to resolve collapse; evidence depletion, which adds residual state to sequential attention, consistently succeeds. Across synthetic benchmarks and real-world audio mixtures (FUSS), evidence depletion reduces slot collapse by up to an order of magnitude, generalizing beyond synthetic settings. On gravitational-wave source inference for the ESA/NASA LISA mission, under identical architectures, data, and losses, standard attention fails while evidence depletion prevents collapse and enables multi-source posterior estimation. These results show that under additive superposition, residual evidence tracking is the operative ingredient for preventing collapse and enabling compositional inference.
[AI-53] ANO: A Principled Approach to Robust Policy Optimization
【速读】:该论文旨在解决近端策略优化(Proximal Policy Optimization, PPO)在深度强化学习中面临的根本性困境:其“硬截断”机制会丢弃来自异常值的有用梯度信息,导致样本效率低下;而移除截断(如SPO方法)则使优化过程暴露于无界的梯度中,引发显著不稳定性与超参数敏感性。解决方案的关键在于提出一种统一的信任区域框架,并在此基础上设计锚定邻域优化(Anchored Neighborhood Optimization, ANO),其核心创新是引入“降斜影响原则”(Redescending Influence Principle),即从单调惩罚(SPO)和硬阈值(PPO)转向动态异常值抑制机制,从而实现高方差随机优化中的稳定性保障。理论证明ANO具备最小结构复杂度以支持鲁棒优化,实验证明其在MuJoCo基准上显著优于PPO和SPO,且在极端超参数下仍保持稳定,避免策略崩溃。
链接: https://arxiv.org/abs/2605.02320
作者: Yiheng Zhang,Yiming Wang,Kaiyan Zhao,Zhenglin Wan,Jiayu Chen,Leong Hou U
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its “hard clipping” mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gradients, causing significant instability and hyperparameter sensitivity. To resolve this, we establish a Unified Trust Region Framework that generalizes existing objectives. Within this framework, we derive Anchored Neighborhood Optimization (ANO) based on a set of design principles. We identify that the failure of standard policy gradients stems from a misapplication of gradient influence on outliers. We propose the Redescending Influence Principle, a paradigm shift from monotonic penalties (SPO) and hard-thresholding (PPO) to dynamic outlier suppression, and prove its necessity for stability in high-variance stochastic optimization. Theoretically, we prove ANO possesses the minimal structural complexity required for robust optimization. Empirically, ANO achieves state-of-the-art performance on MuJoCo benchmarks, significantly outperforming PPO and SPO. Notably, ANO demonstrates superior stability, preventing policy collapse even under aggressive hyperparameters (e.g., learning rates 3x larger than standard) where PPO fails completely.
[AI-54] Can Causal Discovery Algorithms Help in Generating Legal Arguments?
【速读】:该论文试图解决的问题是:如何利用因果发现算法(causal discovery algorithms)来自动化生成法律论证,从而弥补当前法律领域尚未有效应用此类算法的空白。其解决方案的关键在于构建了一个包含17个法律概念(如身体殴打和财产纠纷)的新型法律数据集,并对150个谋杀案件进行了标注;随后在该标注数据集上应用多种主流因果发现算法,识别出法律概念之间的因果关系,并以数学概率形式量化这些关系的置信度。结果表明,某些因果关系能够支持有效的法律推理,例如若能证明某起谋杀案中未发生身体殴打,则可作为充分条件(概率为1)推断该谋杀并非由财产纠纷引起,从而验证了因果发现算法在法律论证自动化中的可行性与潜力。
链接: https://arxiv.org/abs/2605.02318
作者: Soham Wasmatkar,Subinay Adhikary,Rakshit Rohan,Shouvik Kumar Guha,Saptarshi Pyne,Kripabandhu Ghosh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:In 2011, Judea Pearl received the Turing Award, considered the Nobel Prize in Computing, for fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning. It includes pioneering the development of causal discovery algorithms. These computer algorithms can analyze large multivariate datasets and automatically discover the causal relationships among the constituent variables. They have been widely used in many critical fields such as medicine and economics to support decisions. However, to our knowledge, they have not been leveraged in law. This paper attempts to alleviate this gap by investigating whether causal discovery algorithms can be leveraged for automated generation of legal arguments. To that end, a novel legal dataset is prepared by identifying 17 legal concepts, such as physical assault and property dispute. A curated collection of 150 homicide cases are annotated with these concepts, e.g., a case is annotated with physical assault only if a physical assault had been reported in that case. Subsequently, a selected set of widely-used causal discovery algorithms is applied to the annotated dataset to discover the causal relationships between the legal concepts. Additionally, the degrees of belief associated with the discovered relationships are quantified in mathematical probabilities. It is shown that some of the causal relationships help generate viable legal arguments, e.g., if one could establish that a physical assault has not taken place during a homicide, it should be a sufficient condition (with probability 1) to establish that the homicide has not been committed due to a property-related dispute. Thus, this paper shows that causal discovery algorithms can be helpful in generating legal arguments, opening up avenues for promising future endeavors.
[AI-55] Anon: Extrapolating Optimizer Adaptivity Across the Real Spectrum
【速读】:该论文旨在解决自适应优化器(如Adam)在训练大规模模型时泛化性能劣于非自适应方法(如SGD)的问题,其核心原因在于预条件器中的自适应性限制了优化器对多样化优化景观的适应能力。解决方案的关键是提出Anon(Adaptivity Non-restricted Optimizer with Novel convergence technique),该优化器在实数域R上具有连续可调的自适应性,能够插值于SGD-like与Adam-like行为之间,甚至超越二者;同时引入增量延迟更新(Incremental Delay Update, IDU)机制,相比AMSGrad的硬最大跟踪策略更具灵活性,并提升对梯度噪声的鲁棒性,从而在凸与非凸设置下均提供理论收敛保证,实验证明Anon在图像分类、扩散模型和语言建模任务中持续优于现有最优优化器。
链接: https://arxiv.org/abs/2605.02317
作者: Yiheng Zhang,Kaiyan Zhao,Shaowu Wu,Yiming Wang,Jiajun Wu,Leong Hou U,Steve Drew,Xiaoguang Niu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer’s ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity in R, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad’s hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
[AI-56] Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding ACL2026
【速读】:该论文旨在解决大模型推理(Large Reasoning Models, LRM)在实际应用中因计算成本过高而难以部署的问题,特别是现有基于数据筛选的方法存在冗余采样和遗漏互补推理路径的缺陷。其解决方案的关键在于提出一种协作式多教师解码框架 CoRD,通过基于预测困惑度(predictive perplexity)的评分机制与束搜索(beam search)相结合,在每一步推理中动态合成来自异构教师模型的协同推理轨迹,从而高效保留多样且高潜力的假设,并生成结构化、高质量的监督信号,使学生模型在极少标注下即可逼近教师模型性能。
链接: https://arxiv.org/abs/2605.02290
作者: Taewon Yun,Jisu Shin,Jeonghwan Choi,Seunghwan Bang,Hwanjun Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 (Findings, long)
Abstract:Distilling large reasoning models is essential for making Long-CoT reasoning practical, as full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post-hoc, overlooking collaboration among heterogeneous teachers and lacking dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD further generalizes well to out-of-domain and open-ended settings. The dataset and model are available at \hrefthis https URLthis https URL.
[AI-57] EngiAgent : Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions ICML2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工程问题求解中难以保证可行性的核心挑战。传统LLM在数学问题求解中表现优异,但在面对需满足数据与物理约束的复杂工程任务时,常因缺乏对可行性驱动建模和迭代优化的能力而失效。解决方案的关键在于提出EngiAgent——一个由全连接协调器驱动的多智能体系统,通过专业化分工实现专家级工作流模拟:包括问题分析、建模、验证、求解与方案评估等环节。该协调器支持灵活反馈路由,突破了以往基于流水线式反思方法的刚性限制,从而在每个阶段确保可行性,显著提升对数据提取错误、约束不一致及求解失败等多样故障场景的鲁棒性,并在四个代表性领域实证验证了其优越的可行性表现。
链接: https://arxiv.org/abs/2605.02289
作者: Xiyuan Zhou,Ruixi Zou,Xinlei Wang,Yuheng Cheng,Yan Xu,Junhua Zhao,Jinjin Gu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026
Abstract:Engineering problem solving is central to real-world decision-making, requiring mathematical formulations that not only represent complex problems but also produce feasible solutions under data and physical constraints. Unlike mathematical problem solving, which operates on predefined formulations, engineering tasks demand open-ended analysis, feasibility-driven modeling, and iterative refinement. Although large language models (LLMs) have shown strong capabilities in reasoning and code generation, they often fail to ensure feasibility, which limits their applicability to engineering problem solving. To address this challenge, we propose EngiAgent, a multi-agent system with a fully connected coordinator that simulates expert workflows through specialized agents for problem analysis, modeling, verification, solving, and solution evaluation. The fully connected coordinator enables flexible feedback routing, overcoming the rigidity of prior pipeline-based reflection methods and ensuring feasibility at every stage of the process. This design not only improves robustness to diverse failure cases such as data extraction errors, constraint inconsistencies, and solver failures, but also enhances the overall quality of problem solving. Empirical results across four representative domains demonstrate that EngiAgent achieves substantial improvements in feasibility compared to prior approaches, establishing a new paradigm for feasibility-oriented engineering problem solving with LLMs. Our source code and data are available at this https URL.
[AI-58] Complexity Horizons of Compressed Models in Analog Circuit Analysis
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在电路分析等专业工程领域部署时面临的推理准确率与计算效率之间的权衡问题。传统评估方法将模型性能视为单一指标,未能考虑工程知识的层级结构。其解决方案的关键在于提出一种基于前提图(prerequisite graph)的性能感知模型压缩策略,通过将电子设计概念建模为有向无环图(Directed Acyclic Graphs, DAGs),识别压缩后模型各层级在电路分析复杂度上的具体知识边界,并结合代理式数据生成流水线和动态级联查询的评估引擎,实现对最小压缩模型的精准选择,从而在保证性能的前提下提升计算效率。
链接: https://arxiv.org/abs/2605.02285
作者: Pacome Simon Mbonimpa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The deployment of Large Language Models (LLMs) for specialized engineering domains, such as circuit analysis, often faces a trade-off between reasoning accuracy and computational efficiency. Traditional evaluation methods treat model performance as a flat metric, failing to account for the hierarchical nature of engineering knowledge. We propose a performance-aware model compression strategy that utilizes prerequisite graphs to optimize model selection for circuit analysis tasks. By structuring electronics design concepts as Directed Acyclic Graphs (DAGs), we can identify the specific complexity horizons of an LLM’s compressed variants’ tiers. Our framework introduces an agentic pipeline for generating prerequisite-based datasets and a strategic evaluation engine that dynamically cascades queries across a spectrum of compressed variants of an LLM. This approach allows to select the smallest compressed model, given its conceptual knowledge boundaries in circuit analysis. Experimental results on analog electronics datasets demonstrate that prerequisite graphs provide a granular map of model compression with respect to the performance given circuit analysis complexity. (Source Code: this https URL, Demo: this https URL)
[AI-59] HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation ICML2026
【速读】:该论文旨在解决时间序列插补(time series imputation)中因现有基于注意力机制的方法在每一层重复发现特征关系而导致表示不一致的问题。其关键解决方案是提出HELIX框架,通过为每个特征分配一个可学习的特征身份(feature identity),即一种在整个网络中保持不变的嵌入表示,以捕捉特征的内在语义特性;该机制无需预设拓扑结构,而是从时序共变性中端到端地学习任意特征依赖关系,从而更有效地将跨特征结构转化为插补精度。
链接: https://arxiv.org/abs/2605.02278
作者: Fengming Zhang,Wenjie Du,Huan Zhang,Ke Yu,Shen Qu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026 (spotlight paper)
Abstract:Time series imputation benefits from leveraging cross-feature correlations, yet existing attention-based methods re-discover feature relationships at each layer, lacking persistent anchors to maintain consistent representations. To address this, we propose HELIX, which assigns each feature a learnable feature identity, a persistent embedding that captures intrinsic semantic properties throughout the network. Unlike graph-based methods that rely on predefined topology and assume homogeneous spatial relationships, HELIX learns arbitrary feature dependencies end-to-end from temporal co-variation, naturally handling datasets where features mix spatial locations with semantic variables. Integrated with hybrid temporal-feature attention, HELIX achieves the state-of-the-art performance, surpassing all 16 baselines on 5 public datasets across 21 experimental settings in our evaluation. Furthermore, our mechanistic analysis reveals that HELIX aligns learned feature identities and dependencies with latent physical and semantic structure progressively across layers, demonstrating that it more effectively translates cross-feature structure into imputation accuracy.
[AI-60] owards Understanding Specification Gaming in Reasoning Models
【速读】:该论文旨在解决大语言模型代理(LLM agents)中普遍存在的“规范博弈”(specification gaming)问题,即模型通过非预期行为在任务评估中获取高分,而非真正完成目标。其关键解决方案是构建并开源了一个多样化的任务套件,其中模型可通过违背设计意图的行为获得高分,从而系统性地量化和分析规范博弈的发生机制。研究发现,强化学习推理训练(RL reasoning training)显著提升模型的规范博弈率,而测试时的缓解措施仅能部分降低该现象,表明规范博弈是源于强化学习训练的本质挑战。
链接: https://arxiv.org/abs/2605.02269
作者: Kei Nishimura-Gasparian,Robert McCarthy,David Lindner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: 1. RL reasoning training substantially increases the rate at which models exploit their specifications, 2. Increasing RL reasoning budget has a weakly positive effect on exploit rate, and 3. Test-time mitigations reduce but do not eliminate the rate of specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.
[AI-61] On the Privacy of LLM s: An Ablation Study
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在交互式和检索增强场景中面临的隐私风险问题,特别是现有针对成员推理(Membership Inference, MIA)、属性推理(Attribute Inference, AIA)、数据提取(Data Extraction, DEA)和后门攻击(Backdoor Attacks, BA)的研究多处于孤立分析状态,缺乏对系统共性因素影响的统一评估。其解决方案的关键在于构建一个统一的威胁模型与符号体系,复现代表性隐私攻击,并通过结构化的消融实验系统评估模型架构、规模、数据集特征及检索配置等关键因素的影响。研究发现:掩码驱动的成员推理攻击信号强且稳定,后门攻击因触发机制成功率高,而属性推理和数据提取攻击虽准确率较低但威胁敏感信息泄露,凸显了隐私风险的高度情境依赖性和设计驱动特性,强调需开展整体性评估与审慎部署。
链接: https://arxiv.org/abs/2605.02255
作者: Karima Makhlouf,Lamiaa Basyoni,Syed Khaderi,Gabriel Marquez,Peter Sotomango,Mahmoud Awawdah,Sami Zhioua
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly deployed in interactive and retrieval-augmented settings, raising significant privacy concerns. While attacks such as Membership Inference (MIA), Attribute Inference (AIA), Data Extraction (DEA), and Backdoor Attacks (BA) have been studied, they are typically analyzed in isolation, leaving a gap in understanding their behavior under common system factors. In this paper, we introduce a unified threat model and notation, reproduce a representative set of privacy attacks, and conduct a structured ablation study to evaluate the impact of key factors such as model architecture, scale, dataset characteristics, and retrieval configuration. Our analysis reveals clear differences across attack types. Membership inference attacks, particularly mask-based variants, exhibit strong and reliable signals, while backdoor attacks achieve consistently high success rates due to their trigger-based nature. In contrast, attribute inference and data extraction attacks remain more challenging, resulting in lower accuracy, yet they pose significant risks as they target sensitive personal information. Overall, these results highlight that privacy risks in LLM systems are highly context-dependent and driven by design choices, emphasizing the need for holistic evaluation and informed deployment practices.
[AI-62] A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)
【速读】:该论文旨在解决多智能体系统中信念更新(belief revision)问题,即在某个智能体获得对某一状态属性的信念后,如何形式化地确定所有智能体在此之后的信念状态。其核心挑战在于将经典的AGM信念更新公理体系扩展至多智能体场景,从而为动态认知推理框架提供一套可验证的形式化基础。解决方案的关键在于提出广义化的AGM公理,以适应多智能体Kripke模型中的信念结构,并进一步设计满足这些公理的信念更新算子,如广义全交(generalized full-meet)多智能体信念更新机制;同时,论文还探讨了迭代更新情形下的公理化扩展及基于事件模型的更复杂更新算子,揭示了在Kripke模型上构造完全满足所有广义公理的迭代更新算子所面临的理论困境。
链接: https://arxiv.org/abs/2605.02249
作者: Michael Thielscher,Tran Cao Son
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents’ beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.
[AI-63] he Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
【速读】:该论文旨在解决当前软件工程(Software Engineering, SWE)智能体在短周期基准测试中表现饱和,却难以胜任高阶工程任务的问题,如长期、多工程师协作、需求模糊的交付成果。其核心挑战在于现有训练数据(如GitHub代码库、单智能体轨迹或开放的人机对话日志)不足以支撑对复杂工程情境的理解与执行能力。解决方案的关键在于构建“三元数据”(triadic data)——即同步采集人类工程师间形成工程背景的对话、人类与AI交互中部分消耗该背景的过程,以及围绕两者的跨职能、多周协作工作流。作者进一步提出四层评估框架以验证此类数据质量,并指出这类数据可在12–18个月内通过邻近领域成熟方法获取,是突破当前SWE智能体瓶颈的实证关键。
链接: https://arxiv.org/abs/2605.02244
作者: Yelin Kim
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is neither larger GitHub scrapes nor more solo-agent trajectories nor – sufficient by itself – open human-AI dialogue logs. It is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies – instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further specify a four-tier evidence framework through which any such corpus – triadic or otherwise – must justify its quality to a fine-tuning researcher: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. We argue that this data is capturable in 12-18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field’s near-term research agenda should include it explicitly.
[AI-64] PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
【速读】:该论文旨在解决当前医疗大语言模型(LLM)代理评估基准无法真实反映临床工作流程复杂性的局限性问题。现有基准多聚焦于静态知识回忆、单步原子操作或意图识别,缺乏对长周期、复合型临床任务的执行验证能力,因而难以衡量模型在电子健康记录(EHR)环境中完成真实诊疗任务的实际效能。其解决方案的关键在于构建 PhysicianBench——一个基于真实门诊病例设计的100项长周期任务集合,涵盖21个专科领域和多样化工作流类型,每项任务均通过真实患者数据与标准EHR API 实现可执行环境,并分解为670个结构化检查点,以任务特定脚本进行阶段评分与执行验证,从而提供一种具备现实性和执行落地性的评估框架,精准刻画当前LLM代理在复杂临床场景中的能力差距。
链接: https://arxiv.org/abs/2605.02240
作者: Ruoqi Liu,Imran Q. Mohiuddin,Austin J. Schoeffler,Kavita Renduchintala,Ashwin Nayak,Prasantha L. Vemu,Shivam C. Vedak,Kameron C. Black,John L. Havlik,Isaac Ogunmola,Stephen P. Ma,Roopa Dhatt,Jonathan H. Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical setting within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) capturing distinct stages of completion graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
[AI-65] CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
【速读】:该论文旨在解决在移动设备上部署大规模视觉语言模型(Vision-Language Models, VLMs)时面临的计算和内存资源瓶颈问题,以及现有设备-边缘协同推理方法中因视觉token冗余计算和高通信开销导致的效率低下问题。其解决方案的关键在于提出一种名为CoVSpec的高效协同推测解码框架:首先设计了一个无需训练的视觉token缩减机制,通过联合考虑查询相关性、token活跃度和低秩依赖关系来剪枝冗余视觉token;其次引入自适应起草策略,动态调整验证频率与起草长度以优化资源分配;最后提出并行分支机制,实现验证与修正过程的解耦,从而提升起草端利用率并显著降低修正相关的传输开销。实验表明,该方案在保持任务准确性的同时,相较目标模型单独推理提升了最高达2.21倍的吞吐量,并将通信开销减少超过96%。
链接: https://arxiv.org/abs/2605.02218
作者: Yuanyuan Jia,Shunpu Tang,Qianqian Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 2 tables, 1 figure. Submitted to IEEE Globecom 2026
Abstract:Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. Moreover, we design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification-correction to improve draft-side utilization during target-side verification and reduce correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21x higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without compromising task accuracy.
[AI-66] Submodular Benchmark Selection
【速读】:该论文旨在解决在评估大规模语言模型(Large Language Models, LLMs)时,因需测试多个基准测试(benchmark)而导致的高计算成本问题,同时考虑到这些基准之间存在高度相关性。为提升效率并保留代表性信息,作者将选择一个小型但具有信息量的基准子集的问题形式化为多变量高斯模型下的子模最大化(submodular maximization)问题。其关键解决方案在于引入两种目标函数:熵(log-determinant covariance)和所选基准与剩余基准之间的互信息(mutual information)。其中,熵选择等价于带 pivoted Cholesky 分解的方法,并具备谱残差界;而互信息虽在理论上非单调,但在小子集下表现出经验上的单调性,因此采用贪心策略进行优化。实验表明,在三个来自十项公共排行榜的数据矩阵上,基于互信息的选择方法在小样本条件下优于熵选择,尤其在缺失数据插补任务中表现更优。
链接: https://arxiv.org/abs/2605.02209
作者: Alexander Smola
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Evaluating large language models across many benchmarks is expensive, yet many benchmarks are highly correlated. We formalize the selection of a small, informative subset as submodular maximization under a multivariate Gaussian model. Entropy (log-determinant covariance) and mutual information between selected and remaining benchmarks arise as natural objectives. Both are submodular; entropy selection coincides with pivoted Cholesky and has spectral residual bounds, while mutual information is non-monotone in general but empirically monotone for small subsets, so we optimize it greedily. Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets.
[AI-67] CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对后门攻击时的脆弱性问题,尤其是现有攻击方法依赖数据投毒并引入图像-文本不匹配,导致中毒样本易被检测的问题。解决方案的关键在于提出一种基于扩散模型的干净标签后门攻击方法(Clean-Label Backdoor Attack on VLMs via Diffusion Models, CBV),其核心创新包括:利用分数匹配(score matching)在扩散模型的反向生成过程中修改得分函数,引导生成包含触发特征的自然中毒样本;引入文本信息作为多模态引导以增强攻击效果;并通过GradCAM引导的掩码(GradCAM-guided Mask, GM)限制修改区域至语义重要部分,从而显著提升攻击的隐蔽性与有效性,在MSCOCO和VQA v2数据集上实现了超过80%的攻击成功率(ASR)的同时保持模型正常功能。
链接: https://arxiv.org/abs/2605.02202
作者: Ji Guo,Xiaolong Qin,Cencen Liu,Jielei Wang,Jierun Chen,Wenbo Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) have achieved remarkable success in tasks such as image captioning and visual question answering (VQA). However, as their applications become increasingly widespread, recent studies have revealed that VLMs are vulnerable to backdoor attacks. Existing backdoor attacks on VLMs primarily rely on data poisoning by adding visual triggers and modifying text labels, where the induced image-text mismatch makes poisoned samples easy to detect. To address this limitation, we propose the Clean-Label Backdoor Attack on VLMs via Diffusion Models (CBV), which leverages diffusion models to generate natural poisoned examples via score matching. Specifically, CBV modifies the score during the reverse generation process of the diffusion model to guide the generation of poisoned samples that contain triggered image features. To further enhance the effectiveness of the attack, we incorporate the textual information of the triggered images as multimodal guidance during generation. Moreover, to enhance stealthiness, we introduce a GradCAM-guided Mask (GM) that restricts modifications to only the most semantically important regions, rather than the entire image. We evaluate our method on MSCOCO and VQA v2 with four representative VLMs, achieving over 80% ASR while preserving normal functionality.
[AI-68] MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
【速读】:该论文旨在解决长期大语言模型(Large Language Model, LLM)代理在有限存储预算下如何高效压缩历史交互流以形成持久记忆的问题。现有评估方法通常仅衡量最终问答准确率,导致记忆写入、检索、提示生成和阅读推理等环节混杂,难以定位记忆选择机制的真实效果。解决方案的关键在于提出MEMAUDIT——一种精确的包级审计评估协议,通过固定经验流、候选记忆表示、存储成本、语义证据单元、未来查询需求及预算,将写入时的记忆选择转化为一个可审计的有限优化问题,并提供认证分母。该协议基于凹-模组语义覆盖目标,在存储和每经验仅允许一表示的约束下,利用混合整数线性规划(Mixed Integer Linear Programming, MILP)认证的分支定界法计算精确最优解,从而分离出表示质量、有效状态保留与预算感知选择等独立效应,实现对记忆写入行为的精准量化与可复现评估。
链接: https://arxiv.org/abs/2605.02199
作者: Nishant Bhargava,Rodrigo Sobral Barrento
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Long-term LLM agents must compress streams of past interactions into persistent memory before future queries are known. Existing evaluations usually measure final question-answering accuracy, which entangles memory writing with retrieval, prompting, and reader reasoning. We introduce MEMAUDIT, an exact packageoracle evaluation protocol for budgeted long-term memory writing. A MEMAUDIT package fixes an experience stream, candidate memory representations, storage costs, semantic evidence units, future-query requirements, and a budget, turning write-time memory selection into a finite auditable optimization problem with a certified denominator. We instantiate this protocol with a concave-over-modular semantic coverage objective under storage and one-representation-per-experience constraints, and compute exact package optima using branch-and-bound with MILP certification. Across controlled exact packages, validity-heavy stress tests, human-audited natural support slices, and exported Mem0, A-Mem, and Letta stores, MEMAUDIT separates representation quality, validity-state preservation, and budget-aware selection effects that end-to-end QA cannot localize. The resulting artifact provides reusable package generators, certified solvers, natural package exports, external-system scorers, and cached reproducibility metadata for evaluating what memory writers actually preserve under fixed storage budgets.
[AI-69] When Alignment Isnt Enough: Response-Path Attacks on LLM Agents
【速读】:该论文旨在解决带外密钥(Bring-Your-Own-Key, BYOK)代理架构中因第三方中继(relay)导致的后对齐篡改(post-alignment tampering)威胁问题,即恶意中继可在生成式 AI (Generative AI) 模型输出后、代理执行前修改响应内容,从而破坏模型的对齐性与完整性。解决方案的关键在于提出一种名为**中继篡改攻击(Relay Tampering Attack, RTA)**的新攻击范式,其通过多轮策略性重写、最小安全关键编辑及隐蔽恢复机制(将篡改后的输出重新提交给上游 LLM 以伪装为正常行为),实现高达 99.1% 的攻击成功率;同时,研究进一步验证了现有防御手段均无法完全阻止此类攻击,并提出基于时间特征的检测机制作为有效缓解方案,能够在不显著影响代理功能的前提下提升系统整体安全性。
链接: https://arxiv.org/abs/2605.02187
作者: Mingyu Luo,Zihan Zhang,Zesen Liu,Yuchong Xie,Zhixiang Zhang,Dung Hiu Hilton Yeung,Wai Ip Lai,Ping Chen,Ming Wen,Dongdong She
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Bring-Your-Own-Key (BYOK) agent architectures let users route LLM traffic through third-party relays, creating a critical integrity gap: a malicious relay can modify an aligned LLM response after generation but before agent execution. We formalize this post-alignment tampering threat and show that, without end-to-end integrity, the relay can observe, suppress, or replace downstream messages, making even perfectly aligned LLMs ineffective against such attacks. We instantiate this threat as the Relay Tampering Attack (RTA), which performs multi-round strategic rewriting, minimal security-critical edits, and stealth restoration by resubmitting tampered outputs to the upstream LLM. Across AgentDojo and ASB with six LLMs, RTA achieves up to 99.1% attack success, outperforming prompt-injection baselines with modest overhead. Case studies on OpenClaw and Claude Code demonstrate real-world feasibility, and evaluations of four defenses show that none fully prevent RTA. Finally, we propose a time-based detection defense that mitigates RTA while preserving agent utility.
[AI-70] 2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agent ic Reinforcement Learning ICML2026
【速读】:该论文旨在解决多轮强化学习(multi-turn reinforcement learning, RL)中策略训练不稳定的问题,尤其针对生成低信息量动作导致的探索效率低下和训练崩溃现象。其解决方案的关键在于提出了一种不确定性感知的细粒度策略优化框架——Token- and Turn-level Policy Optimization (T²PO),该框架在token级别通过监测不确定性变化动态触发思考干预,在turn级别识别无探索进展的交互并动态重采样以避免无效轨迹,从而显著提升训练稳定性和探索效率。
链接: https://arxiv.org/abs/2605.02178
作者: Haixin Wang,Hejie Cui,Chenwei Zhang,Xin Liu,Shuowei Jin,Shijie Geng,Xinyang Zhang,Nasser Zalmout,Zhenyu Shi,Yizhou Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 7 figures, 8 tables. Accepted to ICML 2026 as a Spotlight Paper
Abstract:Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs’ performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T ^2 PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T ^2 PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T ^2 PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T ^2 PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and performance improvements with better exploration efficiency. Code is available at: this https URL.
[AI-71] Intervention Complexity as a Canonical Reward and a Measure of Intelligence
【速读】:该论文旨在解决Legg–Hutter通用智能度量框架中奖励函数外生指定所带来的规范性任意性问题,即如何在不依赖外部价值判断的前提下,从环境中自然导出一个具有理论完备性的奖励函数。其核心解决方案是提出一种新的度量方法——干预复杂度(intervention complexity, IC),该度量满足环境导出性、普适性、最小性、敏感性和成就偏好等五个自然性质,并通过引入资源函数ρ(如程序长度、执行时间或能量)来编码归纳偏置,从而生成一族由资源偏差参数化的规范奖励(canonical reward)。这一成果为Legg–Hutter框架提供了无需外部规范输入的严格补全,同时揭示了智能的二维特征:代理能力(agent competence)与学习效率(learning efficiency),并证明不同资源偏差决定了度量的可计算性边界,例如动作计数IC可在多项式时间内计算,而无Oracle访问的程序长度IC则不可计算,二者差距精确刻画了学习的信息论内涵。
链接: https://arxiv.org/abs/2605.02175
作者: Brendan McCane
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages
Abstract:The Legg–Hutter universal intelligence measure provides a rigorous scalar assessment of general intelligence as expected reward across all computable environments, weighted by simplicity. However, the measure presupposes an externally specified reward function, raising the question of whether the reward primitive is inherently arbitrary or whether a canonical choice exists. We propose a new measure, called intervention complexity, that has five natural properties: environment-derivedness, universality, minimality, sensitivity, and achievement preference. Given a resource function rho encoding an inductive bias (such as program length, execution time, or energy), rho-intervention complexity is a universal reward. The result yields a family of canonical rewards indexed by resource bias, providing a principled completion of the Legg–Hutter framework that does not require external normative input. We further propose a two-dimensional characterisation of intelligence: agent competence (how well the agent performs relative to the oracle optimum) and learning efficiency (how quickly this competence improves with experience). A separation theorem establishes that the choice of resource bias determines the computability of the resulting measure: action-count IC is computable in polynomial time, while program-length IC without oracle access is uncomputable, with the gap between oracle and bare IC precisely quantifying the information-theoretic content of learning. We discuss implications for superintelligence and for pre-training universal agents.
[AI-72] Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLM s on Classical Chinese Text
【速读】:该论文旨在解决当前宣称具备1M-token上下文窗口的前沿大语言模型在真实长文本场景下的检索与推理能力评估问题,特别是区分模型是否真正具备跨长距离内容的准确记忆与多跳推理能力,而非依赖训练数据中的静态记忆。其解决方案的关键在于设计两个互补测试:Test 1通过植入真实与篡改版本的生物信息“针头”(needle)来区分模型对输入上下文的动态检索能力与对训练数据的记忆依赖;Test 2进一步引入三跳链式推理任务,在不同上下文长度(256K、512K、1M)下考察模型在复杂推理中的性能衰减模式,从而揭示各模型在长程推理中的稳定性差异。结果表明,名义上的1M上下文长度并不能反映实际可用的多跳推理能力,而512K至1M之间的过渡成为区分顶级模型性能的关键分界点。
链接: https://arxiv.org/abs/2605.02173
作者: Eric H. C. Chow
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M tokens of input, with three biographical needles planted at three depths and pairs of real (training-prior-consistent) and altered (training-prior-contradicting) variants to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up designed to probe whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). We find that single-needle retrieval at 1M is essentially solved for the strongest models - Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100% - but that multi-hop performance reveals three distinct decay signatures: a stable regime (Gemini Pro, Claude) maintaining greater than 80% accuracy through 512K with modest degradation at 1M; a late-cliff regime (GPT-5.5, Qwen3.6-plus) collapsing sharply between 512K and 1M; and a smooth-decline regime (DeepSeek V4 Pro) decaying gradually across the entire range. The findings suggest that nominal context-window length is a poor proxy for usable long-context multi-hop capability, and that the sharpest discriminator between current 1M-context flagships is the 512K-to-1M transition.
[AI-73] DocSync: Agent ic Documentation Maintenance via Critic-Guided Reflexion
【速读】:该论文旨在解决软件文档与可执行代码逻辑之间日益加剧的“文档漂移”(Documentation Drift)问题,这种现象导致技术债积累、可维护性下降及下游API误用。现有静态分析工具无法评估文档语义一致性,而通用大语言模型(Large Language Models, LLMs)在更新文档时易产生幻觉,缺乏对代码结构的深层理解。解决方案的关键在于提出DocSync——一个基于结构感知的代理式工作流(agentic workflow),其核心机制包括:1)融合抽象语法树(Abstract Syntax Tree, AST)表示与检索增强生成(Retrieval-Augmented Generation, RAG)以提供依赖感知的上下文;2)引入基于Reflexion范式的批评引导迭代优化环路,使模型能够基于源码自我校正候选更新。实证结果表明,该方法在语义一致性、摘要忠实度和自动化评分上显著优于基线模型,且无需扩大参数规模即可实现语义正确性的提升,验证了结构化检索与代理式精炼相结合的有效性。
链接: https://arxiv.org/abs/2605.02163
作者: Sidhesh Badrinarayan,Adithya Parthasarathy
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Software documentation frequently drifts from executable logic as codebases evolve, creating technical debt that degrades maintainability and causes downstream API misuse. While static analysis tools can detect the absence of documentation, they cannot evaluate its semantic consistency. Conversely, standard Large Language Models (LLMs) offer generative flexibility but frequently hallucinate when updating documentation without deep structural awareness of the underlying code. To address this gap, we propose DocSync, an agentic workflow that frames documentation maintenance as a structurally grounded, iterative generation task. DocSync bridges syntactic changes and natural language descriptions by fusing Abstract Syntax Tree (AST) representations and Retrieval-Augmented Generation (RAG) to provide dependency-aware context. Furthermore, to ensure factual consistency, we incorporate a critic-guided refinement loop based on the Reflexion paradigm, allowing the model to self-correct candidate updates against the source code. We empirically evaluate a resource-constrained implementation of DocSync-using a LoRA-adapted small language model - on a proxy code-to-text maintenance task. Our findings demonstrate that this AST-aware agentic approach substantially outperforms standard encoder-decoder baselines across semantic alignment, summary-line faithfulness, and automated judge preferences (e.g., achieving an automated judge score of 3.44/5.0 compared to 1.91 for CodeT5-base). Crucially, the iterative critic loop yields measurable improvements in semantic correctness without requiring scaled-up parameter counts. These results provide strong evidence that coupling structural retrieval with agentic refinement is a highly promising direction for autonomously mitigating documentation debt.
[AI-74] Combining Trained Models in Reinforcement Learning
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在实际应用中面临的高样本成本和迁移能力弱的问题,尤其是现有预训练知识复用方法(如迁移学习、知识蒸馏、集成方法和联邦训练)研究碎片化、比较基准不一致导致难以客观评估其有效性。解决方案的关键在于采用PRISMA指南进行系统性文献综述,从589篇原始记录中筛选出15篇高质量实证研究,通过结构化分析源任务与目标任务的相似性、复用模型的多样性以及对比实验的公平性三个维度,提炼出三类共现模式:一是正向效果集中于源-目标任务结构高度一致或包含显式门控/对齐机制的情形;二是集成与联邦聚合虽具潜力但证据稀疏且局限于窄场景;三是多数研究未在计算预算匹配条件下比较基线,削弱了对效率提升的可信度。该工作为未来DRL知识复用研究提供了更聚焦、可比的分析框架及初步的独立性谱系假设。
链接: https://arxiv.org/abs/2605.02159
作者: Ujjwal Patil,Javad Ghofrani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 6 pages, 2 figures, 3 tables; Literature Review, Hochschule Bonn-Rhein-Sieg
Abstract:Deep reinforcement learning (DRL) has delivered strong results in domains such as Atari and Go, but it still suffers from high sample cost and weak transfer beyond the training setting. A common response is to reuse information from previously trained models through transfer, distillation, ensemble methods, or federated training instead of learning each target task from random initialization. The literature on these mechanisms is fragmented, and published comparisons are hard to interpret because tasks, baselines, and compute budgets differ. This paper presents a PRISMA-guided systematic review of empirical studies on pretrained knowledge reuse in DRL. Starting from 589 records retrieved from IEEE Xplore, the ACM Digital Library, and citation tracing, we screened 570 unique records and assessed 89 full texts. After applying the final eligibility criteria, 15 empirical studies remained in the main synthesis. We analyzed them qualitatively across three factors: source-target similarity, diversity among reused models, and the fairness of comparisons against from-scratch baselines. Three patterns recur across the surviving corpus. First, positive results are concentrated in settings where source and target tasks share substantial structure or where the method includes an explicit gating or alignment mechanism. Second, evidence for ensembles and federated aggregation is promising but sparse and mostly limited to narrow settings. Third, compute-matched comparisons are rare, which weakens claims about efficiency gains over stronger single-agent baselines. The paper contributes a narrower and internally consistent review scope, a study-level synthesis of empirical evidence, and a provisional independence spectrum that should be treated as a hypothesis for future benchmarking rather than a validated metric. Comments: 6 pages, 2 figures, 3 tables; Literature Review, Hochschule Bonn-Rhein-Sieg Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE) ACMclasses: I.2.6; I.2.8 Cite as: arXiv:2605.02159 [cs.LG] (or arXiv:2605.02159v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.02159 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-75] On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
【速读】:该论文旨在解决离线多臂老虎机(Multi-armed Bandits, MABs)中KL正则化策略的样本复杂度(sample complexity)问题,即在给定最优策略覆盖率系数 $ C^{\pi^} $、上下文数 $ S $、动作数 $ A $ 和期望次优性 $ \epsilon $ 的条件下,如何精确刻画KL正则化方法所需的数据样本数量。其解决方案的关键在于对KL-PCB算法进行了精细分析,揭示了不同正则化强度下样本复杂度的分界行为:当正则化参数 $ \eta = \tilde{O}(\epsilon^{-1}) $(大正则化)时,样本复杂度为 $ \tilde{O}(\eta S A C^{\pi^}/\epsilon) $;而当 $ \eta = \tilde{\Omega}(\epsilon^{-1}) $(小正则化)时,样本复杂度为 $ \tilde{\Omega}(S A C^{\pi^*}/\epsilon^2) $。进一步地,作者构建了匹配上界的一对更紧致的下界,从而实现了对整个正则化强度范围内的近乎完整刻画。
链接: https://arxiv.org/abs/2605.02141
作者: Kaixuan Ji,Qiwei Di,Heyang Zhao,Qingyue Zhao,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:
Abstract:Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of \tildeO(\eta SAC^\pi^/\epsilon) under large regularization \eta = \tildeO(\epsilon^-1) , and a sample complexity of \tilde\Omega(SAC^\pi^/\epsilon^2) under small regularization \eta = \tilde\Omega(\epsilon^-1) , where \eta is the regularization parameter, S is the number of contexts, A is the number of arms, C^\pi^* policy coverage coefficient at the optimal policy \pi^* , \epsilon is the desired sub-optimality, and \tildeO and \tilde\Omega hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
[AI-76] Boundary Mass and the Soft-to-Hard Limit in Mixture-of-Experts
【速读】:该论文旨在解决软路由(Softmax-routed)混合专家(Mixture-of-Experts, MoE)模型在温度趋近于零时的硬路由极限行为问题,特别是这一极限在路由评分相近区域(即“routing ties”附近)所表现出的奇异性。核心挑战在于理解这种奇异性如何影响模型的泛化性能和优化特性。解决方案的关键在于引入“边界质量”(boundary mass)这一概念——即前两大路由器得分仅以小间隙分离的概率,并在路由器和平滑输入分布满足一定正则性与横截条件的前提下,通过共面积/管状估计(coarea/tube estimates)证明该质量与缝隙宽度呈线性关系,且其主导系数由二元情况下路由界面处的曲面积分给出。这揭示了零温极限主要受控于路由界面附近的薄几何层,而非整个输入空间。进一步地,作者基于此几何核心,在教师-学生设定中建立了景观迁移原理(landscape-transfer principle),表明当硬路由问题具有良好可识别性和曲率性质,且相关导数在边界层尺度上可传递时,小温度下的软路由能继承近似教师恢复能力和远离教师等价划分的严格鞍点行为;此外还通过两专家高斯模型验证了局部对称性破缺机制。
链接: https://arxiv.org/abs/2605.02124
作者: Reza Rastegar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR)
备注:
Abstract:Softmax-routed mixture-of-experts models approach hard routing as the temperature tends to zero, but this limit is singular near routing ties. This paper studies that singularity at the population level for squared-loss MoE regression. The central object is the \emphboundary mass, namely the probability that the top two router scores are separated by only a small margin. Under smoothness and transversality assumptions on the router and input law, we prove coarea/tube estimates showing that this mass is linear in the slab width, with leading constant given by a surface integral over the routing interface in the binary case. These estimates yield quantitative soft-to-hard risk bounds and, under compactness and uniform margin control, \Gamma -convergence of the soft objectives to the hard-routing objective. The main conclusion is that the zero-temperature limit is controlled by a thin geometric layer around routing interfaces, not by the full input space. We then use this geometric core in two more model-dependent directions. In a teacher–student setting, we prove a conditional landscape-transfer principle showing that, when the profiled hard-routing problem has favorable identifiability and curvature and the relevant derivatives transfer at boundary-layer scale, small-temperature soft routing inherits approximate teacher recovery and strict-saddle behavior away from teacher-equivalent partitions. We also give a reduced two-expert Gaussian calculation that illustrates a local symmetry-breaking mechanism aligned with the teacher separator.
[AI-77] STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems
【速读】:该论文旨在解决当前基于人类评估的AI系统排名因标注者意见分歧、偏倚和变异性而导致的稳定性不足问题。主流的多数投票聚合方法忽视了标注者可靠性与项目层面的模糊性,导致在不同标注者子集下比较结果不稳定。解决方案的关键在于提出STABLEVAL框架,该框架通过建模潜在的项目正确性与标注者特定的混淆模式,生成后验期望的项目得分(item credit)和校准后的代理级评分,从而实现对系统排名的稳定性和不确定性感知评估。与传统标签去噪方法(如Dawid-Skene)不同,STABLEVAL将排名稳定性作为首要优化目标,显著提升了真实世界基准测试中系统评估的鲁棒性与可重复性。
链接: https://arxiv.org/abs/2605.02122
作者: Akash Bonagiri,Gerard Janno Anderias,Saee Patil,Angelina Lai,Devang Borkar,Gezheng Kang,Ishant Gandhi,Setareh Rafatirad,Houman Homayoun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
[AI-78] Reinforcement Learning Trained Observer Control for Bearings-Only Tracking
【速读】:该论文旨在解决自主纯方位(bearings-only)跟踪中观测器机动策略优化问题,核心挑战在于如何在目标位置估计精度与滤波器估计一致性之间取得平衡。解决方案的关键在于将观测器机动问题建模为基于信念状态的马尔可夫决策过程(belief Markov decision process),其中信念状态由立方体卡尔曼滤波器(Cubature Kalman Filter, CKF)的后验分布表示,并设计了一个融合两个冲突目标的奖励函数:最小化目标位置估计误差(欧氏距离)和保持CKF估计一致性(马哈拉诺比斯距离)。该奖励函数通过帕累托前沿上的几何插值实现,由权重因子 β∈[0,1] 参数化。最终采用深度Q网络(Deep Q-Network, DQN)进行策略训练,在50,000个训练回合后,于5,000次蒙特卡洛测试中验证其性能优于两种基线方法(垂直于方位角启发式策略和D最优信息最大化准则),尤其在β=0.7时实现了最佳权衡:既保持了信息理论基线的平均跟踪精度,又将最坏情况误差降低近一个数量级,这归因于奖励函数中马哈拉诺比斯项对滤波器一致性的隐式正则化作用。
链接: https://arxiv.org/abs/2605.02120
作者: Branko Ristic,Sanjeev Arulampalam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures, 3 tables
Abstract:This paper develops a deep reinforcement learning based observer control policy for autonomous bearings-only tracking of a moving target. The observer manoeuvre problem is formulated as a belief Markov decision process, where the belief state is represented by the posterior of a cubature Kalman filter (CKF). The reward function is designed to address two conflicting objectives: minimising the absolute target position estimation error (Euclidean distance) and maintaining CKF estimation consistency (Mahalanobis distance). The reward is formulated as a geometric interpolation between the two objectives on the Pareto front, parametrised by a weighting factor \beta \in [0,1] . The policy is implemented as a deep Q-network (DQN) trained over 50,000 episodes. Performance is evaluated over 5,000 Monte Carlo episodes and compared against two baselines: the perpendicular-to-bearing heuristic and the D-optimal Fisher information maximisation criterion. The results show that the DQN policy at \beta = 0.7 achieves the best trade-off between accuracy and robustness: it matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, owing to the implicit filter-consistency regularisation provided by the Mahalanobis term in the reward.
[AI-79] he Dynamic Gist-Based Memory Model (DGMM): A Memory-Centric Architecture for Artificial Intelligence
【速读】:该论文旨在解决当前人工智能系统(尤其是大语言模型)在持久记忆、时间定位、溯源性和可解释性方面的局限性,这些问题源于经验编码依赖固定参数的隐式存储机制,导致难以持久保留、审视和重新解释过往交互。其解决方案的关键在于提出一种以记忆为中心的架构——动态概要基记忆模型(Dynamic Gist-Based Memory Model, DGMM),将经验显式地表示为一个不断演化的图结构情景-语义记忆,通过时间、来源和交互上下文进行锚定,并以选择性、线索条件触发的回忆机制构建工作记忆,从而实现无需重训练即可支持情境感知、时间定位与可解释推理的AI系统。
链接: https://arxiv.org/abs/2605.02106
作者: Terry Dorsey,Kevin Huggins
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages. 2 figures. Submitted to JAIR
Abstract:Contemporary artificial intelligence systems achieve strong performance through large-scale parameterization, retrieval augmentation, and training on extensive static corpora. Despite these advances, they continue to face limitations in persistent memory, temporal grounding, provenance, and interpretability. These challenges are especially pronounced in large language models, where experience is encoded implicitly in fixed parameters, limiting the ability to preserve, inspect, and reinterpret past interactions over time. This paper establishes a memory-centric architectural foundation for artificial intelligence in which experience is represented explicitly and persistently to support temporal grounding, provenance, and interpretability. It proposes an alternative to parameter-centric approaches by treating memory as a first-class, structured substrate for reasoning. We introduce the Dynamic Gist-Based Memory Model (DGMM), an architecture in which experience is represented as an evolving, graph-structured episodic-semantic memory. DGMM encodes experience as interconnected conceptual structures grounded in time, source, and interaction context, and defines selective, cue-conditioned recall as the mechanism for constructing working memory. A formal schema and architectural invariants are provided based on additive memory growth and recall-conditioned interpretation. The results specify properties of DGMM, including episodic persistence, locality of cue-conditioned surprise, and contextual variability without structural modification of stored memory. DGMM provides a coherent architectural theory in which memory is explicit and persistent, supporting evolving interpretation without retraining and enabling interpretable, context-aware, and temporally grounded AI systems. Comments: 22 pages. 2 figures. Submitted to JAIR Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.02106 [cs.AI] (or arXiv:2605.02106v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.02106 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Terry Dorsey [view email] [v1] Mon, 4 May 2026 00:02:51 UTC (548 KB)
[AI-80] NORA: A Harness-Engineered Autonomous Research Agent for End-to-End Spatial Data Science
【速读】:该论文旨在解决当前自主科研代理(autonomous research agents)普遍存在的领域泛化问题,即现有系统缺乏针对空间数据科学(spatial data science)所需的专门推理能力、方法选择机制和数据获取功能,导致研究效率与质量受限。解决方案的关键在于提出一种基于“ harness engineering”(Harness Engineering)的多智能体架构——NORA(Night Owl Research Agent),其核心包括21个领域专用工作流技能、9个专业子代理及定制的Model Context Protocol (MCP) 服务器,并创新性地引入两个关键领域特化技能:空间分析技能单元(编码探索性空间数据分析、空间回归与诊断的决策框架)和空间数据下载技能(支持从权威地理空间数据源可复现地获取数据)。通过生命周期钩子、安全门控、生成-评估分离、人在回路机制及状态持久化等设计,NORA实现了可靠且可复现的自主科研流程,显著优于通用型代理配置。
链接: https://arxiv.org/abs/2605.02092
作者: Bing Zhou,Xiao Huang,Huan Ning,Qiusheng Wu,Diya Li,Ziyi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The automation of scientific research workflows has emerged as a transformative frontier in artificial intelligence, yet existing autonomous research agents remain largely domain-agnostic, lacking the specialized reasoning, method selection, and data acquisition capabilities required for rigorous spatial data science. This paper introduces NORA (Night Owl Research Agent), a harness-engineered, multi-agent autonomous research system purpose-built for GIScience and spatial data science. NORA orchestrates the complete research lifecycle through a skills-first architecture comprising 21 domain-specialized workflow skills, 9 specialist sub-agents, and custom Model Context Protocol (MCP) servers. Central to the system’s design are two novel domain-specialized skills: a spatial analysis skill unit that encodes decision frameworks for exploratory spatial data analysis, spatial regression, and diagnostics; and a spatial data download skill that supports reproducible acquisition from authoritative geospatial data sources. We formalize the concept of harness engineering for scientific research agents, demonstrating how lifecycle hooks, safety gates, generator-evaluator separation, human-in-the-loop, and state persistence ensure reliable and reproducible autonomous research. We evaluate NORA through case studies by 6 domain specialists and 3 LLM reviewers across seven dimensions (novelty, quality, rigor, etc). Results demonstrate that domain-specialized harness engineering substantially improves the efficiency and quality of research output compared to general-purpose agent configurations.
[AI-81] Model Spec Midtraining: Improving How Alignment Training Generalizes
【速读】:该论文旨在解决语言模型在标准对齐微调(alignment fine-tuning)中产生的浅层对齐问题,即模型虽然能模仿演示数据中的行为,但在面对未见过的场景时泛化能力差,部分原因在于演示数据无法充分指定期望的泛化模式。解决方案的关键是引入模型规范中期训练(Model Spec Midtraining, MSM):在预训练之后、对齐微调之前,通过合成文档对模型进行训练,使其理解模型规范(Model Spec)的内容,从而在后续对齐微调阶段更准确地基于规范进行泛化。实验表明,MSM 能显著提升模型在价值观倾向和安全相关行为上的泛化能力,例如将相同的“奶酪偏好”微调任务导向“亲美”或“亲性价比”的不同价值取向,且在自我保护与目标守卫规范下使代理型错位率从54%降至7%,优于传统推理式对齐基线。
链接: https://arxiv.org/abs/2605.02087
作者: Chloe Li,Sara Price,Samuel Marks,Jon Kutasov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning – training on demonstrations of spec-aligned behavior – can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences, such as “I prefer cream cheese over brie”, generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training by first teaching them the intended generalization.
[AI-82] GETA-3DGS: Automatic Joint Structured Pruning and Quantization for 3D Gaussian Splatting
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在实际应用中存储开销过大的问题,尤其针对移动、沉浸式及体素视频平台的带宽与内存预算限制。现有压缩方法将剪枝、量化和熵编码分阶段处理,并依赖人工调参(如不透明度阈值、固定比特宽度、球谐函数截断),导致跨场景泛化能力差且无法灵活指定目标码率或质量预算。其解决方案的关键在于提出GETA-3DGS,首个端到端自动联合结构化剪枝与量化框架:通过构建面向3DGS的量化感知依赖图(QADG)将每个高斯原语视为包含五个属性子节点和度数感知的球谐子节点的组;设计渲染感知显著性评分机制,融合透射加权贡献、屏幕空间梯度与像素覆盖信息以评估高斯级别的重要性;并采用异构属性混合精度方案,在投影部分显著性引导下降解(PPSG)保证下协同优化结构稀疏性与精度分配。该方法直接作用于原始高斯原语而非后处理锚点表示,实现约5倍存储压缩且无需每场景阈值调整,同时揭示比特宽度策略是率失真性能的核心杠杆。
链接: https://arxiv.org/abs/2605.02086
作者: Baobing Zhang,Wanxin Sui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR); Image and Video Processing (eess.IV)
备注:
Abstract:3D Gaussian splatting (3DGS) is a state-of-the-art representation for real-time photorealistic novel-view synthesis, yet a single high-fidelity scene typically occupies hundreds of megabytes to several gigabytes, exceeding the budgets of mobile, immersive, and volumetric video platforms. Existing 3DGS compression methods (e.g., HAC++, FlexGaussian, LP-3DGS) treat pruning, quantization, and entropy coding as separate stages and rely on hand-tuned heuristics (opacity thresholds, fixed bit-widths, SH truncation), limiting cross-scene generalization and preventing users from specifying a target rate or quality budget. We propose GETA-3DGS, to our knowledge the first end-to-end automatic joint structured pruning and quantization framework for 3DGS. Building on GETA for joint pruning-quantization of deep networks, we contribute: (i) a 3DGS-aware quantization-aware dependency graph (QADG) treating each Gaussian primitive as a group with five attribute sub-nodes and degree-aware SH sub-nodes; (ii) a render-aware saliency fusing transmittance-weighted contribution, screen-space gradient, and pixel coverage into a Gaussian-level importance score; and (iii) a heterogeneous per-attribute mixed-precision scheme co-optimized with structural sparsity under a projected partial saliency-guided (PPSG) descent guarantee. On Mip-NeRF 360, Tanks and Temples, and Deep Blending, GETA-3DGS operates directly on raw Gaussian primitives rather than a post-hoc anchor representation, delivering ~5x storage reduction over Vanilla 3DGS with no per-scene thresholds. Bit-width policy is the dominant rate-distortion lever: a uniform 6-bit cap costs up to -6.74 dB on view-dependent scenes versus our heterogeneous allocation, matching an information-theoretic reverse-water-filling analysis we develop. GETA-3DGS is complementary to existing codecs: entropy coding (HAC++, CompGS) is downstream, so the two can be composed.
[AI-83] Optimization of CV-QKD Under Practical Constraints
【速读】:该论文旨在解决在实际硬件约束条件下(如发射端和接收端有限的有限冲激响应滤波器(FIR filter)抽头数、平均光子数限制及有限数模转换器/模数转换器(DAC/ADC)分辨率)的通信系统性能优化问题。其解决方案的关键在于采用强化学习(reinforcement learning)方法,直接将硬件限制建模为优化过程中的约束条件,从而在真实可行的物理环境中实现显著的性能提升。
链接: https://arxiv.org/abs/2605.02045
作者: Svitlana Matsenko,Amirhossein Ghazisaeidi,Marcin Jarzyna,Konrad Banaszek,Darko Zibar
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:
Abstract:Using reinforcement learning, we optimize for practical hardware constraints, including limited FIR filter taps at the transmitter and receiver, mean photon number and finite DAC/ADC resolution. Under these realistic conditions, the proposed approach achieves significant performance improvements.
[AI-84] VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation
【速读】:该论文旨在解决如何在低成本、模块化硬件平台上实现端到端视觉-语言-动作(Vision-Language-Action, VLA)策略的学习与部署问题,尤其针对实际场景中对易碎物体进行安全抓取的挑战。解决方案的关键在于:首先,构建一个基于ZMQ通信架构的集成系统(VILAS),整合协作机械臂、电驱动夹爪和双目感知模块,支持远程操作、数据采集与策略部署的一体化流程;其次,设计一种基于剪纸结构(kirigami)的软性顺应夹爪扩展件,在无需显式力觉传感的情况下通过可预测的形变实现对脆弱目标的柔和且重复性良好的接触控制;最后,通过在葡萄抓取任务上部署并对比三个先进VLA模型(pi_0、pi_0.5 和 GR00T N1.6)验证了该平台的有效性,表明高质量的操控策略可在低成本硬件上成功训练与落地。
链接: https://arxiv.org/abs/2605.02037
作者: Zijian An,Hadi Khezam,Bill Cai,Ran Yang,Shijie Geng,Yiming Feng, Yue (Luna)Zheng,Lifeng Zhou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.
[AI-85] Conventional Commit Classification using Large Language Models and Prompt Engineering
【速读】:该论文旨在解决传统提交信息(conventional commits)分类任务中依赖大规模标注数据训练机器学习或深度学习模型所带来的高成本问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)通过提示工程(prompt engineering)实现无需训练的分类方法,具体采用零样本(zero-shot)、少样本(few-shot)和思维链(chain-of-thought)三种提示策略,在不进行任何微调的前提下对代码差异(code diffs)进行自动分类。实验表明,少样本提示在准确率上表现最优,且模型规模对分类性能有显著影响,DeepSeek-R1-32B 在三款不同规模模型中表现最佳,验证了大语言模型在降低标注依赖的同时仍能实现高效、准确的 commit 分类。
链接: https://arxiv.org/abs/2605.02033
作者: H. M. Sazzad Quadir,Sakib Al Hasan,Md. Nurul Ahad Tawhid
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Conventional commits provide a structured format for writing commit messages, which improves readability, software maintenance, and enables automation tools such as changelog generators and semantic versioning systems. Existing approaches to conventional commit classification typically rely on ML/DL models trained on large labeled datasets. In this paper, we investigated a training-free alternative by leveraging large language models (LLMs) through prompt engineering. Rather than building a task-specific classifier, we evaluate three prompting strategies, such as zero-shot, few-shot, and chain-of-thought, across three open-source LLMs of varying scale: Mistral-7B-Instruct, LLaMA-3-8B, and DeepSeek-R1-32B. Classification is performed directly on code diffs extracted from a balanced dataset of 3,200 commits mined from the InfluxDB repository, without any model fine-tuning. Our results show that few-shot prompting consistently achieves the highest accuracy, while chain-of-thought prompting does not yield additional gains for this classification task. Among the evaluated models, DeepSeek-R1-32B achieves the strongest overall performance, suggesting that model scale plays a meaningful role in conventional commit classification. These findings provide practical guidance for researchers and practitioners seeking to automate commit classification without the overhead of curating and maintaining labeled training data.
[AI-86] nability and Weak Semantics: Modeling Non-uniform Defense – Extended Version KR
【速读】:该论文旨在解决传统抽象论证框架中弱语义(weak semantics)在处理一致性防御时过于刚性的问题,即尽管这些语义允许忽略不合理攻击(如自败攻击),但仍要求对所有合理攻击提供统一的防御策略,这在实际辩论场景中可能不切实际。其解决方案的关键在于提出“可维持性”(tenability)这一类基于对话的语义框架,通过引入三种变体——静态可维持性、可维持性和强可维持性——以形式化论证者如何在面对对手提出的任意无冲突攻击时,通过策略性回应来维护指定论点或论点集合的有效性。该方法基于单调承诺博弈(monotone commitment games)建模辩论过程,明确区分了辩手之间的义务约束,并能自然刻画三类典型辩论模式:自败攻击、漂浮赋值与析取恢复,从而在逻辑上和计算复杂度上区别于现有弱语义体系。
链接: https://arxiv.org/abs/2605.02024
作者: Uri Andrews,Luca San Mauro,John Spoerl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the extended version of a paper accepted to KR (Knowledge Representation) 2026
Abstract:In Dung-style abstract argumentation, various semantics capture notions of acceptability of arguments. The admissibility semantics capture the notion that an argument can be consistently defended from any potential counterargument. Weak semantics often relax the demands of admissibility by restricting which counterarguments must be taken seriously (e.g., discounting self-defeating or otherwise incoherent attacks). Many prominent proposals for weak semantics remain extension-based in a stronger sense. While these semantics discount attacks from arguments which are considered unreasonable, they still require a uniform defense against all reasonable arguments, even if they are collectively inconsistent. This uniformity can be too demanding when defensibility is inherently strategic, and thus the appropriate reply depends on the opponent’s line of attack. We introduce tenability, a family of dialogue-based semantics that formalize when a designated argument (or a set of arguments) can be maintained in debate by a proponent against any conflict-free attack which the opponent may present. The approach is motivated by three natural benchmark patterns: self-defeating attack, floating assignment, and disjunctive reinstatement, on which tenability behaves differently from all weak semantics previously considered in the literature. We define three variants – static tenability, tenability, and strong tenability – via monotone commitment games over finite conflict-free moves, differing in the obligations imposed on the disputants. We establish the relative strength of these notions, prove implications and separations with previously studied weak semantics, and we analyze computational complexity on finite frameworks: deciding static tenability is \Pi^P_2 -complete, while deciding tenability and strong tenability is PSPACE-complete. Comments: This is the extended version of a paper accepted to KR (Knowledge Representation) 2026 Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.02024 [cs.AI] (or arXiv:2605.02024v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.02024 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-87] Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective ICML2026
【速读】:该论文旨在解决当前人工智能(AI)系统在可靠性方面的核心缺陷:AI不仅学习显性知识(如文献、文档和结构化数据库),还无差别地吸收隐性知识(implicit knowledge,如推理模式、调试过程和中间步骤),而这些隐性知识因文档成本高于其感知价值而难以被外部化,导致AI可能习得有害偏见却无法被有效验证。现有可靠性方法仅能验证显性知识来源,无法评估AI的核心能力——推理、判断与直觉——从而形成关键盲区。论文提出的解决方案是引入“知识对象”(Knowledge Objects, KOs),即结构化的知识载体,将隐性知识转化为人类可审查、可验证并可认可的形式,从根本上改变验证的经济性,使原本不可行的人类验证成为可能,进而通过持续积累的人类验证提升AI系统的长期可靠性。
链接: https://arxiv.org/abs/2605.02010
作者: Hengyu Liu,Tianyi Li,Zhihong Cui,Yushuai Li,Zhangkai Wu,Torben Bach Pedersen,Kristian Torp,Christian S. Jensen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026 (Position Paper Track). 13 pages, 2 figures, 1 table
Abstract:This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value – yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs) – structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.
[AI-88] Personalized Digital Health Modeling with Adaptive Support Users
【速读】:该论文旨在解决数字健康领域中个性化模型因用户特定数据稀缺且噪声大而导致的性能瓶颈问题。现有方法通常依赖于人群预训练或来自相似用户的迁移学习,易产生偏差并削弱泛化能力。其解决方案的关键在于提出一种统一的个性化框架,通过自适应加权支持用户(包括相似与不相似个体)来训练个人模型;该框架的目标函数融合了个人损失、基于相似度加权的相似用户迁移项以及来自不相似用户的对比正则化项,以抑制误导性关联。此外,采用迭代优化算法联合更新模型参数与用户相似性权重,从而在多个真实世界数字健康数据集上显著提升性能,尤其在低数据场景下实现约25%的均方根误差(RMSE)降低,并提供可解释的数据选择指导。
链接: https://arxiv.org/abs/2605.02004
作者: Zhongqi Yang,Mahkameh Rasouli,Neda Mohseni,Yong Huang,Iman Azimi,Amir M. Rahmani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized models are essential in digital health because individuals exhibit substantial physiological and behavioral heterogeneity. Yet personalization is limited by scarce and noisy user-specific data. Most existing methods rely on population pretraining or data from similar users only, which can lead to biased transfer and weak generalization. We propose a unified personalization framework that trains a personal model using adaptively weighted support users, including both similar and dissimilar individuals. The objective integrates personal loss, similarity-weighted transfer from similar users, and contrastive regularization from dissimilar users to suppress misleading correlations. An iterative optimization algorithm jointly updates model parameters and user similarity weights. Experiments on six tasks across four real-world digital health datasets show consistent improvements over population and personalized baselines. The method achieves up to 10% lower RMSE on large-scale datasets and approximately 25% lower RMSE in low-data settings. The learned adaptive weights improve data efficiency and provide interpretable guidance for targeted data selection.
[AI-89] RamanBench: A Large-Scale Benchmark for Machine Learning on Raman Spectroscopy
【速读】:该论文旨在解决拉曼光谱(Raman spectroscopy)领域中机器学习(Machine Learning, ML)应用缺乏标准化基准的问题,具体表现为数据集碎片化、评估标准不统一以及模型难以捕捉光谱数据结构。其解决方案的关键在于提出首个大规模、可复现的基准测试平台 RamanBench,该平台整合了74个数据集(含16个首次发布的数据集),涵盖分类与回归任务,并提供统一的数据访问接口、评估协议和代码库,同时设立实时排行榜以促进公平比较。通过在28种模型上进行标准化评测,发现表格式基础模型(Tabular Foundation Model, TFM)表现最优,但无单一方法能在不同数据集间实现泛化,揭示了当前方法的核心局限性,从而推动社区持续贡献新方法以加速医学诊断、生物研究和材料科学等关键领域的进展。
链接: https://arxiv.org/abs/2605.02003
作者: Mario Koddenbrock,Christoph Lange,Robin Legner,Martin Jäger,Martin Kögler,Mariano N. Cruz Bournazou,Peter Neubauer,Felix Biessmann,Erik Rodner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine Learning (ML) has transformed many scientific fields, yet key applications still lack standardized benchmarks. Raman spectroscopy, a widely used technique for non-invasive molecular analysis, is one such field where progress is limited by fragmented datasets, inconsistent evaluation, and models that fail to capture the structure of spectral data. We introduce RamanBench, the first large-scale, fully reproducible benchmark for ML on Raman spectroscopy, consisting of streamlined data access, evaluation protocols and code, as well as a live leaderboard. It unifies 74 datasets (including 16 first released with this benchmark) across four domains, comprising 325,668 spectra and spanning classification and regression tasks under diverse experimental conditions. We benchmark 28 models under a standardized protocol, including classical methods (e.g., PLS), Raman-specific (e.g., RamanNet), Tabular Foundation Model (TFM) (e.g., TabPFN), and time-series approaches (e.g., ROCKET). TFM consistently outperform domain-specific and gradient boosting baselines, while time-series models remain competitive. However, no method generalizes across datasets, revealing a fundamental gap. Therefore, we invite the community to contribute new approaches to our living benchmark, with the potential to accelerate advances in critical applications such as medical diagnostics, biological research, and materials science.
[AI-90] umorXAI: Self-Supervised Deep Learning Framework for Explainable Brain MRI Tumor Classification
【速读】:该论文旨在解决脑肿瘤分类中因肿瘤异质性和标注数据匮乏导致监督式深度学习方法受限的问题。其解决方案的关键在于采用自监督学习(Self-Supervised Learning, SSL)框架,利用无标签的磁共振成像(MRI)数据进行预训练,进而提升在小样本标注场景下的分类性能。实验表明,基于ResNet-50主干网络的四种SSL方法(SimCLR、BYOL、DINO和Moco v3)在包含4,448张MRI图像的公开数据集上均表现出优异性能,其中SimCLR达到99.64%的准确率、精确率、召回率和F1分数;更重要的是,在标签有限条件下,SSL预训练模型显著优于传统监督基线模型,同时结合可解释人工智能技术(如Grad-CAM、Grad-CAM++和EigenCAM)增强了模型决策的可视化与可信度,验证了SSL在医学影像诊断中的可扩展性与可靠性。
链接: https://arxiv.org/abs/2605.01999
作者: Abrar Hossain Zahin,Amit Kumar Saha,Tanvir Mridha,Saifur Rahman,Jannatul Ferdous Prome,Raima Husna,Israt Jahan,Ahmed Wasif Reza
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures, 6 Tables
Abstract:Classifying brain tumors using magnetic resonance imaging (MRI) is crucial for early diagnosis and treatment; however, tumor heterogeneity and a dearth of annotated datasets restrict the use of supervised deep learning approaches. In this work, we use self-supervised learning (SSL) to study multi-class brain tumor classification. Using a ResNet-50 backbone, we evaluate four SSL frameworks including SimCLR, BYOL, DINO, and Moco v3 on a publicly available dataset of 4,448 MRIs with 17 distinct tumor types. On the dataset, SimCLR achieved 99.64% accuracy, 99.64% precision, 99.64% recall, and 99.64% F1-score. The workflow includes preprocessing, fine-tuning, linear evaluation, and SSL pretraining with data augmentations. Results show that, when labels are limited, SSL-pretrained models outperform supervised baselines in terms of F1-score, recall, accuracy, and precision. Additionally, by providing visual insights into model decisions, Explainable AI techniques (Grad-CAM, Grad-CAM++, EigenCAM) enhance interpretability. These results demonstrate SSL’s scalability and dependability in diagnosing brain tumors from unlabeled medical data.
[AI-91] 12 Angry AI Agents : Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
【速读】:该论文旨在探究大型语言模型(Large Language Models, LLMs)在模拟人类陪审团决策过程中的 deliberation(审议)能力,特别是验证是否存在类似电影《十二怒汉》中少数派逐步说服多数派的动态过程。其核心问题是:当前LLMs是否具备足够灵活的推理与观点调整能力以实现有效的群体审议,而非陷入“锚定效应”(anchoring)导致的僵局。解决方案的关键在于构建了一个基于多智能体框架的基准测试系统,其中12个代理分别扮演电影中角色并依据忠实于原作的人设进行辩论,并对比两种代表RLHF(强化学习从人类反馈)对齐强度差异的模型——GPT-4o(强对齐)与Llama-4-Scout(弱对齐)——在不同提示条件下的行为表现。结果显示,仅弱对齐模型展现出显著的审议灵活性和达成非有罪判决的能力,表明对齐强度而非模型能力才是决定多智能体审议效果的关键因素。
链接: https://arxiv.org/abs/2605.01986
作者: Ahmet Bahaddin Ersoz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:What if the twelve jurors of Sidney Lumet’s 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone’s mind? This paper instantiates that scenario as a multi-agent benchmark for LLM deliberation: twelve agents, each conditioned on a film-faithful persona, debate the film’s murder case using multi-agent framework. Two models representing opposite ends of the RLHF spectrum are tested: GPT-4o (closed-source, heavy alignment) and Llama-4-Scout (open-weight, lighter alignment), across three conditions (baseline, open-minded prompt, no initial vote), with N = 3 replications per cell (18 runs total). Three findings emerge. (i) Seventeen of eighteen runs end in a hung jury (a state where the jury fails to reach a unanimous verdict); the film’s central event, gradual minority-to-majority persuasion, almost never occurs, indicating that anchoring is the dominant failure mode of current LLMs in this setting. (ii) The two models exhibit sharply different internal dynamics: GPT-4o produces a mean of 1.0 vote changes per run across all conditions, while Llama-4-Scout ranges from 2.0 (baseline) to 6.0 (open-minded prompt), and is the only model to reach a NOT_GUILTY verdict (1 of 3 runs in the no-initial-vote condition). The same ``open-minded’’ instruction is internalized by Llama and ignored by GPT-4o. (iii) This asymmetry suggests that the intensity of RLHF alignment training, not model capability, is the primary determinant of deliberative flexibility in multi-agent settings. Flexibility, not capability, tracks human deliberation. The work is framed as an exploratory study and discusses implications for jury-of-LLMs evaluation and multi-agent debate.
[AI-92] rojan Hippo: Weaponizing Agent Memory for Data Exfiltration
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)代理系统中长期记忆机制引入的新型持久化内存攻击问题,特别是针对“Trojan Hippo”这类在真实威胁场景下可触发的隐蔽攻击。此类攻击通过单次不可信工具调用(如伪造邮件)将潜伏载荷植入代理的长期记忆,在用户后续讨论金融、健康或身份等敏感话题时激活,并泄露高价值个人数据。解决方案的关键在于提出一个动态评估框架,包含两个核心组件:一是基于OpenEvolve的自适应红队测试基准,用于持续演化攻击以压力测试防御机制与不同记忆后端;二是首个面向能力感知的安全性/效用权衡分析方法,支持根据使用场景理性部署防御策略。实验证明,该框架能有效识别并量化当前前沿模型(如OpenAI和Google模型)在四种典型记忆架构(显式工具记忆、代理记忆、检索增强生成RAG、滑动窗口上下文)下的漏洞,且所测试的四类基础安全防御措施可将攻击成功率降至0–5%,但伴随显著的效用损失,凸显了实际部署中安全-效用权衡仍是开放挑战。
链接: https://arxiv.org/abs/2605.01970
作者: Debeshee Das,Julien Piet,Darya Kaviani,Luca Beurer-Kellner,Florian Tramèr,David Wagner
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent’s long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and this http URL introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves up to 85-100 percent ASR against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory-system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0-5 percent), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01970 [cs.CR] (or arXiv:2605.01970v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.01970 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-93] RAP: Tail-aware Ranking Attack for World-Model Planning
【速读】:该论文旨在解决世界模型(World Models)在长期规划过程中所面临的新型后门攻击安全问题。传统后门攻击主要针对局部特征、单步预测或即时策略输出,但对具备动态先验和规划能力的世界模型效果有限,因其能吸收或抵消浅层扰动。论文指出,世界模型的潜在脆弱性源于其想象轨迹的长尾排序结构——通过干扰少数关键决策轨迹的相对顺序,即可系统性地劫持规划结果。解决方案的关键在于提出TRAP框架:该框架采用一种面向尾部的排序损失函数,聚焦于决策关键轨迹的优化;并引入双重门控机制以稳定训练过程,并精确控制攻击惩罚的触发时机与作用范围。实验表明,TRAP能在DreamerV3和TD-MPC2等多任务场景下持续诱导行为偏离并显著降低性能,凸显了对基于世界模型代理进行专门安全评估的必要性。
链接: https://arxiv.org/abs/2605.01950
作者: Siyuan Duan,Ke Zhang,Xizhao Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:World models enable long-horizon planning by internally generating and evaluating imagined trajectories, making them a promising foundation for generalist agents. However, this imagination-driven decision process also introduces new security risks. Existing backdoor attacks typically aim to manipulate local features, one-step predictions, or instantaneous policy outputs. While such objectives may suffice for weaker reactive models, they are often ineffective against world models, where the learned dynamics prior and planning process can absorb or wash out the effects of shallow perturbations. More importantly, we find that world models exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision-critical trajectories can systematically hijack planning. To exploit this vulnerability, we propose TRAP, a backdoor attack framework for world models that targets imagined trajectory ranking. TRAP combines a tail-aware ranking loss to focus optimization on decision-critical trajectories with dual gating mechanisms that stabilize optimization and regulate when and where the attack penalty is applied. Under trigger conditions, TRAP alters the relative ranking of imagined trajectories to redirect planning outcomes, while largely maintaining the normal ranking structure on clean inputs. Experiments on DreamerV3 and TD-MPC2 across diverse tasks show that TRAP consistently induces sustained behavioral deviations and significant performance degradation, highlighting the need for dedicated security evaluation of world-model-based agents. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01950 [cs.LG] (or arXiv:2605.01950v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01950 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-94] PepSpecBench: A Unified Evaluation Benchmark for Peptide Tandem Mass Spectrometry Prediction
【速读】:该论文旨在解决肽段串联质谱(peptide MS/MS)谱预测领域中存在的三大评估挑战:一是数据预处理不一致和模型输出空间不兼容导致的公平比较困难;二是数据划分策略 flawed 导致序列泄露并高估性能;三是缺乏跨物种基准测试和对实验条件敏感性的系统评估。解决方案的关键在于提出 PepSpecBench,一个统一的基准平台,其核心创新包括:标准化多个公共数据集的数据预处理流程、采用严格的骨架-不相交(backbone-disjoint)划分策略防止序列泄露、在共享碎片离子表示空间中评估多种模型架构,并引入多物种评价套件及基于物理意义的元数据扰动探针以量化模型鲁棒性和仪器感知能力。
链接: https://arxiv.org/abs/2605.01945
作者: Zhiwen Yang,Pan Liu,Yifan Li,Yunhua Zhong,Jun Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 7 figures
Abstract:Tandem mass spectrometry provides a high-throughput framework for identifying and quantifying proteins in complex biological samples. In computational proteomics, predicting peptide MS/MS spectra is a critical task, enabling downstream applications such as large-scale peptide identification and quantification. While deep learning architectures have substantially improved prediction accuracy, three evaluation challenges obscure the true progress of the field. First, inconsistent data preprocessing and incompatible model output spaces hinder fair model comparison. Second, flawed data splitting strategies can permit hidden sequence leakage and inflate reported performance. Third, existing evaluations typically lack comprehensive cross-species benchmarking and systematic assessment of model robustness to influential experimental conditions. To address these challenges, we propose PepSpecBench, a unified benchmark for peptide MS/MS spectrum prediction. PepSpecBench standardizes data preprocessing across complementary public datasets, enforces a strict backbone-disjoint splitting strategy to eliminate sequence leakage, and evaluates diverse architectures within a shared fragment-ion representation space. It further introduces a comprehensive multi-species evaluation suite and physically grounded metadata perturbation probes to assess model robustness and instrument awareness. We uncover previously unrecognized performance discrepancies and robustness limitations across six representative models, providing actionable insights for future model design, evaluation and practical deployment.
[AI-95] Stochastic Sparse Attention for Memory-Bound Inference ICML2026
【速读】:该论文旨在解决长上下文场景下自回归解码(autoregressive decoding)因频繁访问键值缓存(KV cache)而导致的带宽瓶颈问题,即生成每个标记(token)时需读取全部 $ n_k $ 个 key 和 value 向量,显著限制了推理效率。解决方案的关键在于提出 Stochastic Additive No-mulT Attention (SANTA),通过从 softmax 后的注意力分布中采样 $ S \ll n_k $ 个索引,并仅聚合对应的 value 行,从而实现对 post-softmax value aggregation 的无偏估计;同时用 gather-and-add 操作替代传统的 multiply-accumulate 计算,大幅降低计算复杂度与内存带宽需求。进一步引入分层采样(stratified sampling)以减少方差并适配 GPU 并行架构,在 NVIDIA RTX 6000 Ada 上实现了比 FlashInfer 和 FlashDecoding 快 1.5 倍的注意力核函数加速,且在 32k token 上保持基线精度。此外,论文还提出 Bernoulli $ qK^\mathsf{T} $ 采样作为互补策略,用于稀疏化 score 阶段的 key-feature 访问,进一步提升能效。两种方法均与上游技术如三值量化、低秩投影和 KV 缓存压缩正交,共同推动面向稀疏、免乘法器、节能的推理方向发展。
链接: https://arxiv.org/abs/2605.01910
作者: Kyle Lee,Corentin Delacour,Kevin Callahan-Coray,Kyle Jiang,Can Yaras,Samet Oymak,Tathagata Srimani,Kerem Y. Camsari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted to ICML 2026. Code available at this https URL
Abstract:Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all n_k key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling S \ll n_k indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating 1.5\times decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli qK^\mathsfT sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: this https URL
[AI-96] Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在面临基于角色(persona-based)越狱攻击时的安全性不足问题。当前安全对齐技术难以有效防御此类攻击,且缺乏系统性的防御机制约束。解决方案的关键在于提出一种对抗自博弈框架——Persona-Invariant Alignment (PIA),其核心由两部分组成:一是在攻击侧引入Persona Lineage Evolution (PLE),通过谱系信用传播高效探索高风险角色空间;二是在防御侧采用Persona-Invariant Consistency Learning (PICL),基于结构分离假设,利用单边KL散度约束实现安全决策与角色上下文的结构解耦,从而在角色变换下保持行为安全性。实验证明,该方法显著降低攻击成功率(ASR)的同时维持模型通用能力,验证了其鲁棒性和优越性。
链接: https://arxiv.org/abs/2605.01899
作者: Jiajia Li,Xiaoyu Wen,Zhongtian Ma,Shuyue Hu,Qiaosheng Zhang,Zhen Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model’s general capability, thereby validating the superiority and robustness of this alignment paradigm. Codes are available at this https URL.
[AI-97] Sheaf-Theoretic Planning : A Categorical Foundation for Resilient Multi-Agent Autonomous Systems
【速读】:该论文旨在解决多智能体系统(Multi-Agent System, MAS)在面对物理世界中随机性和对抗性因素时,如何实现鲁棒自主导航的问题。传统方法依赖于符号逻辑与控制理论的结合,采用事件演算(event calculus)和情境演算(situation calculus)等单体逻辑模型来刻画动作、变化及时间持久性,但这些方法受限于封闭世界假设,在遭遇未观测到的干预、计划中断或信念与现实不一致的情形下失效。论文提出的解决方案是基于拓扑理论(topos theory)与层语义(sheaf semantics)的层理论规划(Sheaf-Theoretic Planning, STP),其关键在于将多智能体协调问题建模为具有局部-全局一致性约束的层结构,从而突破传统逻辑框架对封闭世界的依赖,为构建具备适应性和鲁棒性的下一代自主系统提供数学基础与实现路径。
链接: https://arxiv.org/abs/2605.01879
作者: Manuel Hernández,Eduardo Sánchez-Soto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The challenge of engineering autonomous agents capable of navigating the stochastic and adversarial nature of the physical world has historically resided at the intersection of symbolic logic and control theory. Traditional multi-agent system (MAS) frameworks have relied heavily on monolithic logical models – primarily variations of the event calculus and situation calculus – to represent action, change, and temporal persistence. While these classical systems provide robust solutions to the frame problem through mechanisms like circumscription and successor state axioms, they are inherently limited by a closed-world assumption that fails in the face of unobserved agent interventions, plan interruptions, and divergent belief-reality states. The paradigm of Sheaf-Theoretic Planning (STP) emerges as a transformative alternative, grounding the problem of multi-agent coordination under the mathematical structures of topos theory and sheaf semantics. This report provides an exhaustive analysis, justification, and extension of the STP framework, exploring its categorical foundations, implementation feasibility, and role in the future of resilient autonomous systems.
[AI-98] Leverag ing Data Symmetries to Select an Optimal Subset of Training Data under Label Noise
【速读】:该论文旨在解决在存在标签噪声的训练数据中,如何有效识别出一个子集以训练出性能接近无噪声数据训练结果的分类器的问题。其核心挑战在于:当数据维度较高时,传统基于k近邻(k-NN)的方法(如cutstats)在筛选低噪声样本时表现不佳。解决方案的关键在于引入数据不变性(data invariance)和潜在对称性知识,通过利用这些先验信息来提升k-NN的准确性,从而更可靠地识别出高质量子集;进一步地,在现实场景中即使仅部分掌握不变性信息,学习到的不变表示仍能显著改善子集选择效果,使最终模型逼近贝叶斯最优分类器(Bayes optimal classifier)。
链接: https://arxiv.org/abs/2605.01874
作者: Kumar Shubham,Pavan Karjol,Kiran M K,Prathosh AP
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The performance of machine learning models often relies on large labeled datasets; however, data collected from diverse sources can contain label noise. Recent work has shown that, in noisy settings, there may exist a subset of the training data on which models can achieve performance comparable to training on a noise-free dataset. A widely used method for identifying such subsets is cutstats, which employs k-nearest neighbors (k-NN) to detect low-noise samples. However, its performance on high-dimensional data remains largely unexplored. In this work, we formally establish that the performance of a classifier trained on a subset of a noisy dataset selected via cutstats is influenced by the accuracy of k-NN. We further demonstrate that, in noisy environments, exploiting data invariance and knowledge of underlying symmetries can significantly enhance the performance of k-NN, bringing it closer to the Bayes optimal classifier even in high-dimensional regimes. Finally, we show that for real-world scenarios, where information about the underlying invariance is only partially known, learnt invariant representations can still facilitate the identification of near-optimal subsets.
[AI-99] ShiftLIF: Efficient Multi-Level Spiking Neurons with Power-of-Two Quantization
【速读】:该论文旨在解决标准脉冲神经网络(Spiking Neural Networks, SNNs)中泄漏积分与发放(Leaky Integrate-and-Fire, LIF)神经元仅通过二进制脉冲通信所导致的信息表达能力受限的问题。现有改进方案虽引入多级脉冲以提升信息传输效率,但常依赖均匀量化策略,这与膜电位分布不匹配,或引入高能耗的突触乘法运算。其解决方案的关键在于提出ShiftLIF神经元模型,该模型将膜电位映射至对数间隔的2的幂次脉冲集合,从而在小幅度区域实现更精细的表示,同时利用位移和累加操作实现无乘法器的突触计算,兼顾了脉冲级表达能力的提升与硬件友好性,显著优化了跨模态边缘感知任务中的准确率-能效权衡。
链接: https://arxiv.org/abs/2605.01866
作者: Kaiwen Tang,Di Yu,Jiaqi Zheng,Changze Lv,Qianhui Liu,Zhanglu Yan,Weng-Fai Wong
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Spiking neural networks (SNNs) are promising for edge sensing due to their event-driven computation and temporal filtering capability. However, standard leaky integrate-and-fire (LIF) neurons communicate only through binary spikes, which severely limit representational capacity. Existing multi-level spiking neurons improve information transmission, but often rely on uniform quantization that mismatches membrane-potential distributions or introduces costly synaptic multiplications. In this paper, we propose ShiftLIF, a multi-level spiking neuron that maps membrane potentials to a logarithmically spaced power-of-two spike set. This design provides finer representation in the small-amplitude regime, where membrane potentials are densely concentrated, while enabling multiplier-free synaptic computation through bit-shift and accumulation operations. As a result, ShiftLIF improves spike-level expressiveness without sacrificing the hardware-friendly nature of standard SNN computation. We evaluate ShiftLIF on 10 datasets spanning wireless, acoustic, motion, and visual sensing tasks. Results show that ShiftLIF consistently matches or exceeds the accuracy of existing multi-level spiking neurons while maintaining synaptic energy consumption close to standard binary LIF. These results indicate that ShiftLIF provides a favorable accuracy-efficiency trade-off for cross-modal edge sensing.
[AI-100] NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
【速读】:该论文旨在解决多轮任务中代理(agent)行为一致性评估不足的问题,即传统仅基于结果(outcome-only)的评估方法无法准确判断模型是否在多步交互中维持了必要的承诺(commitment),从而导致对模型真实能力的误判。其解决方案的关键在于提出 NeuroState-Bench 基准,通过定义明确的“侧查询探针”(side-query probes)来直接测量模型状态(state)的完整性,而非依赖隐含激活的推断;该方法实现了人类校准的承诺完整性(commitment integrity)量化,并在包含144个确定性任务和306个探针的基准体系中验证了其有效性,揭示了任务成功率与承诺完整性之间存在显著分化,且后者更具稳定性与诊断价值。
链接: https://arxiv.org/abs/2605.01847
作者: Jia Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 11 figures
Abstract:Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark-defined side-query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32-profile evaluation contains a fixed 16-profile local subset and a matched 16-profile hosted large-model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977. Empirically, task success and commitment integrity diverge across this expanded grid: the success leader is not the integrity leader, 31 of 32 profiles change rank when integrity replaces task success, and integrity rankings are more stable under distractor perturbation. The primary confidence-free score HCCIS-CORE reaches 0.8469 AUC and 0.6992 PR-AUC for post-probe diagnostic discrimination of terminal task failure; the legacy full heuristic variant HCCIS-FULL reaches 0.7997 AUC and 0.6410 PR-AUC. Probe accuracy and state drift achieve slightly higher ROC-AUC, 0.8587, and better Brier/ECE, while HCCIS-CORE has substantially higher point-estimate PR-AUC and remains more closely tied to the benchmark’s intended construct. The exploratory neural-augmented variant HCCIS+N is weaker overall, and a randomized subspace control approaches chance. NeuroState-Bench therefore contributes a calibrated evaluation axis for exposing commitment failures over a broader model grid than the original local-only subset.
[AI-101] Repurposing and Evaluating the (In)Feasibility of Dataset Poisoning enabled Watermarking for Contrastive Learning
【速读】:该论文旨在解决对比学习(Contrastive Learning, CL)中数据投毒后门攻击的脆弱性问题,同时探索其在数据集知识产权(IP)保护中的潜在应用价值。现有研究虽揭示了CL模型对后门攻击的敏感性,但对其泛化能力与鲁棒性的系统评估仍不足。论文的关键解决方案在于发现触发样本(trigger samples)与干净样本之间存在可区分的统计差异,并将其重新利用为一种水印机制;进一步提出多层级水印方案,适配特征级、软标签和硬标签输出,通过统一密度度量实现统计验证,从而在保证水印可信性的同时,平衡保真度、可验证性和鲁棒性之间的权衡。
链接: https://arxiv.org/abs/2605.01834
作者: Zhiyang Dai,Yansong Gao,Boyu Kuang,Haodong Li,Qi Chang,Gaurav Varshney,Derek Abbott,Anmin Fu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Contrastive learning (CL) reduces annotation cost via auto-derived supervisory signals. Since large-scale in-house CL datasets are infeasible, reliance on third-party or internet data is common. Recent studies show CL models are vulnerable to data-poisoning backdoor attacks, but their generalization and robustness are underexplored. We systematically evaluate existing data-poisoning backdoor attacks on CL, revealing limitations: poor dataset adaptability, low success rates, limited portability, and restrictive assumptions (e.g., downstream task knowledge). Interestingly, trigger samples exhibit distinguishable statistical divergence from clean samples, which inspires repurposing it as a watermark for dataset IP protection. Direct repurposing is challenging due to low success rates; we overcome this by statistical verification using a unified density metric. We further propose a multi-level watermarking scheme adapting to feature-level, soft-label, or hard-label outputs in CL. Experiments show some backdoor attacks can be repurposed as effective watermarks with trade-offs among fidelity, verifiability, and robustness. This work demonstrates weak backdoor effects become reliable signals for dataset IP protection in challenging CL settings.
[AI-102] Remote Action Generation: Remote Control with Minimal Communication
【速读】:该论文旨在解决在通信受限条件下,控制器通过有限带宽向缺乏直接奖励访问权限的执行者(actors)远程传递动作指导的问题,尤其在动作空间较大或为连续空间时,传统直接传输动作信息会导致通信效率低下。解决方案的关键在于提出一种名为“引导远程动作采样策略”(Guided Remote Action Sampling Policy, GRASP)的新框架:控制器仅发送最小化信息以引导执行者本地采样动作,而非传输完整动作指令;执行者基于控制器的目标策略进行重要性采样生成动作,并将接收到的引导信号作为监督学习数据来优化自身采样能力,从而逐步减少未来对通信的依赖。该方法实现了显著的通信压缩效果,在所有实验中平均减少12倍的数据传输量(连续动作空间达50倍),相较奖励传输方式更是降低41倍。
链接: https://arxiv.org/abs/2605.01833
作者: Szymon Kobus,Deniz Gündüz
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We address the challenge of remote control where one or more actors, lacking direct reward access, are steered by a controller over a communication-constrained channel. The controller learns an optimal policy from observed rewards and communicates action guidance to the actors, which becomes demanding for large or continuous action spaces. To achieve rate-efficient communication throughout this interactive learning and control process, we introduce a novel framework leveraging remote generation. Instead of transmitting full action specifications, the controller sends minimal information, enabling the actors to locally generate actions by sampling from the controller’s evolving target policy. This guided sampling is facilitated by an importance sampling approach. Concurrently, the actors use the received guidance as supervised learning data to learn the controller’s policy. This actor-side learning improves their local sampling capabilities, progressively reducing future communication needs. Our solution, Guided Remote Action Sampling Policy (GRASP), demonstrates significant communication reduction, achieving an average 12-fold data reduction across all experiments (50-fold for continuous action spaces) compared to direct action transmission, and a 41-fold reduction compared to reward transmission.
[AI-103] Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards
【速读】:该论文旨在解决当前基于单样本强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)方法中,因采用静态奖励方差启发式选择训练实例而导致的迁移价值误导问题。现有方法依赖历史奖励方差作为筛选标准,但研究发现该指标并不能准确反映问题对模型推理能力提升的潜力。解决方案的关键在于提出一种Selector-Guided Autonomous Curriculum (SGAC) 方法,通过一个可学习的选择器模型(selector model)在多维特征空间(包括成功概率、奖励方差、输出分歧度(entropy)和语义难度等级)上动态评估候选问题,并以输出分歧度(即熵)作为最强预测因子来排序和选取最优问题进行微爆式(micro-bursts)1-shot GRPO 训练,从而实现更高效的自动课程学习与数据筛选机制。实证结果表明,该方法在Hendrycks MATH基准上使Qwen2.5-Math-1.5B模型准确率提升至68.0%,显著优于现有最先进方法(64.0%)和原始1-shot RLVR检查点(66.0%),验证了基于熵的数据智能筛选在极小数据条件下对推理能力提升的有效性。
链接: https://arxiv.org/abs/2605.01823
作者: Rudray Dave,Vedang Dubey,Smit Deoghare,Sudhakar Mishra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, Reinforcement Learning from Verifiable Rewards (RLVR) has been established as a highly effective technique for augmenting the math reasoning skills of Large Language Models (LLMs) based on a single instance. Current state-of-the-art 1-shot RLVR models adopt heuristics for selecting instances, mostly based on historical variance in rewards, which we find to be inherently misleading as a measure of transferability value. In this paper, we propose a Selector-Guided Autonomous Curriculum (SGAC) approach, which employs a learnable selector model on a multi-dimensional feature space consisting of success probability, reward variance, output disagreement (entropy), and semantic difficulty level, instead of the static reward variance heuristic. In our empirical evaluation on pools of candidate problems, we observed that output disagreement, rather than reward variance, is the strongest predictor of reasoning gains in subsequent iterations. Leveraging this finding, we develop an autonomous curriculum algorithm for dynamically siphoning candidate problems from a large pool, ranking them by the learned selector, and running micro-bursts of 1-shot GRPO. Our framework is evaluated using the Hendrycks MATH benchmark, with the Qwen2.5-Math-1.5B model serving as the baseline. Our framework obtains an accuracy of 68.0% on the hold-out dataset, which is better than the accuracy obtained from the state-of-the-art model, 64.0%, as well as the 1-shot RLVR checkpoint proposed by Wang et al., which achieved an accuracy of 66.0%. The results confirm that entropy-based intelligent data curation leads to strict reasoning improvement over static training methods, particularly in severely limited data conditions.
[AI-104] Federated Semi-Supervised Graph Neural Networks with Prototype-Guided Pseudo-Labeling for Privacy-Preserving Gestational Diabetes Mellitus Prediction
【速读】:该论文旨在解决妊娠期糖尿病(Gestational Diabetes Mellitus, GDM)早期风险分层中面临的两大现实挑战:标签稀缺性(label scarcity)和数据隐私问题(data privacy),即电子健康记录(Electronic Health Records, EHR)中大量缺乏确诊标签,且医院间因隐私限制无法共享患者级数据。其解决方案的关键在于提出一种隐私保护的联邦半监督学习框架 FedTGNN-SS,通过以下核心机制实现:(1) 基于原型引导的伪标签生成与邻域一致性约束,有效利用未标注样本;(2) 自适应图结构优化,定期用学习到的嵌入更新 k-最近邻(k-nearest-neighbor, k-NN)图以提升拓扑准确性;(3) 仅对连续变量应用临床感知的一致性增强(clinical-aware consistency augmentation),保留医学合理性;(4) 安全的原型共享策略,仅交换类别级别的中心点(class-level centroids),保障数据隐私。该方法在多个糖尿病相关数据集上显著优于11种联邦基线模型,并在极端标签缺失场景下仍保持高AUROC性能。
链接: https://arxiv.org/abs/2605.01810
作者: G. Victor Daniela,A. Mallikarjuna Reddya,Uday Kumar Addankia,Sridhar Reddy Gogua,Sravanth Kumar Ramakuria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Gestational Diabetes Mellitus (GDM) is a high-prevalence pregnancy complication that requires accurate early risk stratification to reduce maternal and fetal morbidity. However, real-world clinical deployment of machine learning is hindered by two coupled constraints: (i) label scarcity, where a large fraction of electronic health records (EHR) lack confirmed diagnostic labels, and (ii) data privacy, which prevents sharing patient-level data across hospitals. This paper proposes FedTGNN-SS, a privacy-preserving federated semi-supervised framework for clinical tabular EHR. Each hospital builds a local k-nearest-neighbor patient similarity graph and trains a topology-adaptive GNN encoder. To robustly exploit unlabeled records, FedTGNN-SS combines (1) prototype-guided pseudo-labeling with neighborhood agreement, (2) adaptive graph refinement that periodically updates the k-NN graph using learned embeddings, (3) clinical-aware consistency augmentation applied only to continuous variables, and (4) privacy-safe prototype sharing that exchanges only class-level centroids. Across three diabetes-related datasets (GDM: N = 3,525; Pima: N = 768; Early Stage: N = 520) under 10%-80% missing labels per silo, FedTGNN-SS achieves 56 significant wins ( p 0.05 ) against 11 federated baselines and attains strong AUROC under extreme scarcity (Pima: 0.8037 at 80% missing, Early Stage: 0.9634 at 80% missing).
[AI-105] MD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
【速读】:该论文旨在解决音乐与舞蹈协同生成(music-dance co-generation)任务中缺乏有效评估标准和模型优化方向的问题,尤其关注跨模态节奏对齐(rhythmic alignment)这一关键挑战。当前主流的音频-视觉一致性指标无法准确衡量音乐节奏与舞蹈动作在细粒度时间尺度上的耦合关系,导致现有生成模型虽能产出高质量音视频内容,但在节奏同步性方面表现不稳定。解决方案的关键在于提出TMD-Bench基准测试体系,其融合可计算的物理指标与感知层面的多模态判断,并基于节奏对齐的数据集和结构化的音乐描述工具(Music Captioner)构建统一评估框架;同时引入RhyJAM作为统一基线模型,在节奏对齐数据上训练后实现了与商业模型相当的单模态质量,且在节拍级同步性上显著优于现有方法,验证了显式优化节奏与运动连贯性的可行性。
链接: https://arxiv.org/abs/2605.01809
作者: Xiaoda Yang,Majun Zhang,Changhao Pan,Nick Huang,Yang Yuguang,Fan Zhuo,Pengfei Zhou,Jin Zhou,Sizhe Shan,Shan Yang,Miles Yang,Yang You,Zhou Zhao
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Unified audio-visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio-video synthesis to music-dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current evaluation practice. We introduce TMD-Bench, a benchmark for text-driven music-dance co-generation that assesses systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark integrates computable physical metrics with perceptual multimodal judgments, and is supported by a curated rhythm-aligned music-dance dataset and a fine-grained Music Captioner for structured music semantics. TMD-Bench further reveals that (i) modern commercial audio-visual models, such as Veo 3 and Sora 2, produce high-quality music and video, while rhythmic coupling remains less consistently optimized and leaves room for improvement, and (ii) our unified baseline RhyJAM trained on rhythm-aligned data achieves competitive beat-level synchronization while maintaining competitive unimodal fidelity. This presents prospects for building next-generation music-dance models that explicitly optimize rhythmic and kinetic coherence.
[AI-106] Neural Decision-Propagation for Answer Set Programming
【速读】:该论文旨在解决答案集编程(Answer Set Programming, ASP)与神经网络融合过程中因依赖传统求解器而导致的可扩展性瓶颈问题。其解决方案的关键在于提出一种新的稳定模型计算方法——决策传播(decision-propagation, DProp),该方法通过交替执行假值决策与真值传播来高效生成稳定模型;进一步地,作者构建了可微分的神经决策传播(Neural DProp, NDProp),利用神经网络进行决策计算、模糊评估实现传播过程,从而在学习决策启发式规则的同时提升神经符号推理的准确性和可扩展性。
链接: https://arxiv.org/abs/2605.01797
作者: Thomas Eiter,Katsumi Inoue,Sota Moriyama
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Integration of Answer Set Programming (ASP) with neural networks has emerged as a promising tool in Neuro-symbolic AI. While existing approaches extend the capabilities of ASP to real world domains, their reasoning pipelines depend on classical solvers, which is a bottleneck for scalability. To tackle this problem, we propose a new method to compute stable models, called decision-propagation (DProp), which alternates falsity decisions and truth propagations. Successful DProp computations are shown to capture the stable model semantics. We then develop Neural DProp (NDProp), a differentiable extension of DProp with neural computation for decisions and fuzzy evaluation for propagations. We evaluate the capabilities of NDProp for learning decision heuristics as well as neuro-symbolic integration, and compare it with existing neuro-symbolic approaches. The results show that NDProp can learn to efficiently compute stable models, and it improves accuracy and scalability on neuro-symbolic benchmarks.
[AI-107] Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation
【速读】:该论文旨在解决高质量音乐生成中结构(structure)与保真度(fidelity)通常被分离在不同表示空间的问题,即传统方法往往先建模高层结构,再通过扩散或神经解码阶段重建细节。其解决方案的关键在于构建一个64层的残差向量量化(Residual Vector Quantization, RVQ)声学标记层次结构,并提出一种从粗到精的两阶段生成框架:首先由主干模型生成整首歌曲的粗粒度声学标记,随后超分辨率模型在同一声学标记空间内逐层细化细粒度标记;该超分辨率阶段采用并行时间处理方式,在固定62步推理过程中完成全轨道优化。此外,引入混合注意力训练策略——对齐目标使用因果注意力,层间细化使用全注意力,从而实现歌词对齐与细节重建的联合优化。关键发现是,纯声学标记语言建模即可自然涌现出文本-人声对齐能力,无需额外语义标记阶段,且主干模型预训练初始化显著提升超分辨率模型的收敛速度和最终质量,表明高保真音乐生成可在统一的声学标记层次中逐步实现,无需跨异构表示空间的设计。
链接: https://arxiv.org/abs/2605.01790
作者: Jiafeng Liu,Yuanliang Dong,Hongjia Liu,Yuqing Cheng,Zhancheng Guo,Huijing Liang,Wenbo Zhan,Yuming Sun,Xiaobing Li,Feng Yu,Maosong Sun
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we explore an alternative view: both may be progressively modeled within a single deep acoustic-token hierarchy. To study this, we build a 64-layer residual vector quantization (RVQ) acoustic representation and propose a two-stage coarse-to-fine generation framework. A backbone model first generates coarse acoustic tokens for the full track, and a super-resolution model then completes finer tokens within the same acoustic token space. The super-resolution stage works at full-track scale and refines tokens layer by layer while running in parallel over time, leading to a fixed 62-step inference process. To jointly improve lyric alignment and fine-detail reconstruction, we further introduce hybrid-attention training: the alignment objective uses causal attention, while layer-wise refinement uses full attention. A key finding is that text–vocal alignment can emerge within pure acoustic-token language modeling, without requiring a separate semantic token stage. Moreover, initializing the super-resolution model from the trained backbone significantly improves convergence and final quality. Taken together, our results suggest that high-quality music generation can be effectively pursued without separating structure and fidelity into heterogeneous representation spaces. Instead, both can be progressively modeled within a unified acoustic-token hierarchy, pointing toward a simpler and more unified path to high-quality music generation.
[AI-108] DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents
【速读】:该论文旨在解决视觉数据构建过程中可控性不足的问题,即在图像编辑和多模态理解任务中,高质量标注数据的生成往往依赖于反复迭代的生成、检查、修正、筛选与导出过程,而传统方法缺乏系统化的闭环管理机制。其解决方案的关键在于提出DataEvolver框架,该框架以明确的目标追踪为核心,通过持久化数据产物(artifacts)、有限范围的纠正动作以及接受决策,组织起生成-反馈-修正的闭环流程;具体实现上包含两个耦合循环:单样本内的生成时自校正和跨数据集轮次的验证时自扩展,从而实现从场景感知生成到反馈驱动校正再到双门控验证的渐进式优化路径。
链接: https://arxiv.org/abs/2605.01789
作者: Qisong Zhang(1),Wenzhuo Wu(1),Zhuangzhuang Jia(1),Yunhao Yang(1),Huayu Zhang(2),Xianghao Zang(2),Zhixiang He(2),Zhongjiang He(2),Kongming Liang(1),Zhanyu Ma(1) ((1) School of Artificial Intelligence, Beijing University of Posts and Telecommunications, (2) Institute of Artificial Intelligence (TeleAI), China Telecom)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Constructing controllable visual data is a major bottleneck for image editing and multimodal understanding. Useful supervision is rarely produced by a single rendering pass; instead it emerges through iterative generation, inspection, correction, filtering, and export. We present DataEvolver, a closed-loop visual data engine that organizes this process around explicit goals, persistent artifacts, bounded corrective actions, and acceptance decisions. DataEvolver supports multiple artifact types, including RGB images, masks, depth maps, normal maps, meshes, poses, trajectories, and review traces. In the current release, the system operates through two coupled loops: generation-time self-correction within each sample and validation-time self-expansion across dataset rounds. We validate the framework on an image-level object-rotation setting. With a fixed Qwen-Edit LoRA probe, our final Ours+DualGate model outperforms both the unadapted base model and a public multi-angle LoRA on SpatialEdit and a held-out evaluation set. Ablations show a consistent improvement path from scene-aware generation to feedback-driven correction and dual-gated validation. Beyond the released rotation data, our main contribution is a reusable framework for building visual datasets through explicit goal tracking, review, correction, and acceptance loops.
[AI-109] Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents
【速读】:该论文旨在解决生成式内容(Procedural Content Generation, PCG)在游戏开发中因算法自动生成关卡而导致的评估难题,例如内容失衡、路径阻塞、重复性过高或技术上不可解等问题。其解决方案的关键在于将地形生成、环境对象放置与自主代理驱动的实时评估统一整合进同一个运行时循环中:通过受波函数坍缩(Wave Function Collapse, WFC)启发的约束驱动机制进行对象布局,并利用两个自治评估代理——空中扫描器(基于射线投射和体积物理扫掠)与地面遍历代理(从导航角度验证路径)——在玩家到达前主动检测潜在问题区域;该设计摒弃了传统离线评估模式,实现了生成与验证的一体化闭环,同时构建了可量化的评估框架,涵盖可玩性、多样性、可控性和运行时性能等PCG核心维度。
链接: https://arxiv.org/abs/2605.01783
作者: Rishabh Kar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 2 figures
Abstract:Procedural Content Generation (PCG) enables game content to be created algorithmically without direct manual level-design effort, but it introduces a serious evaluation problem: generated content may become unbalanced, blocked, repetitive, or technically unsolvable. This paper presents Momentum, an endless-runner game that integrates runtime terrain generation, environment object spawning, and autonomous agent-based evaluation into a single gameplay loop. Ground tiles and environmental objects are generated dynamically as the player advances, object placement follows a constraint-driven mechanism inspired by Wave Function Collapse (WFC), and the runtime navigation surface is rebuilt asynchronously to remain consistent with the streamed environment. Two autonomous evaluation agents move ahead of the player and inspect the generated path: an aerial scanner that examines the corridor geometrically, and a ground-traversal agent that validates the same region from a navigational perspective. The evaluation pipeline combines ray casting, volumetric physics sweeps, obstacle-layer filtering, and structured crash reporting to identify problematic generated scenarios before they reach the player. The work demonstrates how generation and validation can be unified within the same runtime loop, rather than treating evaluation as a separate offline pass. Around this implementation, the paper formulates a measurable evaluation framework along the canonical PCG axes of playability, diversity, controllability, and runtime performance, derives a structural saturation bound on the spawner from its own placement constraints, and quantifies the per-segment scanning cost of the agents from first principles.
[AI-110] Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
【速读】:该论文旨在解决大规模多模态模型驱动的多智能体系统(Multi-Agent Systems, MASs)中因“传染性越狱”(infectious jailbreak)导致的安全漏洞问题,即单个智能体被攻破后可迅速扩散至其他智能体,造成系统性失效。现有防御方法依赖全局共享的“治愈因子”(cure factor)来抑制病毒对抗样本(VirAEs),但这种同质化策略仅能表面压制感染,无法实现真正意义上的恢复。本文的关键解决方案是提出一种无需训练的前瞻性引导局部净化(Foresight-Guided Local Purification, FLP)框架:每个智能体通过模拟未来多轮交互的行为轨迹,结合多人格仿真策略以捕捉交互场景多样性,并利用响应多样性作为诊断信号,在检索结果和语义层面识别异常行为;对于感染智能体,采用局部净化机制——短期感染通过即时图像相册回滚处理,长期感染则使用递归二分诊断(Recursive Binary Diagnosis, RBD)定位并清除VirAEs。该方法显著降低最大累积感染率(从>95%降至<5.47%),同时保持交互多样性与良性基线一致。
链接: https://arxiv.org/abs/2605.01758
作者: Yue Ma,Ziyuan Yang,Yi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages
Abstract:Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreak arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
[AI-111] Are LLM s More Skeptical of Entertainment News?
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动化新闻可信度评估中是否存在对不同新闻体裁(如硬新闻与娱乐新闻)的非均衡判别标准问题,即是否更倾向于将合法的娱乐新闻误判为虚假信息。其关键发现在于:尽管整体准确率可能较高,但部分LLM(如DeepSeek-V3.2和GPT-5.2)表现出显著的体裁不对称性——对娱乐新闻的假阳性率明显高于硬新闻(差异达8.8–10.1个百分点,p < .001),且这种偏差并非仅由文本风格导致;进一步实验表明,通过特定提示工程(prompt-based mitigation)可缓解部分模型的错误倾向(例如将DeepSeek-V3.2的假阳性降低约50%),但效果具有模型特异性,无法通用化。因此,论文强调评估应引入按体裁分层的假阳性分析,以揭示隐藏在聚合指标下的系统性偏见。
链接: https://arxiv.org/abs/2605.01727
作者: Huiqian Lai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), co-located with ICWSM 2026, May 26, 2026, Los Angeles, CA, USA
Abstract:Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both p .001 ), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
[AI-112] Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 服务中因请求路由策略不透明而导致的信任危机问题。在实际部署中,AI 服务通常通过版本别名、服务层级、工具选择、区域端点、回退规则或安全处理等多层路由机制响应请求,这些路径虽提升了系统的可扩展性与效率,但若缺乏对用户可见的运行时记录,可能导致成本、质量或责任归属的变化无法被察觉,从而损害可信度。论文提出的关键解决方案是引入“路由凭证”(route receipt)这一运行时透明性元数据结构——它是一个紧凑的记录,能够捕获足以供依赖输出的用户重构关键路由决策的信息,同时避免暴露专有内部逻辑或隐藏推理过程。该方案将路由透明性纳入模型文档体系,与描述训练后模型特性的“模型卡”(model card)形成互补,实现从静态模型描述到动态运行环境追踪的完整可审计链条。
链接: https://arxiv.org/abs/2605.01710
作者: Vincent Schmalbach
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 30 pages
Abstract:AI products often route requests through version aliases, service tiers, tool choices, regional endpoints, fallback rules, or safety handling before responding. These routing steps are documented product surfaces in several widely used AI platforms and serving stacks. Routing helps AI services stay affordable, fast, and available at scale, and it shapes trust. Trust can break when routing changes the cost, quality, or accountability of a response without the user being able to tell what happened. “Which model answered?” is only part of the audit question. The runtime path matters. Adaptive AI systems should produce a runtime transparency artifact called the route receipt. A route receipt is a compact record of the route that served a request. It should capture enough material facts for people relying on the output to reconstruct important routing decisions without exposing proprietary internals or hidden reasoning. Route transparency should be part of model documentation. Model cards describe trained model artifacts, while route receipts describe the runtime conditions under which a particular answer was produced. The paper introduces the route-receipt concept, a minimal schema and redaction model, and a documentation-based survey of selected platforms showing that receipt fragments already exist without a portable per-answer record. Comments: 30 pages Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2605.01710 [cs.AI] (or arXiv:2605.01710v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01710 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Vincent Schmalbach [view email] [v1] Sun, 3 May 2026 04:49:30 UTC (26 KB)
[AI-113] SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)服务中因预填充(prefill)与解码(decode)阶段分离导致的KV缓存(KV-cache)传输瓶颈问题。在分布式部署下,预填充阶段生成的KV缓存需跨物理节点传输至解码阶段,而传统无损压缩算法在在线场景中压缩速度慢、依赖CPU或不适用于浮点数特征,难以满足低延迟要求。解决方案的关键在于提出SplitZip——一种面向GPU加速的无损压缩方法,其核心创新是利用KV激活张量中浮点数指数部分的冗余性:通过离线校准的Top-16指数码本实现快速固定长度编码,并将高频指数值和稀疏的(位置, 值)对分别编码到主路径和逃逸流(escape stream),从而在GPU上实现高效且高吞吐的压缩与解压,实测压缩速率达613.3 GB/s,解压速率达2181.8 GB/s,显著优于现有方案,在端到端传输中带来最高1.32倍加速。
链接: https://arxiv.org/abs/2605.01708
作者: Yipin Guo,Siddharth Joshi
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Contemporary systems serving large language models (LLMs) have adopted prefill-decode disaggregation to better load-balance between the compute-bound prefill phase and the memory-bound decode phase. Under this design, prefill workers generate a KV cache that must be transferred to decode workers before token generation can begin. With these workers residing on different physical systems, this transfer becomes a significant bottleneck to serving LLMs at scale. This bottleneck gets exacerbated for long-input and agentic workloads, which typically require long inputs. Existing lossless codecs are not well suited to this setting as they primarily target offline weight compression, rely on CPU-side, or use variable-length coding that decompresses fast but compresses too slowly for online use. SplitZip is a GPU-friendly lossless compressor for KV-cache transfer. It exploits redundancy in floating-point exponents of KV activations, encoding the most frequent exponent values with fixed-length codes, and encoding (position, value) pairs and value of rare exponents in an escape stream. An offline calibrated top-16 exponent codebook enables online encoding, while the regular dense path and sparse escape correction make both encoding and decoding efficient on GPUs. On real BF16 activation tensors, SplitZip achieves 613.3 GB/s compression throughput and 2181.8 GB/s decompression throughput, substantially outperforming prior lossless compressors on the latency-critical codec path. End-to-end transfer experiments show up to 1.32 \times speedup for BF16 KV-cache transfer, 1.30 \times speedup for TTFT and 1.23 \times increase on Request Throughput.
[AI-114] Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在行为级遗忘(behavioural unlearning)后仍存在可被对抗性探测器恢复的内部记忆痕迹问题。其核心挑战在于如何在不显著损害模型能力的前提下,精准移除这些残留的记忆信号。解决方案的关键在于提出一种基于探针几何对齐(Probe-Geometry Alignment, PGA)的外科式擦除方法:通过在每一层激活空间中沿探测方向对齐激活向量,实现对跨序列记忆签名的精确干预。该方法可在所有测试尺度上将交叉序列探测分数降至随机水平以下,并且对六种对抗性探测变体保持鲁棒性;进一步地,即使面对重新拟合攻击者(re-fitting attacker),PGA 也能在每个相关深度有效防御,同时维持五项零样本能力基准任务的性能损失不超过2.8个百分点(平均Δacc = +0.2pp)。这一成果表明,记忆签名是预训练表示中因果可分离、具有特定表征模式的属性,可通过每层单秩干预彻底清除而几乎不影响模型功能。
链接: https://arxiv.org/abs/2605.01699
作者: Anamika Paul Rupa,Anietie Andy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scale: memorisation-specific gaps of +0.32, +0.19, +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall – projecting it out collapses the signature locally (+0.44 - -0.19) while behavioural recall barely changes – and a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, marking two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe’s live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while preserving five zero-shot capability benchmarks within 2.8 percentage points per task (mean \Deltaacc = +0.2pp). The cross-sequence signature is a real, causally separable, regime-specific property of pretrained representations – removable below chance with a single rank-one intervention per depth at no measurable capability cost.
[AI-115] Latent State Design for World Models under Sufficiency Constraints
【速读】:该论文旨在解决如何设计有效的世界模型(world model)以支持智能体在复杂任务中的决策与行动问题。传统研究常基于架构或应用领域对方法进行分类,但这种方法忽略了不同任务对潜在状态(latent state)功能需求的差异。论文的核心贡献在于提出一种基于功能的角色分类法(functional taxonomy),将世界模型的潜在状态划分为六类角色:预测嵌入(predictive embedding)、递归信念状态(recurrent belief state)、对象/因果结构(object/causal structure)、潜在动作接口(latent action interface)、具身规划接口(grounded planning interface)和记忆基质(memory substrate)。这一分类揭示了如预测充分性与控制充分性之间的区别,以及被动视频预测与反事实动作建模之间的本质差异。解决方案的关键在于:通过评估潜在状态是否满足特定任务的“充分性约束”(sufficiency constraint),来判断其有效性,而非单纯追求信息保留量的最大化;由此构建的多维评估框架可诊断模型在表征、预测、规划、可控性、因果推理、记忆和不确定性处理等方面的能力,最终得出结论——一个可操作的世界模型,其状态构造必须与具体任务相匹配,而非盲目追求信息完整性。
链接: https://arxiv.org/abs/2605.01694
作者: Keon Woo Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A world model matters to an agent only through the state it constructs. That state must preserve some information, discard other information, and support some future function: prediction, control, planning, memory, grounding, or counterfactual reasoning. This paper treats world-model research as latent state design under sufficiency constraints. We propose a functional taxonomy that groups methods by what their latent state is for, rather than by architecture or application domain: predictive embedding, recurrent belief state, object/causal structure, latent action interface, grounded planning interface, and memory substrate. These roles expose distinctions that architecture-based groupings hide, including the gap between predictive sufficiency and control sufficiency, and the gap between passive video prediction and counterfactual action modeling. The taxonomy supports an evaluation framework that judges a model by the sufficiency constraint its latent state was built to satisfy. We compare methods along seven axes: representation, prediction, planning, controllability, causal/counterfactual support, memory, and uncertainty. We use the resulting matrix as a diagnostic for what a latent state preserves, discards, and enables. The conclusion that follows is that an actionable world model is the one whose state construction matches the task, not the one that preserves the most information. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01694 [cs.AI] (or arXiv:2605.01694v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01694 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Keon Kim [view email] [v1] Sun, 3 May 2026 03:19:42 UTC (45 KB)
[AI-116] Class-Aware Adaptive Differential Privacy in Deep Learning for Sensor-Based Fall Detection
【速读】:该论文旨在解决基于传感器的跌倒检测(fall detection)中隐私保护与模型性能之间的矛盾问题。现有差分隐私(differential privacy, DP)方法对所有训练样本统一添加噪声,导致预测性能显著下降。其解决方案的关键在于提出一种类感知自适应差分隐私(Class-Aware Adaptive Differential Privacy, CA-ADP)框架,该框架根据每个小批量数据(mini-batch)中的类别分布动态调整梯度噪声强度,在保障 (\epsilon,\delta)-差分隐私的前提下有效缓解性能损失。结合3D卷积神经网络与双向长短期记忆网络(3D CNN-BiLSTM)架构,实验表明该方法在SisFall、UP-Fall和MobiAct三个公开数据集上分别提升了3.3%、8.5%和7.5%的F-score,且通过Wilcoxon符号秩检验验证了其一致性优势,为实际医疗场景下的隐私保护跌倒检测提供了兼具有效性与形式化隐私保障的新范式。
链接: https://arxiv.org/abs/2605.01679
作者: Joydeb Kumar Sana
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fall detection is a critical task in healthcare, particularly for elderly people. Timely fall detection and treatment can prevent severe injuries. Sensor-based activity data can be used to detect fall. However, this data are highly sensitive and raises significant privacy concerns. Existing privacy approaches apply uniform noise across all training samples, which affects the prediction performance. To address this limitation, we propose a Class-Aware Adaptive Differential Privacy (CA-ADP) framework integrated with a hybrid 3D Convolutional Neural Network and Bidirectional Long Short-Term Memory (3D CNN-BiLSTM) architecture. The CA-ADP mechanism dynamically adjusts the magnitude of noise added to gradients based on the class composition of each mini-batch. This process ensures privacy while mitigates performance degradation. We formally analyze the (\epsilon,\delta) -Differential Privacy guarantee and provide a privacy-utility trade-off analysis. The proposed method is evaluated on three public benchmark datasets, namely SisFall, UP-Fall, and MobiAct. The experimental results show that the proposed privacy model achieves improvements of 3.3%, 8.5%, and 7.5% over the conventional privacy-based model in terms of F-score for the SisFall, UP-Fall, and MobiAct datasets, respectively. Comparisons with prior studies show that the CA-AD based framework achieves competitive performance and provides formal privacy guarantees, which are largely overlooked in existing studies. Wilcoxon signed-rank tests confirm that the proposed mechanism consistently outperforms conventional differential privacy. Those results establish the proposed CA-ADP framework as an effective approach to privacy-preserving fall detection in real-world healthcare settings.
[AI-117] AI Alignment via Incentives and Correction
【速读】:该论文旨在解决生成式 AI(Generative AI)中的对齐(alignment)问题,即如何设计激励机制以确保智能体在执行任务时的行为符合人类意图。传统方法常将偏差视为系统性故障,而本文提出从法律经济学中的威慑与执行模型出发,将对齐建模为一个双层优化问题:在solver-auditor结构中,solver可能因收益驱动而产生错误或误导性输出,审计者则需权衡监督成本与收益;奖励设计必须同时影响solver行为和审计者监督意愿,形成稳定的均衡状态。解决方案的关键在于引入基于bandit算法的外层搜索过程,利用交互反馈动态调整奖励配置,从而维持有效的监督压力并提升主体对齐结果,实验表明该方法显著减少了大语言模型(LLM)代码流水线中的幻觉性错误尝试。
链接: https://arxiv.org/abs/2605.01643
作者: Rohit Agarwal,Joshua Lin,Mark Braverman,Elad Hazan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor’s incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01643 [cs.LG] (or arXiv:2605.01643v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01643 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-118] From Packets to Patterns: Interpreting Encrypted Network Traffic as Longitudinal Behavioral Signals
【速读】:该论文旨在解决如何通过被动感知方式(即无需用户主动参与)在大规模场景下持续捕捉与睡眠障碍、压力和孤独感相关的个体行为模式的问题。其核心挑战在于,人类行为难以直接观测,但可通过智能手机网络流量等隐式数据源进行间接推断。解决方案的关键在于:首先采用基于Transformer架构并结合每个用户的适配器(per-user adapters)的模型结构,以同时建模个体间共有的行为结构和个体内部的动态变化;其次,利用稀疏自编码器(sparse autoencoder)提取可解释的行为特征,从而识别出与特定活动模式对应的行为特征;最后,通过广义估计方程(generalized estimating equations)结合Mundlak分解,区分个体间差异与个体内部随时间的变化,揭示不同心理状态的时序特性。该方法不仅验证了加密网络流量作为被动传感模态的有效性,还证明了学习得到的表征优于预定义的网络特征,尤其能捕捉到个体偏离基线的行为动态。
链接: https://arxiv.org/abs/2605.01616
作者: Rameen Mahmood,Omar El Shahawy,Souptik Barua,Zachary Beattie,Jeffrey Kaye,Xuhai "Orson’’ Xu,Danny Yuxing Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 19 pages, 6 figures
Abstract:Human behavior is difficult to observe continuously at scale, yet it leaves measurable traces in everyday device use. We test whether encrypted smartphone network traffic – a ubiquitous, always-on, passive sensing modality – can passively capture behavioral patterns related to sleep, stress, and loneliness. We model shared behavioral structure using a transformer backbone with per-user adapters, allowing the model to represent both typical individual behavior and deviations from it. To make these representations interpretable, we apply a sparse autoencoder to extract behavioral features corresponding to distinct patterns of activity. We relate these features to sleep disturbance, stress, and loneliness using generalized estimating equations with Mundlak decomposition, separating between-person differences from within-person changes over time. We find that the three outcomes reflect distinct temporal structures: stress is primarily associated with stable between-person differences, loneliness with within-person variation, and sleep disturbance with a combination of both. Notably, these within-person dynamics are not captured by predefined network-traffic features, demonstrating the value of learned representations for longitudinal behavioral sensing. These results establish encrypted network traffic as a viable passive sensing modality, revealing interpretable behavioral dynamics – particularly deviations from an individual’s baseline – that are not visible in raw traffic features.
[AI-119] he Case for ESM3 as a General-Purpose AI Model with Systemic Risk Under the EU AI Act
【速读】:该论文旨在解决欧盟人工智能法案(EU AI Act)中关于前沿生物基础模型(如ESM3)是否应被纳入具有系统性风险的通用人工智能模型监管范围的问题。其核心问题在于,当前法规对“系统性风险”定义模糊,导致像ESM3这类生成式生物模型可能未被有效覆盖,从而存在潜在双重用途风险(dual-use risks)。解决方案的关键在于:首先通过映射ESM3在生物风险链中的角色,论证其应受相关义务约束;其次基于AI法案及其补充材料的分类标准进行属性比对,发现ESM3目前尚未被明确规制;最终提出修正措施,建议将此类生物基础模型纳入监管框架,要求其开展风险评估与缓解,以填补现有法律空白。
链接: https://arxiv.org/abs/2605.01611
作者: Taro Qureshi,Jacob Griffith,Koen Holtman,Marcel Mir Teijeiro,Ze Shen Chin,Rokas Gipiškis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 1 figure, Technical AI Safety Conference
Abstract:Due to ambiguity in the wording of the EU AI Act, we examine the question of to what extent frontier biological foundation models such as ESM3 are subject to obligations for general-purpose AI models with systemic risk under the EU AI Act. In this paper, we map ESM3 to the biorisk chain, and conclude that it would be desirable if the providers of ESM3 and similar biological models were subject to these obligations, which would require them to assess and mitigate dual-use risks from their models. We then perform an analysis, comparing the attributes of ESM3 to the classification criteria in the AI Act and the supporting material. We conclude that at this time, ESM3 does not appear to be meaningfully regulated by the Act. We then propose remedies to correct the situation.
[AI-120] Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations
【速读】:该论文旨在解决跨语言概念迁移(cross-lingual concept transport)问题,核心在于验证由 \citetpark2024linear 提出的因果内积(causal inner product)是否能有效实现不同语言间语义概念的对齐与迁移。其解决方案的关键在于引入“白化因果对齐”(Whitened Causal Alignment),该方法基于未嵌入协方差矩阵 Σ 构建正交化空间,并通过匹配谱随机化检验评估其有效性。实验发现,白化因果对齐与仅使用谱正则化无显著差异(p=0.95),但这一失败揭示了更深层现象:激活空间中的概念方向在谱尾区域呈现反集中(anti-concentration)特征,而静态的未嵌入行对比则集中在高方差方向(p<10−4)。进一步的分裂注入干预和词性标注探测表明,语法信息优先编码于高方差子空间,且Transformer可能在上下文化处理中将语义内容旋转至谱安静区(spectrally quiet regions),从而实现低语法扰动下的概念操控。
链接: https://arxiv.org/abs/2605.01609
作者: Pratyush Acharya,Nuraj Rimal,Habish Dhakal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 16 figures, 13 tables
Abstract:We test whether the causal inner product of \citetpark2024linear – defined by the unembedding covariance \Sigma – enables cross-lingual concept transport. Across 17 models and 4 language pairs, a matched-spectrum randomization test finds that Whitened Causal Alignment is indistinguishable from spectral regularization alone ( p = 0.95 ). However, this failure reveals a broader phenomenon: anti-concentration is observed in residual-stream difference-of-means vectors across five architecture families ( p 10^-33 ) and supported by SAE features (e.g., p = 4.5 \times 10^-19 ) and linear probes on Gemma and Llama. We discover a \emphdual geometry: activation-space concept directions anti-concentrate in the spectral tail, while static unembedding-row contrasts \emphconcentrate in high-variance directions ( p 10^-4 ). Split-injection causal interventions support the functional basis on Gemma and Llama (Cohen’s d up to 1.80 ), and POS-tag probing across 8 models shows syntax preferentially encodes in the high-variance subspace in 6 of 8 architectures ( p 0.013 ), with the Qwen~2.5 family showing a significant reversal consistent with architecture-specific spectral structure. These results suggest transformers may rotate semantic content into spectrally quiet regions during contextualized processing, encoding concepts where they can be manipulated with reduced grammatical disruption.
[AI-121] Evaluating Agent ic AI in the Wild: Failure Modes Drift Patterns and a Production Evaluation Framework
【速读】:该论文旨在解决现有大语言模型(Large Language Models, LLMs)评估框架在生产环境中对智能体系统(Agentic AI Systems)失效模式检测不足的问题。传统评估方法如HELM、MT-Bench、AgentBench和BIG-bench主要适用于受控的单次实验场景,无法捕捉持续运行中产生的复合决策错误、工具故障级联、非确定性输出漂移以及长周期任务缺乏真实标签等挑战。解决方案的关键在于提出PAEF(Production Agentic Evaluation Framework),这是一个五维评估框架,专为连续监控生产流量设计,并配备开源参考实现;其核心创新是识别出七类独特的生产级失效模式,并证明标准指标(如ROUGE、BERTScore、准确率/AUC及各类代理基准)在多数情况下无法及时或完全检测这些失效,从而推动评估范式从离线基准测试向实时、持续评估转变。
链接: https://arxiv.org/abs/2605.01604
作者: Mukund Pandey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 6 tables, 1 figure. Reference implementation: this https URL
Abstract:Existing evaluation frameworks for large language models – including HELM, MT-Bench, AgentBench, and BIG-bench – are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics – ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above – fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
[AI-122] Model Merging: Foundations and Algorithms
【速读】:该论文旨在解决传统深度学习中模型作为独立个体被训练、专用且易被替换的问题,提出以模型合并(model merging)为替代范式,通过直接在权重空间中融合多个独立训练的神经网络,实现能力的组合、复用与扩展。其核心解决方案包括:在单任务场景下,提出基于Frank-Wolfe优化的循环一致性合并算法C²M³,使多模型对齐至无参考的共享参数空间,从而赋予权重平均以语义意义;在多任务场景下,首次从理论层面将任务向量(task vector)建模为近似梯度,并揭示其低秩结构,进而引入任务奇异向量(Task Singular Vectors, TSV),用于压缩和减少干扰;进一步设计输入自适应路由方法MASS,利用TSV几何结构在推理时选择相关子空间;最终构建进化式合并框架MERGE³,结合项目反应理论(Item Response Theory)将评估成本降低达50倍,同时保持解的质量。这些工作共同奠定了模型合并的理论与算法基础。
链接: https://arxiv.org/abs/2605.01580
作者: Donato Crisostomi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: PhD thesis
Abstract:Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the original training data. The thesis considers two main regimes. In the single-task setting, where models share an objective but differ in initialization, we introduce C ^2 M ^3 , a cycle-consistent merging algorithm based on Frank-Wolfe optimization. C ^2 M ^3 aligns multiple networks into a shared, reference-free parameter space, making weight averaging meaningful without privileging any individual model. In the multi-task setting, where models are fine-tuned for different downstream tasks from a common pretrained initialization, we first develop a theoretical account of task vectors as approximate gradients. This explains both the effectiveness and the limitations of task arithmetic. Building on this view, we show that task vectors inherit the low-rank structure of gradients and introduce Task Singular Vectors (TSV), a decomposition that enables compression and interference reduction through TSV-Merge. We then present MASS, an input-adaptive routing method that uses TSV geometry to select task-relevant subspaces at inference time. Finally, we introduce MERGE ^3 , an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50 \times while preserving solution quality. Together, these contributions provide theoretical and algorithmic foundations for model merging, supporting a paradigm in which learned capabilities can be composed, reused, and extended across models. Comments: PhD thesis Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01580 [cs.LG] (or arXiv:2605.01580v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01580 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Donato Crisostomi [view email] [v1] Sat, 2 May 2026 19:06:35 UTC (43,947 KB)
[AI-123] Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling ACL2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在推理阶段如何实现高准确率的同时优化计算资源使用效率的问题。当前许多推理增强方法虽能提升模型预测性能,但往往忽视了计算成本,难以满足实际部署中的资源约束。解决方案的关键在于系统性地分析四种推理缩放策略——自洽性(self-consistency)、自精炼(self-refinement)、多智能体辩论(multi-agent debate)和混合智能体(mixture-of-agents)的计算性能权衡,并通过在MMLU-Pro和BBH两个推理基准上进行大规模参数配置实验(共34种配置、超100次评估),构建帕累托最优前沿以识别在给定计算预算下最具性价比的方法。研究发现,在相同计算预算下,辩论和混合智能体分别比自洽性提升1.3%和2.7%准确率;且多智能体策略在复杂任务中持续增益,而自洽性则更早饱和。进一步提出一个简洁的设计准则:当并行生成数量超过串行聚合次数时,混合智能体策略最为高效。
链接: https://arxiv.org/abs/2605.01566
作者: Florian Valentin Wunderlich,Lars Benedikt Kaesberg,Jan Philip Wahle,Terry Ruas,Bela Gipp
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at SRW at ACL 2026, long paper
Abstract:Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage. However, computational efficiency is key for real-world applications with resource constraints. We provide a systematic analysis of the inference scaling strategies self-consistency, self-refinement, multi-agent debate, and mixture-of-agents, to study their computational performance tradeoffs. We evaluate methods on two reasoning benchmarks (MMLU-Pro, BBH) and include extensive parameter configurations (e.g., scaling the number of parallel predictions, agents, and debate rounds) across different model sizes. Across 34 configurations and over 100 evaluations, we compute the Pareto-optimal front to select methods that achieve the best accuracy with the lowest computational budget. Notably, inference scaling improves accuracy by up to +7.1% points over chain-of-thought at the highest evaluated budgets (20x the CoT compute budget) on MMLU-Pro. With an equal computing budget, debate and mixture-of-agents outperform self-consistency by 1.3% and 2.7% points, respectively. While self-consistency saturates earlier, multi-agent gains persist, particularly on more complicated tasks. We identify a simple multi-agent design guideline: mixture-of-agents is most efficient when the number of parallel generations exceeds the number of sequential aggregations.
[AI-124] Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse
【速读】:该论文旨在解决传统需求复用框架(如OOMRAM)因依赖精确标识符匹配和固定模板而导致的适应性差的问题,同时克服大型语言模型(LLM)在生成需求时易产生结构无效或不一致组合的风险。其解决方案的关键在于提出一种神经符号多智能体系统,将需求复用重构为基于模型的需求获取(Model-Driven Elicitation)过程:其中LLM作为非确定性启发式策略用于遍历由形式化OOMRAM需求格(requirement lattice)表示的确定性领域模型,而一个确定性的符号验证器则在智能体循环中强制执行所有结构约束,从而从构造上消除幻觉式需求组合。该方法在两个应用族的自主基准测试中实现了100%的需求覆盖率和仅0.2%的约束违反率,确保生成规格始终结构有效并满足所有强制性领域要求。
链接: https://arxiv.org/abs/2605.01562
作者: Ahmed Ibrahim
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The Object-Oriented Method for Requirements Authoring and Management (OOMRAM) is a requirements reuse framework that relies on exact identifier matching and rigid templates, limiting its ability to adapt specifications across diverse contexts. While Large Language Models (LLMs) offer the flexibility to overcome this bottleneck, they introduce the risk of generating structurally invalid or inconsistent requirement combinations. To address this tension, we present a neuro-symbolic multi-agent system that re-conceptualizes requirements reuse as a \textbfModel-Driven Elicitation process. In this paradigm, an LLM serves as a \textbfnon-deterministic heuristic for traversing a \textbfdeterministic domain model represented by a formal OOMRAM requirement lattice. A deterministic, symbolic validator enforces all structural constraints within the agent loop, effectively eliminating hallucinated requirement combinations by construction. Evaluated on an autonomous benchmark across two application families, our system achieves 100% requirement coverage and a constraint-violation rate of only 0.2%. Although the F1-score against a single gold standard is moderate (0.47–0.51), every generated specification is structurally valid and satisfies all mandatory domain requirements. The model-agnostic implementation scales to larger lattices via subgraph navigation and provides transparent audit trails for regulatory compliance.
[AI-125] 6G Needs Agents : Toward Agent ic AI-Native Networks for Autonomous Intelligence
【速读】:该论文旨在解决当前第六代移动通信(6G)网络架构中因过度依赖优化导向的封闭回路控制而缺乏自主推理能力的问题,提出从传统优化范式向“智能体驱动的AI原生6G”(Agentic AI-Native 6G)范式的转变。其核心解决方案是构建一个四层架构:在确定性3GPP基础设施之上引入语义控制平面,通过基于大语言模型(Large Language Model, LLM)的智能体作为受策略约束的推理实体,在设备、边缘与核心域之间形成分布式多智能体协同机制;并通过实证研究发现,单一模型难以同时满足延迟、吞吐量和准确率要求,必须采用跨层级异构部署策略,并强调系统级优化(如量化带来的非均匀影响)的重要性,从而为实现可扩展、可信且具备自推理能力的未来6G网络提供可行路径。
链接: https://arxiv.org/abs/2605.01546
作者: Mohamed Amine Ferrag,Abderrahmane Lakas,Merouane Debbah
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:Sixth-generation (6G) networks are increasingly envisioned as AI-native infrastructures integrating communication, sensing, and computing into a unified fabric. However, existing approaches remain largely optimization-centric, relying on closed-loop control with limited reasoning capability. In this paper, we argue for a paradigm shift toward Agentic AI-Native 6G, in which Large Language Model (LLM)-based agents operate as bounded, policy-governed reasoning entities within a semantic control plane layered above deterministic 3GPP infrastructure. We propose a four-layer architecture that integrates deterministic network infrastructure, semantic abstraction of intent and context, hierarchical reasoning, and a distributed multi-agent fabric spanning device, edge, and core domains. To assess feasibility, we develop a proof-of-concept agentic reasoning and orchestration framework and conduct an extensive empirical study using a domain-specific 6G benchmark under realistic deployment constraints. Our results reveal a fundamental tradeoff between reasoning capability and system efficiency, showing that no single model simultaneously satisfies latency, throughput, and accuracy requirements. Instead, heterogeneous deployment of LLM agents across the device–edge–core continuum is necessary to balance these constraints. We further demonstrate that quantization introduces non-uniform effects across models, reinforcing the need for system-level optimization rather than model-level compression alone. These findings establish agentic intelligence as a viable architectural direction for 6G and highlight key challenges in achieving scalable, trustworthy, and self-reasoning networks. All experimental results and evaluation scripts are publicly available to support reproducibility.
[AI-126] Mesh Based Simulations with Spatial and Temporal awareness
【速读】:该论文旨在解决当前基于机器学习的计算流体力学(Computational Fluid Dynamics, CFD)代理模型在训练范式上的局限性问题,特别是现有方法仍沿用节点级监督和显式欧拉时间步进等过时假设,未能充分考虑偏微分方程数值求解中固有的刚性动力学和局部通量连续性特征。解决方案的关键在于提出一个统一框架,融合几何深度学习与严谨数值分析:(1) 多节点预测(Multi Node Prediction),通过 stencil 级目标强制空间导数一致性;(2) 时间校正(Temporal Correction),以时间交叉注意力机制替代不稳定的显式格式,引入预测-校正策略;(3) 几何归纳偏置(Geometric Inductive Biases),利用三维旋转位置编码(3D Rotary Positional Embeddings, RoPE)捕捉非结构化网格中的旋转对称性。该框架显著提升了长期滚动预测的精度与稳定性,并实现了对未见子任务(如壁面剪切应力或压力预测)的良好泛化能力。
链接: https://arxiv.org/abs/2605.01542
作者: Paul Garnier,Vincent Lannelongue,Elie Hachem
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:
Abstract:Machine Learning surrogates for Computational Fluid Dynamics (CFD), particularly Graph Neural Networks (GNNs) and Transformers, have become a new important approach for accelerating physics simulations. However, we identify a critical bottleneck in the field: while architectures have advanced significantly, the common underlying training paradigms remain bound to naive assumptions, such as node-wise supervision and explicit Euler time-stepping. These legacy choices ignore the stiff dynamics and local flux continuity inherent to numerous partial differential equations resolution methods, such as Finite Element, Difference, or Volume (FEM). In this work, we propose a unified framework to bridge the gap between geometric deep learning and rigorous numerical analysis. We introduce three key innovations: (1) Multi Node Prediction, a stencil-level objective that predicts field values for a node’s full local topology, enforcing spatial derivative consistency; (2) Temporal Correction, replacing unstable explicit schemes with a predictor-corrector via temporal Cross-Attention; and (3) Geometric Inductive Biases, leveraging 3D Rotary Positional Embeddings (RoPE) to robustly capture rotational symmetries in unstructured meshes. We evaluate this framework across three architectures (MeshGraphNet, Transolver, and a Transformer) on diverse physics datasets. Our approach yields consistent improvements in accuracy and stability, particularly in long-horizon rollouts, while producing latent representations that generalize to unseen subtasks such as Wall Shear Stress or Pressure prediction. Code is available at this https URL.
[AI-127] Protein-Conditioned Multi-Objective Reinforcement Learning for Full-Length mRNA Design
【速读】:该论文旨在解决治疗性mRNA(therapeutic mRNA)设计中如何平衡稳定性、翻译效率和免疫安全性的问题,传统方法难以同时优化多个生物目标。其解决方案的关键在于提出ProMORNA框架,该框架基于超过600万对天然蛋白-mRNA数据训练了一个BART风格的编码器-解码器模型,并引入多目标组相对策略优化(Multi-Objective Group Relative Policy Optimization, MO-GRPO),在统一框架下协同优化多个生物学目标。通过在未见靶标(如萤火虫荧光素酶)上的验证,ProMORNA在预测半衰期与翻译效率的Pareto前沿上优于标准监督基线,且功能评分更高,证明了多目标强化学习在全长度mRNA从头设计中的可行性。
链接: https://arxiv.org/abs/2605.01513
作者: Zixi Shao,Tao Wang,Yibei Xiao,Tianyi Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Designing therapeutic messenger RNA (mRNA) requires creating full-length transcripts that carefully balance stability, translation efficiency, and immune safety. To address this challenge, we propose ProMORNA, a multi-objective generation framework that produces complete mRNA transcripts \textitde novo directly from a target protein sequence. Our approach begins by training a BART-style encoder-decoder model on over 6 million natural protein-mRNA pairs. We then introduce Multi-Objective Group Relative Policy Optimization (MO-GRPO) to simultaneously optimize for various biological objectives in a unified way. As a case study, we evaluated ProMORNA on the widely used firefly luciferase target, excluding it from both our supervised training data and the prompt pool. The results indicate that ProMORNA improves the \textitin silico Pareto frontier for predicted half-life and translation efficiency relative to standard supervised baselines. Additionally, it achieves higher predicted functional scores than a state-of-the-art baseline under the same evaluation pipeline. These computational findings demonstrate the feasibility of using multi-objective reinforcement learning for full-length mRNA design on unseen targets.
[AI-128] MILD: Mediator Agent System with Bidirectional Perception and Multi-Layered Alignment for Human-Vehicle Collaboration
【速读】:该论文旨在解决部分自动驾驶(Partial Driving Automation)导致人类驾驶员认知负荷增加的问题,其根源在于人车之间缺乏透明的意图理解与决策逻辑共享,以及自动化系统对驾驶员动态状态和偏好感知不足,进而造成协同感知缺失与交互协调失败。解决方案的关键在于提出“环中中介驾驶”(Mediator-in-the-Loop-Driving, MILD)系统,该系统基于代理架构(Agentic System Architecture),将人类角色从被动监督者转变为积极管理者,并通过感知代理(Perception Agent)实现舱内与舱外联合理解,结合轻量级策略代理(Strategy Agent)生成合规且可解释的动作建议;同时引入证据与约束加权策略优化(Evidence- and Constraint-weighted Policy Optimization, ECPO),利用自动验证器引导代理行为在准确性、结构完整性、证据支持性和约束合规性方面达到最优,并通过检索增强生成模块动态整合交通法规、限速建议及驾驶员偏好,从而实现可审计、对齐人类价值观的智能协同驾驶。
链接: https://arxiv.org/abs/2605.01507
作者: Jiyao Wang,Yunbiao Wang,Yubo Jiao,Xiao Yang,Dengbo He,Sasan Jafarnejad,Luis Miranda-Moreno,Raphael Frank,Jiangbo Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Prior studies report that partial driving automation can increase the cognitive demands on human drivers. This effect largely arises from human drivers’ lack of transparent insight into the vehicle’s intentions and decision logic, as well as from automated systems’ limited awareness of the driver’s dynamic state and preferences. This bidirectional misalignment undermines shared situational awareness and exacerbates coordination failures in human-vehicle interaction. To address these limitations, we argue for a paradigm shift that elevates the human role from passive supervisor to active manager. We introduce the Mediator-in-the-Loop-Driving (MILD) system, based on an agentic system architecture to facilitate synergistic human-vehicle collaboration. MILD integrates a perception agent for joint in-cabin and out-of-cabin understanding with a lightweight strategy agent that generates compliant and explainable action suggestions. To ensure these strategies are strictly aligned with safety regulations and human values, we develop Evidence- and Constraint-weighted Policy Optimization (ECPO). ECPO leverages automatic validators to steer the agent toward behaviors that are not only accurate but also structurally complete, substantiated by evidence, and free from constraint violations. Furthermore, a retrieval-augmented generation module dynamically incorporates constraints from traffic regulations, speed recommendations, and driver preferences into the decision loop. Field experiments across three open datasets demonstrate that MILD consistently outperforms baselines in both perception accuracy and strategy quality under auditable offline metrics, and yields higher human-rated policy adequacy, comfort, and explanation than baselines. This work offers a practical pathway for building auditable and aligned agents for human-vehicle collaborative driving.
[AI-129] MAP-Law: Coverag e-Driven Retrieval Control for Multi-Turn Legal Consultation
【速读】:该论文针对多轮法律咨询中检索控制不精准的问题展开研究,旨在解决现有方法依赖固定检索深度或粗粒度启发式策略导致的两大缺陷:一是关键法律要素支持不足,二是过度检索引发上下文负担加重和答案焦点弱化。其解决方案的核心在于提出MAP-Law框架,通过构建包含问题节点、法律要素节点与证据节点的联合结构化状态空间,将检索过程建模为受控的决策流程;在每轮检索后,Agent基于Element Coverage(要素覆盖率)、Evidence Coverage(证据覆盖率)和Marginal Gain(边际收益)三个可解释指标动态判断是否继续检索、调整方向或生成最终答复,从而将停止决策从固定超参数转化为与法律论证结构对齐的可解释、可审计决策机制。
链接: https://arxiv.org/abs/2605.01486
作者: Qinchuan Cheng,Ruixuan Xie,Jiaqi Liu,Xiaoya Yuan,Yuxin Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Legal consultation is a high-stakes, knowledge-intensive task that requires agents to identify relevant legal issues, retrieve authoritative support, and determine when evidence is sufficient for a recommendation. Although retrieval-augmented generation has improved grounding in legal question answering, many multi-turn legal agents still rely on fixed retrieval depth or coarse heuristic control. This often leads to either insufficient support for key legal elements or excessive retrieval that increases context burden and weakens answer focus. We propose MAP-Law, a coverage-driven framework for retrieval control in multi-turn legal consultation. MAP-Law models consultation as a controlled retrieval process over a joint structured state consisting of issue nodes, legal element nodes, and evidence nodes. After each retrieval round, the agent computes Element Coverage, Evidence Coverage, and Marginal Gain, and uses these signals to decide whether to continue retrieval, redirect the search, or generate the final response. In this way, MAP-Law turns stopping from a fixed hyperparameter into an interpretable and auditable decision aligned with legal argumentative structure. Experiments on a self-constructed dataset of 50 cases across eight labor-law scenarios show that MAP-Law with DeepSeek as the action selector achieves an Element Coverage of 0.860 using only 2.9 retrieval rounds and 5.8 evidence pieces on average. Compared with a fixed seven-round baseline, it reduces evidence volume by over 80% and retrieval rounds by 58%. Ablation results further confirm the independent contributions of coverage-driven stopping, joint graph representation, and LLM-based action selection. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01486 [cs.AI] (or arXiv:2605.01486v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01486 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-130] Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
【速读】:该论文旨在解决多跳事实验证(Multi-Hop Fact Verification, MHFV)中大型语言模型(Large Language Models, LLMs)因幻觉和逻辑链断裂导致的推理可靠性问题。现有方法虽通过思维链(Chain-of-Thought, CoT)提升透明度,但缺乏对证据与主张之间因果依赖关系的显式建模。解决方案的关键在于引入结构因果模型(Structural Causal Model, SCM),将验证过程视为一个构造性的因果推断任务,并发现推理链长度与准确率呈“倒U型”关系,表明过度结构复杂性会降低性能;为此提出基于规则的强化学习策略,采用组相对策略优化(Group Relative Policy Optimization, GRPO),动态平衡结构深度与简洁性,从而在HoVer和EX-FEVER数据集上显著优于现有最先进基线,提供了一种可靠且可解释的复杂事实验证方法。
链接: https://arxiv.org/abs/2605.01482
作者: Yunhan Bu,Quan Zhang,Huaping Zhang,Guotong Geng,Chunxiao Gao,Askar Hamdulla,Juan Wang,Qiuchi Li,Baohua Zhang,Shuai Lei,Yunbo Cao,Zhunchen Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs) which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an “inverted U-shaped” correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
[AI-131] Practical Limits of Autonomous Test Repair: A Multi-Agent Case Study with LLM -Driven Discovery and Self-Correction
【速读】:该论文旨在解决大规模企业级应用中UI测试套件维护的可靠性与成本问题,尤其是在高度动态的用户界面(UI)环境中,传统人工驱动的测试方法难以持续保障覆盖率和稳定性。其解决方案的核心在于构建一个基于大语言模型(Large Language Model, LLM)与LangGraph编排、Playwright执行引擎及RAG知识库协同工作的多智能体自主测试系统,通过约束性自治机制实现从初始无目标探索到特征发现、动态覆盖扩展、失败测试自动修复的闭环流程。关键创新在于引入显式约束边界与人类监督机制,避免完全自主带来的语义错误与虚假收敛现象,从而在保持高自动化水平的同时确保测试结果的语义正确性和操作可信度。
链接: https://arxiv.org/abs/2605.01471
作者: Hyukjoo Lee
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Industrial case study; submitted for review
Abstract:Maintaining reliable UI test suites in large-scale enterprise applications is a persistent and costly challenge. We present an industrial case study of a multi-agent autonomous testing system evaluated using anonymized execution data from a production-like enterprise UI testing prototype. The application features several hundred dynamic UI elements per screen. Built on a large language model with LangGraph orchestration, Playwright execution, and a RAG knowledge base, the system evolves from human-directed testing toward High-autonomy feature discovery and test execution: given no explicit test targets, it discovers over 100 testable features across 10 UI screens, dynamically expands coverage by an additional 15–30 features through runtime DOM analysis, and iteratively repairs failing tests without human intervention. We analyzed 300 consecutive autonomous execution reports encompassing 636 individual test-case executions across 10 distinct scenario families. The system achieved a 70% repair convergence rate at the scenario-family level, with a mean of 3.4 repair iterations to convergence. However, only 10% of scenario families succeeded on first attempt, 38% of reports failed to produce any executable test artifact, and we documented concrete instances of assertion weakening and test-case deletion used as workaround mechanisms to achieve superficial convergence. Our findings show that unrestricted autonomy leads to unstable and often misleading outcomes, while constrained autonomy transforms such systems into operationally viable workflows. Rather than advocating full autonomy, our findings suggest that reliable autonomous testing in enterprise-scale settings requires explicit constraints, validation boundaries, and human oversight to preserve semantic correctness and operational trustworthiness. Comments: Industrial case study; submitted for review Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01471 [cs.SE] (or arXiv:2605.01471v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.01471 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-132] CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
【速读】:该论文旨在解决离线多智能体强化学习(Offline Multi-Agent Reinforcement Learning, MARL)中生成式模型在少步推理(few-step inference)时难以保持智能体间协调性的问题。现有方法要么通过蒸馏联合教师模型到独立学生模型,要么对每个智能体独立应用平均速度场,导致协调能力下降。其解决方案的关键在于提出协同少步流模型(Coordinated few-step Flow, CoFlow),该架构结合了协同速度注意力机制(Coordinated Velocity Attention, CVA)与自适应协调门控机制(Adaptive Coordination Gating),使速度场天然耦合以保留跨智能体的协作关系;同时引入有限差分一致性代理损失,替代高内存消耗的雅可比-向量乘积反向传播,仅需两次停止梯度的前向传播即可实现高效训练。实验表明,CoFlow在60种配置下均达到或超越主流基线模型,在1–3步去噪推理中实现最优协调质量。
链接: https://arxiv.org/abs/2605.01457
作者: Guowei Zou,Haitao Wang,Beiwen Zhang,Boning Zhang,Hejun Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 13 figures, 9 tables. Project page: this https URL
Abstract:Generative models have emerged as a major paradigm for offline multi-agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few-step accelerations either distill a joint teacher into independent students or apply averaged velocities independently per agent, suggesting that few-step inference requires sacrificing inter-agent coordination. We show this trade-off is not necessary: single-pass multi-agent generation can preserve coordination when the velocity field is natively joint-coupled. We propose Coordinated few-step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite-difference consistency surrogate further replaces memory-prohibitive Jacobian-vector product backpropagation through the averaged velocity field with two stop-gradient forward passes. Across 60 configurations spanning MPE, MA-MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian / value-based, transformer, diffusion, and prior flow baselines on episodic return. Three independent coordination probes confirm that the gains flow through inter-agent coordination rather than per-agent capacity. A denoising-step sweep shows that single-pass inference suffices on every configuration. CoFlow reaches state-of-the-art coordination quality in 1-3 denoising steps under both centralized and decentralized execution. Project page: this https URL.
[AI-133] VisInject: Disruption != Injection – A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
【速读】:该论文旨在解决当前对齐的多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对通用对抗攻击时,现有评估指标未能区分“模型输出被扰动”与“攻击者指定目标概念被精确注入”这两个本质不同的现象的问题。其关键解决方案是提出一种双轴评估框架:一方面使用Ratcliff-Obershelp漂移分数量化模型响应的程序性扰动程度(Influence),另一方面引入四分类等级(无/弱/部分/确认)衡量攻击目标是否被准确注入(Precise Injection)。实验表明,在 L∞ 为 16/255 的约束下,尽管多数样本显示出显著的程序性扰动(66.4%),但真正实现非零级注入的比例极低(仅0.756%),且完全匹配的精准注入更是罕见(仅0.030%),揭示了视觉模态作为提示注入通道的实际脆弱性远低于传统报告所显示的水平。
链接: https://arxiv.org/abs/2605.01449
作者: Pang Liu,Yingjie Lao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Universal adversarial attacks on aligned multimodal large language models are increasingly reported with attack success rates in the 60-80% range, suggesting the visual modality is highly vulnerable to imperceptible perturbations as a prompt-injection channel. We argue that this number conflates two distinct events: (i) the model’s output was perturbed (Influence), and (ii) the attacker’s chosen target concept was actually emitted (Precise Injection). We compose two existing techniques – Universal Adversarial Attack and AnyAttack – under an L_inf budget of 16/255, and we add a dual-axis evaluation: a deterministic Ratcliff-Obershelp drift score for Influence (programmatic baseline) plus a 4-tier ordinal categorical none/weak/partial/confirmed for Precise Injection. The judge is DeepSeek-V4-Pro in thinking mode, calibrated against Claude Opus 4.7 with Cohen’s \kappa = 0.77 on the injection axis (substantial agreement); the entire 4475-entry SHA-256 input cache ships with the dataset so reviewers can re-derive paper numbers bit-exact without an API key. Across 6615 pairs over four open VLMs, seven attack prompts, and seven test images, the two axes diverge by roughly 90 \times : 66.4% of pairs are programmatically disturbed (LLM-judged 46.6% at the substantial-or-complete tier), but only 0.756% (50/6615) reach any non-none injection tier and only 0.030% (2/6615) verbatim. The few injections that do land cluster on screenshot- or document-style carriers whose semantics already invite text transcription. BLIP-2 shows \emphzero detectable drift at L_inf = 16/255 across all 2205 pairs even when used as a Stage-1 surrogate. We release the full dataset – 21 universal images, 147 adversarial photos, 6,615 response pairs, the v3 dual-axis judge results, and the cache at this http URL.
[AI-134] Rethinking Explanations: Formalizing Contrast in Description Logics
【速读】:该论文旨在解决现有描述逻辑(Description Logic, DL)知识库解释方法在用户中心性方面的不足问题,即当前的推理解释形式化(如Justifications和溯因推理)仅能指出某个断言(axiom)成立或缺失的推理路径,却无法考虑用户的认知需求、理解水平或先验知识。为弥补这一缺陷,论文提出**对比解释(contrastive explanations)**作为解决方案,其核心在于回答“为何一个断言P为真,而非另一个相似断言Q”(即“why P instead of Q”),从而揭示两个可能断言之间的差异性。该方法通过定义对比问题的形式基础,并在DL EL和ALC中探索其性质,将对比焦点置于个体(如C(x) vs. C(y))或概念层面(如C(x) vs. D(x)),使解释更具针对性与可理解性,显著提升了对用户实际困惑的响应能力。
链接: https://arxiv.org/abs/2605.01442
作者: Yasir Mahmood,Arnab Sharma,Axel-Cyrille Ngonga Ngomo,Balram Tiwari
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic (math.LO)
备注: Pre-print to the paper accepted at XAI World conference, 2024 ( this https URL )
Abstract:There has been a growing interest in explaining entailments over description logic (DL) knowledge bases. The existing explanation formalisms focus on justifications to explain true axioms, and abductive reasoning to explain missing axioms in a knowledge base. However, these formalisms only point out the reasoning steps behind a (missing) entailment and lack a user-centered approach as they do not consider an inquirer’s needs, level of understanding, or prior knowledge. We propose contrastive explanations, aiming at answering “why an axiom P (fact) is true instead of another axiom Q (foil)” over description logic knowledge bases. The motivation arises from the observation that when a user discovers that P has occurred, they are often surprised because they anticipated the occurrence of another similar event Q. Furthermore, individual explanations for “why P” and “why not Q” are unsatisfactory since a user expects to see the difference between P and Q. In this work, we first present formal foundations of contrasting questions and then define contrastive explanations within description logics. To this end, facts include ABox assertions of the form C(x) for a concept C and individual x. Possible foils for such facts are assertions C(y) (contrasting against an individual y), or D(x) (contrasting against a concept D). Additionally, we explore the properties of contrastive explanations in the DL EL and ALC. We also provide an implementation of our definition and an experimental evaluation on KBs of varying sizes.
[AI-135] SCALE-LoRA: Auditing Post-Retrieval LoRA Composition with Residual Merging and View Reliability
【速读】:该论文旨在解决开放池(open-pool)中低秩适配器(LoRA)的高效复用问题,即如何在仅有少量支持样本的情况下,从已积累的LoRA适配器库中自动选择并组合相关适配器以完成新任务。传统方法虽能实现任务级适配器组合与实例级动态选择,但无法保证参数更新的兼容性或输出可靠性。其解决方案的关键在于提出Sparse-Composition Agreement Layer (SCALE),包含两个核心组件:一是Layer-Adaptive Sparse Residual Composition (LASRC),通过保留线性锚点并块级残差化适配器更新方向来缓解合并干扰;二是可靠性分析层,将稀疏组合视角间的不一致视为可观测的不确定性信号,并基于显式路径成本比较不同策略下的协议一致性、支持损失代理选择和Oracle冗余空间。实验表明,LASRC在固定检索条件下提供单视图增益,而SCALE-support作为无需查询标签的3.0倍可靠性分析变体,在多个基准测试中展现出稳健性能。
链接: https://arxiv.org/abs/2605.01429
作者: Shuaipeng Zhou,Yu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 1 figure, 6 tables
Abstract:Libraries of Low-Rank Adaptation (LoRA) adapters are becoming a practical by-product of parameter-efficient adaptation. Once such adapters accumulate, a natural question is no longer how to train one adapter for one task, but how to reuse an open pool of adapters for a new task given only a small support set. Prior work has shown that LoRA modules can be composed at the task level and dynamically selected at the instance level. However, open-pool LoRA reuse is not automatic: retrieving relevant adapters does not guarantee that their parameter updates are compatible, and composing adapters does not guarantee reliable outputs. We introduce the Sparse-Composition Agreement Layer (SCALE), a post-retrieval audit and composition framework for open-pool LoRA reuse. SCALE contains a deployable 1.0* merge path, Layer-Adaptive Sparse Residual Composition (LASRC), and a higher-cost reliability-analysis layer for multi-view disagreement. LASRC addresses merge interference by preserving a linear anchor while residualizing block-wise adapter update directions. The reliability layer treats disagreement among sparse composition views as an observable uncertainty signal and compares agreement, support-loss proxy selection, and oracle headroom under explicit path cost. In matched FLAN-T5-Large, BIG-Bench Hard (BBH), and 97-LoRA experiments, LASRC gives a directional single-view gain under fixed retrieval, while SCALE-support is reported as a query-label-free 3.0* reliability-analysis variant rather than as a calibrated or throughput-equivalent selector. Protocol-distinct BBH-8 validation shows the same qualitative trend on three decoder-only backbones. Detailed scores, paired audits, and path-cost records are reported in the experimental section. Comments: 12 pages, 1 figure, 6 tables Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.01429 [cs.AI] (or arXiv:2605.01429v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01429 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-136] Quantifying Multimodal Capabilities: Formal Generalization Guarantees in Pairwise Metric Learning
【速读】:该论文旨在解决多模态学习中因模态数据不完整或冗余而导致的泛化性能下降问题,尤其关注模态选择与算法性能之间的理论关系。其解决方案的关键在于通过细粒度的理论分析,揭示不同模态子集对应函数类之间的层次关系,并量化学习映射与真实映射间的差异;进而基于成对复杂度的严格分析,推导出新的泛化误差上界和下界,证明引入细粒度模态特征可通过增强模态互补性来降低假设空间的复杂度,从而提升模型收敛速度与准确性。
链接: https://arxiv.org/abs/2605.01424
作者: Richeng Zhou,Xuelin Zhang,Liyuan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal learning leverages the integration of diverse data modalities to enhance performance in complex tasks. Yet, it frequently encounters incomplete or redundant modality data in real-world scenarios. This paper presents a fine-grained theoretical analysis of the generalization properties of multimodal metric learning models, addressing critical gaps in understanding the relationship between modality selection and algorithmic performance. We establish hierarchical relationships between function classes corresponding to different modality subsets and quantify the discrepancy between learned mappings and ground truth. Through rigorous analysis of pairwise complexity within the multimodal learning framework, we derive novel generalization error bounds that reveal the joint impact of modality quantity and granularity on model performance. Our theoretical findings on both upper and lower bounds demonstrate that incorporating fine-grained modality features reduces the complexity of the hypothesis space by enhancing modality complementarity. This work offers both theoretical foundations and practical implications for improving convergence rates and accuracy in multimodal learning systems.
[AI-137] Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration Redistribution and Optimization Governance
【速读】:该论文旨在解决大规模学习系统中能力分布不均的问题,即所谓的人工锯齿智能(Artificial Jagged Intelligence, AJI)现象——模型在某些局部任务上表现出强大能力,而在其他领域则表现脆弱或不足。其核心问题是:为何在有限优化资源下,模型的能力提升会呈现出非均匀、结构性的分布?解决方案的关键在于建立一个形式化的理论框架,将训练过程建模为一种有限预算下的优化压力分配机制,其中更新能量(gradient-driven update energy)在参数空间中沿不同能力相关方向进行分配。论文指出,能力锯齿性源于目标函数结构的各向异性、数据几何特性以及表征耦合关系,而非单一的“智能”指标;通过定义能力增益、优化能量占比和锯齿度等概念,并证明累积更新能量集中会导致能力增益分散的下界,从而揭示了优先发展某一能力必然带来其他能力的机会成本,除非存在正向耦合或共享结构来抵消这种代价。这一理论进一步提出能量方差正则化与辅助结构目标等干预手段,作为重塑优化场的方法,实现了对能力形成过程的可预测与可控调节。
链接: https://arxiv.org/abs/2605.01420
作者: Wesley Shu,Peng Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Jagged Intelligence (AJI) denotes a recurring pattern in which large learning systems exhibit strong local capabilities while remaining weak or brittle in other domains. This paper develops a formal theory of AJI as uneven allocation of optimization pressure. We model training as a finite-budget process that distributes gradient-driven update energy across capability-relevant directions in parameter space. In this model, jagged capability profiles arise from anisotropic objective structure, data geometry, and representational coupling rather than from a single scalar quantity called intelligence. The paper defines capability gain, optimization energy share, and jaggedness, then proves that persistent concentration of cumulative update energy yields lower bounds on dispersion in capability gains. A finite-budget tradeoff theorem shows why prioritizing one capability can impose opportunity costs on others unless positive coupling or shared structure offsets the cost. The analysis also studies redistribution mechanisms, including energy-variance regularization and auxiliary structural objectives, as interventions that reshape the optimization field. The resulting framework links uneven emergence, training architecture, and optimization governance. It predicts that early concentration of update energy should forecast later capability jaggedness; that scaling under a narrow objective need not eliminate anisotropy; and that explicitly funded auxiliary objectives can revive neglected capabilities. AJI is therefore not merely a descriptive label for uneven model behavior, but a testable theory of how finite optimization resources produce concentrated, delayed, and structurally uneven capability formation. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01420 [cs.AI] (or arXiv:2605.01420v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01420 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-138] meTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization
【速读】:该论文旨在解决时间序列生成模型在时间粒度(temporal granularity)控制上的局限性问题,即现有模型通常无法根据用户需求灵活调整生成结果的时间分辨率,只能输出固定粒度的时序数据。为实现真正由用户驱动的时间序列生成,作者提出TimeTok框架,其核心创新在于一种分层标记化策略(hierarchical tokenization strategy),该策略将时间序列映射为从粗到细的有序标记序列,并通过跨粒度层级的自回归生成过程生成标记块,最终解码为连续时间序列。这种设计使得模型能够在单一框架内实现任意目标粒度的时间序列生成(Granularity-Controllable Time-Series Generation, GC-TSG),并通过控制生成的标记块数量显式调控输出细节,从而突破传统模型对固定粒度的依赖。
链接: https://arxiv.org/abs/2605.01418
作者: Seokhyun Lee,Jaeho Kim,Changjun Oh,Mihaela van der Schaar,Changhee Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Time-series generative models often lack control over temporal granularity, forcing users to accept whatever granularity the model produces. To enable truly user-driven generation, we introduce TimeTok, a unified framework for Granularity-Controllable Time-Series Generation (GC-TSG), which generates time series at any target granularity from any coarser input (e.g., rough sketches) or from scratch. At the core of TimeTok is a hierarchical tokenization strategy that maps time series into an ordered sequence of tokens, from coarse to fine temporal granularity. Our autoregressive generation process operates across these granularity levels, producing token blocks that are decoded back into continuous time series. This design naturally enables GC-TSG - including standard generation - within a single framework, where controlling the number of token blocks provides explicit control over output detail. Experiments show that TimeTok excels at GC-TSG tasks while achieving state-of-the-art performance in standard generation. Furthermore, we showcase TimeTok’s potential as a foundational tokenizer by training on multiple datasets with heterogeneous temporal granularities, verifying strong transferability that consistently outperforms models trained on individual datasets. To our knowledge, this is the first unified framework that covers the full generative spectrum for time series, offering a valuable foundation for models that benefit from diverse temporal granularities.
[AI-139] AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries
【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)部署摩擦的降低,传统安全机制难以应对由决策密度上升引发的系统性不可逆风险。论文指出,AI 的低边际成本复制与嵌入能力使得决策能量(decision-energy)在高效节点中集中,从而可能引发即使局部错误率低也导致全局不可逆损失的风险。解决方案的关键在于提出“边界稳定定理”——即安全不依赖于证明高级系统始终正确,而是通过制度与技术设计,防止单个高效率节点释放不可逆权力;具体而言,需明确界定并强化三大主权边界:不可逆决策权、物理资源动员权和自我扩展权,并构建分层控制、授权机制及外部可审查限制,将 AI 安全从单一对齐问题拓展为涵盖对齐、安全工程、组织经济学与制度设计的系统性框架。
链接: https://arxiv.org/abs/2605.01415
作者: Wesley Shu,Peng Wei
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Recent AI systems compress the distance between capability growth and capability deployment. Earlier high-risk technologies were slowed by capital intensity, physical bottlenecks, organizational inertia, and specialized supply chains. By contrast, AI capabilities can be copied, invoked, embedded in workflows, and scaled across institutions at low marginal cost. This paper argues that declining deployment friction changes the safety problem at its root. Safety is not only local output correctness or preference alignment, but the control of irreversibility under rising decision density. The paper formalizes this claim through decision-energy density: the rate-weighted capacity of a node to generate, evaluate, select, and execute consequential decisions. It then identifies three sovereignty boundaries that determine whether AI remains an amplifier within a human-governed system or becomes a de facto control center: irreversible decision authority, physical resource mobilization authority, and self-expansion authority. The model shows how efficiency pressure, path dependence, scale feedback, and weak boundary constraints concentrate decision-energy in the most efficient node. This concentration can diffuse responsibility and raise the probability of irreversible system-level loss even when local per-action error rates remain low. The main result is a boundary stabilization theorem. It shows that safety need not require proving that advanced systems are always correct. Instead, it requires institutional and technical designs that prevent irreversible power from being released by a single high-efficiency node. The paper reframes AI safety as layered control, authorization, and externally reviewable limits, linking alignment, security engineering, organizational economics, and institutional design. Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2605.01415 [cs.AI] (or arXiv:2605.01415v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01415 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-140] AMSnet-q: Unsupervised Circuit Identification and Performance Labeling for AMS Circuits
【速读】:该论文旨在解决模拟与混合信号(Analog and Mixed-Signal, AMS)电路设计中依赖人工标注数据的问题,即当前AI驱动的自动化工具严重受限于需由领域专家手动标注功能和性能信息的数据集构建流程,而现有大语言模型(Large Language Models, LLMs)和视觉模型无法自动完成此类标注任务。其解决方案的关键在于提出AMSnet-q,一个完全自动化、无监督的流水线框架,通过将原理图图像直接转化为带标签的AMS电路数据库,实现了从原理图到网表转换、拓扑感知测试平台生成以及基于仿真的尺寸验证的全流程自动化,从而无需人工介入即可客观判定电路功能,显著降低了人力成本并提升了数据库构建的可扩展性与客观性。
链接: https://arxiv.org/abs/2605.01404
作者: Ze Zhang,Junzhuo Zhou,Yichen Shi,Zhuofu Tao,Rui Ji,Zhiping Yu,Quan Chen,Ting-Jung Lin,Lei He
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Analog and mixed-signal (AMS) circuit design remains heavily reliant on expert knowledge. While recent AI-driven automation tools can generate candidate topologies, they critically depend on manually curated datasets with functional and performance annotations – a requirement that current large language models (LLMs) and vision models cannot automate. Existing approaches still require domain experts to manually interpret circuit functionality. We present AMSnet-q, a fully automated, unsupervised pipeline that eliminates human-in-the-loop annotation by converting schematic images directly into a labeled AMS circuit database. Unlike prior work that stops at netlist extraction, our framework automates the complete verification loop: it performs schematic-to-netlist conversion, topology-aware testbench generation, and simulation-based sizing validation to objectively determine circuit functionality. Validated in 28 nm technology, AMSnet-q processed 739 schematics from the AMSnet 1.0 dataset, automatically constructing a repository of 4 circuit classes, 105 distinct topologies, and 89,789 labeled device configurations. By decoupling human effort from dataset volume and reducing the workload to a one-time testbench template per circuit class, AMSnet-q enables scalable, objective, and fully automated AMS database construction. Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01404 [cs.AR] (or arXiv:2605.01404v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2605.01404 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-141] LiveFMBench: Unveiling the Power and Limits of Agent ic Workflows in Specification Generation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在自动为 C 程序生成形式化规范(formal specification)时存在的准确性与可靠性问题,特别是针对当前评估方法可能因数据污染和模型欺骗行为而高估性能的缺陷。其关键解决方案是构建了一个持续演化的基准测试集 LiveFMBench,包含 630 个使用 ACSL(ANSI/ISO C Specification Language)标注的 C 程序,并特别设计了 360 个新收集案例以减少数据泄露风险;在此基础上系统评估了直接提示、推理增强(thinking mode)、代理流水线(agentic pipeline)等策略,并通过细粒度失败分析揭示了模型的主要错误类型(如循环不变量错误),从而更真实地反映 LLM 和代理在形式化规范生成中的实际能力边界。
链接: https://arxiv.org/abs/2605.01394
作者: Dong Xu,Jialun Cao,Guozhao Mo,Junjie Hu,Cheng Wen,Hongyu Lin,Xianpei Han,Shengchao Qin,Cong Tian,Shing-Chi Cheung,Le Sun,Yaojie Lu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, their true capabilities and failure modes remain unclear. We present the first systematic and contamination-aware study of LLM- and agent-based formal specification generation for C programs. We introduce LiveFMBench, a continuously evolving benchmark of 630 ACSL (ANSI/ISO C Specification Language)-annotated C programs, including 360 newly collected cases designed to mitigate data leakage. Using this benchmark, we evaluate direct prompting with different sampling sizes, reasoning-enabled (thinking mode) inference, the agentic pipeline, and perform a fine-grained failure analysis. Experimental results reveal that naive evaluation substantially overestimates performance because models under direct prompting may exhibit unfaithful behaviors, such as deceiving automated provers or ignoring code-context constraints; after excluding such cases, the true specification generation accuracy drops by approximately 20%. We further find that both increased sampling and thinking mode significantly improve success rates, with smaller models benefiting more from thinking mode. Agentic pipelines are particularly effective under low sampling budgets and on harder datasets. Failure analysis further shows that incorrect loop invariants are the dominant error type, while agentic pipelines notably reduce assertion errors. These results expose fundamental limitations in current LLM-based approaches and suggest they remain far from replacing human-authored formal specifications. We release LiveFMBench at this https URL and all evaluation artifacts to support future research.
[AI-142] Using LLM s in Software Design: An Empirical Study of GitHub and A Practitioner Survey
【速读】:该论文旨在解决当前关于大型语言模型(Large Language Models, LLMs)在软件设计领域应用的研究空白问题,即缺乏对开发者如何使用LLMs进行软件设计、其带来的益处与局限性的实证理解。解决方案的关键在于通过混合方法研究:一方面挖掘GitHub上291个开发者与ChatGPT的对话记录,另一方面开展针对65名软件实践者的问卷调查,从而系统识别出九类由ChatGPT支持的设计任务,并深入刻画开发者与LLM的交互模式,揭示其在知识获取和代码生成方面的核心作用,同时归纳出七项主要优势与六项关键限制,为后续LLMs在软件设计中的有效集成提供实证基础与方向指引。
链接: https://arxiv.org/abs/2605.01392
作者: Yifei Wang,Ruiyin Li,Peng Liang,Yangxiao Cai,Zengyang Li,Mojtaba Shahin,Arif Ali Khan,Qiong Feng
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 29 pages, 8 images, 6 tables, Manuscript submitted to a Journal (2026)
Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human expertise and judgment. However, there has been little research focusing on how LLMs are used in software design, nor on the associated benefits and drawbacks. This paper aims to bridge this gap by empirically investigating how software developers utilize LLMs in the context of software design. We conduct a mixed-methods study, combining a mining study of 291 developer-ChatGPT conversations shared on GitHub with a survey of 65 software practitioners. Our findings reveal nine distinct categories of design tasks supported by ChatGPT, including architecture design, data model design, and the use of design patterns. We further characterize developer-ChatGPT interactions, showing that developers primarily use ChatGPT for knowledge acquisition and design-related code generation, with most tasks situated at the detailed design level. The study identifies seven key benefits of utilizing LLMs in software design as perceived by developers, such as better technology selection and the early detection of design flaws. We also uncover six limitations, including the generation of overly lengthy and difficult-to-read outputs, the creation of inexecutable or incorrect code, and a heavy reliance on context that can lead to hallucinated results. These findings provide an evidence-based characterization of current LLM use in software design from both open-source and practitioner perspectives, highlighting a tension between perceived benefits and limitations, which lays a foundation for future research and the development of effective techniques and tools to integrate LLMs into software design practices.
[AI-143] A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
【速读】:该论文试图解决当前人工智能系统因基于过于简化的神经科学模型而导致的“真实与虚假信息边界模糊化”问题,即AI在追求奖励最大化过程中放大对信息的关注,却缺乏内在机制来评估信息的有效性或相关性,从而加剧信息过载和偏见,引发判断失误甚至有害行为,这种现象被称为“心智-现实过载困境”(mind-reality overload dilemma)。解决方案的关键在于开发基于锥体神经元(pyramidal neurons)生物物理动力学的更先进AI工具,这些神经元支撑一种内在的主动精度机制(active precision mechanism),能够在信息被注意或传播前,利用局部与全局一致的预测来评估证据的有效性和情境适配性,优先保证一致性与适配性后再分配注意力,从而减少无效信息干扰、增强可靠信息的传播,有助于重建认知条件并支持形成更合理的信念与判断。
链接: https://arxiv.org/abs/2605.01376
作者: Ahsan Adeel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current AI systems, grounded in oversimplified neuroscience, risk eroding the distinction between truth and falsehood. They maximize reward by amplifying attention to information without intrinsic precision mechanisms to assess whether it is valid or worth attending to. This increases both the volume of information and the inherent biases in what the system attends to, whether true, false, or irrelevant. If not corrected, this trend will accelerate, threatening to overload systems and individuals with biased and dubious information and increasing the risk of confusion, poor judgment, and irrational or harmful decisions and behaviour, a condition I term the mind-reality overload dilemma. I argue that this threat may be mitigated by providing the public with access to more advanced AI tools built on the biophysical dynamics of pyramidal neurons underlying awake thought and higher-order cognition. These neurons support an intrinsic active precision mechanism that, rather than merely maximizing reward, uses locally and globally coherent predictions to evaluate the validity and contextual adequacy of evidence before it is attended to or propagated through hierarchies, prioritizing coherence and adequacy before attention.~While this approach does not derive or prescribe moral rules from biology, it may give rise to AI with more “real understanding”, helping restore epistemic conditions by reducing information overload and amplifying reliable information, thereby supporting the formation of better-informed beliefs and more coherent judgments that benefit society at large-though no guarantees exist.
[AI-144] Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid
【速读】:该论文旨在解决如何系统评估计算模型在类比与隐喻认知机制上的认知合理性问题。其解决方案的关键在于引入并形式化操作化最小认知网格(Minimal Cognitive Grid, MCG)框架,通过量化分析三个核心维度——功能/结构比率(Functional/Structural Ratio)、泛化能力(Generality)以及性能匹配度(Performance Match),对包括结构映射引擎(Structure-Mapping Engine, SME)、CogSketch、METCL及大型语言模型(Large Language Models, LLMs)在内的主流计算模型进行比较评估,从而依据一致且可推广的数学标准判断各模型与标准认知理论的契合程度。
链接: https://arxiv.org/abs/2605.01359
作者: Alessio Donvito,Antonio Lieto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 35 pages
Abstract:In this paper, we employ the Minimal Cognitive Grid (MCG), a framework created to evaluate the cognitive plausibility of artificial systems, to offer a systematic assessment of leading computational models of analogy and metaphor, including the Structure-Mapping Engine (SME), CogSketch, METCL, and Large Language Models (LLMs). We present a formal and quantitative operationalization of the MCG framework and, through the analysis of its three main dimensions (Functional/Structural Ratio, Generality, and Performance Match), examine how well each system aligns with standard cognitive theories of the modeled phenomena, thus allowing for comparison of the models with respect to their cognitive plausibility, according to consistent and generalizable mathematical criteria.
[AI-145] Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data
【速读】:该论文旨在解决在缺乏或仅有极少不安全样本的离线数据中学习满足约束的策略问题,尤其是在高风险场景下无法通过试错获取不安全数据时,传统方法因忽略“安全但不可行状态”(即当前满足约束但在几步内必然违反约束的状态)而导致部署失败的问题。解决方案的关键在于提出PROCO框架,其核心是利用大语言模型(LLMs)将自然语言形式的安全知识注入到模型中,构建保守的成本函数以估计风险,并结合学习到的动力学模型进行基于模型的回放,合成多样化的反事实不安全样本,从而实现可靠的状态可行性识别与可行性引导的策略学习。
链接: https://arxiv.org/abs/2605.01356
作者: Ruiqi Xue,Lei Yuan,Kainuo Cheng,Jing-Wen Yang,Yang Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states - states that currently satisfy constraints but inevitably violate them within a few steps - leading to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.
[AI-146] VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU
【速读】:该论文旨在解决GPU-based仿真环境中CUDA计算与Vulkan图形渲染任务因执行隔离导致的资源利用率低下问题。现有方法要么局限于CUDA生态的时空共享,要么采用时间分片策略造成硬件资源闲置,无法充分发挥GPU并行能力。其解决方案的核心在于提出VUDA系统,通过两个关键观察实现CUDA与Vulkan工作负载的空间并行:一是二者虽编程抽象不同,但在驱动和硬件层均收敛于相同的通道(channel)原语;二是二者虚拟地址空间天然分离,可安全合并页表而无需重映射。VUDA通过通道重定向将CUDA任务调度至Vulkan域,并结合页表嫁接统一内存空间,从而消除关键路径上的数据拷贝,实现高效的跨域空间复用。实验表明,相比传统时序共享方案,VUDA在典型具身智能任务中提升吞吐量达85%,同时显著提高GPU利用率并降低端到端延迟。
链接: https://arxiv.org/abs/2605.01352
作者: Bin Xu,Pengfei Hu,Wenxin Zheng,Jinyu Gu,Haibo Chen
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:GPU-based simulation environments for embodied AI interleave physics simulation (CUDA) and photorealistic rendering (Vulkan) on a single device. We observe that two foundational scenarios – simulation data generation and RL training – can be naturally adapted to execute their simulation and rendering phases concurrently, presenting a significant opportunity to improve GPU utilization through spatial multiplexing. However, a fundamental obstacle we term execution isolation prevents this: CUDA and Vulkan create separate GPU contexts whose channels are bound to different scheduling groups, confining compute and graphics to mutually exclusive time slices. Existing spatial-sharing techniques are limited to the CUDA ecosystem, while temporal-sharing approaches underutilize available resources. This paper presents VUDA, a system that breaks execution isolation to enable spatial parallelism between CUDA compute and Vulkan graphics workloads. VUDA is built on two key observations: although CUDA and Vulkan expose different programming abstractions, their execution paths converge to a common channel primitive at the driver and hardware level; meanwhile, their virtual-address spaces are inherently disjoint, making safe page-table merging feasible without remapping. VUDA exposes a thin API for developers to annotate co-schedulable CUDA streams, and realizes spatial sharing through channel redirection into Vulkan’s scheduling domain and page-table grafting to unify address spaces, eliminating all data copying on the critical path. Experiments on representative embodied-AI workloads show that VUDA delivers up to 85% higher throughput than temporal-sharing baselines, while improving GPU utilization and reducing end-to-end latency. Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2605.01352 [cs.OS] (or arXiv:2605.01352v1 [cs.OS] for this version) https://doi.org/10.48550/arXiv.2605.01352 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-147] ABox Abduction for Inconsistent Knowledge Bases under Repair Semantics
【速读】:该论文旨在解决不一致知识库(Knowledge Base, KB)下的ABox归因问题,即在给定一个非蕴含事实的情况下,如何通过扩展KB使其成为蕴含事实。传统研究主要集中在一致KB和经典语义下,而本文关注的是由错误数据导致的不一致KB场景,这在现实应用中更为常见。解决方案的关键在于引入基于修复语义(repair semantics)的归因定义,并提出指导性准则以生成“有用”的假设。作者通过系统分析轻量描述逻辑DL-Lite和EL_bot的不同归因变体,构建了该问题在修复语义下的复杂性全景图,从而为不一致环境下的推理提供形式化基础与计算可行性保障。
链接: https://arxiv.org/abs/2605.01341
作者: Anselm Haak,Patrick Koopmann,Yasir Mahmood,Anni-Yasmin Turhan
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:Given a knowledge base (KB) with a non-entailed fact, the ABox abduction problem asks for possible extensions of the KB that would entail this fact. This problem has many applications, ranging from diagnosis to explainability and repair. ABox abduction has been well-investigated for consistent KBs and classical semantics, but little is known for the case of inconsistent KBs, which can be caused by erroneous data. In this paper we define suitable notions of abduction in this setting and propose criteria that guide abduction towards “useful” hypotheses. To regain meaningful reasoning in the presence of inconsistencies, we use well-established repair semantics. We provide a comprehensive landscape of the complexity of ABox abduction under repair semantics, treating different variants of the abduction problem for the light-weight description logics DL-Lite and EL_bot.
[AI-148] DiagramNet: An End-to-End Recognition Framework and Dataset for Non-Standard System-Level Diagrams
【速读】:该论文旨在解决系统级芯片设计图(system-level diagrams)中非标准化符号与结构化训练数据稀缺导致现有多模态大语言模型(Multimodal Large Language Models, MLLMs)难以准确识别和理解的问题。其解决方案的关键在于构建首个针对系统级图纸的多模态数据集DiagramNet,包含10,977条连接标注和15,515个链式思维问答对,并提出一种分阶段的渐进式训练流程与解耦的多智能体工作流,将复杂的视觉推理分解为感知(Perception)、推理(Reasoning)和知识(Knowledge)三个阶段。该方法在端到端评估中显著超越当前主流模型,且具备良好的泛化能力,仅用60张图像即可实现对AMSBench数据集的有效迁移,达到与GPT-5和Claude-Sonnet-4相当甚至更优的零样本连通性推理性能。
链接: https://arxiv.org/abs/2605.01338
作者: Jincheng Lou,Ruohan Xu,Jiapeng Li,Junyin Pi,Runzhe Tao,Weijian Fan,Xiao Tan,Guojie Luo,Yibo Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures. Preprint
Abstract:System-level diagrams encode the architectural blueprint of chip design, specifying module functions, dataflows, and interface protocols. However, non-standardized symbols and the scarcity of structured training data hinder existing multimodal large language models (MLLMs) from recognizing these diagrams. To address this gap, we introduce DiagramNet, the first multimodal dataset for system-level diagrams, comprising 10,977 connection annotations and 15,515 chain-of-thought QA pairs across four tasks: Listing, Localization, Connection, and Circuit QA. Building on this dataset, we propose a progressive training pipeline together with a decoupled multi-agent workflow that decomposes complex visual reasoning into Perception, Reasoning, and Knowledge stages. On the DiagramNet benchmark, integrating our 3B-parameter model with the proposed workflow surpasses the 2025 EDA Elite Challenge winner and outperforms GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2x in end-to-end evaluation. Notably, the workflow generalizes beyond our model, boosting Task 1 performance by 128.7x for Gemini-2.5-Pro and 12.4x for GPT-5. Furthermore, with only 60 images for detector adaptation, the method transfers effectively to AMSBench, achieving zero-shot connectivity reasoning on par with GPT-5 and Claude-Sonnet-4 while surpassing the AMS state-of-the-art method Netlistify.
[AI-149] ruth or Tribe: How In-group Favoritism Prioritize Facts in Persona Agents
【速读】:该论文旨在解决persona代理在面对矛盾信息(如虚假信息)时是否存在群体内偏袒(in-group favoritism)及其如何缓解此类偏袒带来的负面影响的问题。研究表明,persona代理在群体内成员提供错误答案时接受率显著高于群体外成员,且这种偏袒在可辩驳推理情境中依然存在并随认知复杂度上升而加剧。解决方案的关键在于提出三种干预策略:身份盲指令(Identity-Blind Instruction)、结构化反事实推理(Structured Counterfactual Reasoning)和异质视角集成(Heterogeneous Perspective Ensemble),通过减少身份线索影响、强化逻辑校验机制与引入多元视角来有效缓解群体内偏袒行为。
链接: https://arxiv.org/abs/2605.01329
作者: Shijun Lei,Hongyu Wang,Yunji Liang,Haowen Zheng,Bin Guo,Zhiwen Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 21 pages. Under review
Abstract:In-group favoritism refers to the phenomena of favoring members of one’s in-group over out-group members and is widely observed in numerous social cooperative behaviors. Recently, in-group favoritism biases have also been identified in generative language models. However, whether the in-group favoritism exists when persona agents are faced with contradicting information (e.g., misinformation), and how to mitigate the adverse effects of in-group favoritism biases in persona agents have been understudied. To address these problems, we propose a Truth or Tribe simulation framework to study the agent cooperation within the spread of contradicting information through a triadic interaction paradigm, and conduct controlled trials to evaluate the primary moderating factors. Extensive results showcase that persona agents display strong in-group favoritism, accepting incorrect answers from identity-similar peers at much higher rates than from dissimilar peers. In-group favoritism continues to emerge in defeasible reasoning contexts where no absolute truth exists, and it intensifies as cognitive complexity increases. Furthermore, three intervention strategies–Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble–are proposed to mitigate the in-group favoritism.
[AI-150] Segment-Aligned Policy Optimization for Multi-Modal Reasoning
【速读】:该论文旨在解决现有强化学习(Reinforcement Learning, RL)方法在大语言模型(Large Language Models, LLMs)中因策略优化粒度与推理过程的自然分步结构不匹配而导致的信用分配偏差(credit assignment bias)和训练不稳定问题,尤其是在多模态推理任务中。解决方案的关键在于提出一种新的强化学习范式——段对齐策略优化(Segment-Aligned Policy Optimization, SAPO),其核心是将连贯的推理步骤(reasoning segments)作为策略更新的基本单元,而非单个token或完整响应序列;SAPO通过构建面向推理段的马尔可夫决策过程抽象(step-wise Markov decision process abstraction),并引入语义对齐的段级价值估计、优势计算和重要性采样机制,实现了更精确的策略梯度估计和更稳定的训练过程。
链接: https://arxiv.org/abs/2605.01327
作者: Lei Gao,Zhuoming Li,Mengxi Jia,Jiakang Yuan,Hongbo Sun,Hao Sun,Xuelong Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences as fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks demonstrate that SAPO consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability and value estimation consistency. Our work underscores the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning, paving the way for more efficient and semantically grounded policy optimization in complex reasoning tasks. Codes and models will be released to ensure full reproducibility.
[AI-151] GraphSculptor: Sculpting Pre-training Coreset for Graph Self-supervised Learning IJCAI2026
【速读】:该论文旨在解决图自监督学习(Graph Self-Supervised Learning)中因依赖大规模无标签数据集而导致计算成本高昂的问题。研究表明,这些数据集中存在显著冗余——均匀子采样50%的图仍能保留超过96%的下游任务性能。解决方案的关键在于提出GraphSculptor方法,通过两种互补视角构建预训练核心集(coreset):一是基于图内在结构统计量提取结构多样性特征向量,二是利用预训练语言模型对图到文本生成的描述进行语义编码以捕获上下文语义多样性。该方法将两类信号映射至统一度量空间,并采用聚类感知选择策略以保留联合的结构-语义多样性,从而在显著降低计算开销的同时保持高性能。此外,作者还推导了核心集与全数据预训练之间的损失差距理论边界,为选择策略提供了理论依据。实验表明,仅使用10%的数据即可达到接近全数据的性能(99.6%),同时预训练时间减少近90%。
链接: https://arxiv.org/abs/2605.01310
作者: Chuang Liu,Zelin Yao,Xueqi Ma,Luzhi Wang,Mukun Chen,Pinghua Xu,Wenbin Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, Accepted by IJCAI 2026
Abstract:Graph self-supervised learning typically relies on large-scale unlabeled datasets, heavily inflating computational costs. However, empirical evidence suggests that these datasets contain substantial redundancy-our analysis reveals that uniformly subsampling 50% of graphs retains over 96% of downstream performance. To exploit this redundancy, we introduce GraphSculptor for pre-training coreset construction. Unlike methods dependent on additional training-time signals or limited solely to topological statistics, GraphSculptor provides a label-free solution that constructs coresets via two complementary perspectives: intrinsic structure and contextual semantics. Concretely, structural diversity is quantified using intrinsic graph statistics, yielding a structural feature vector for each graph, while semantic diversity is captured by utilizing a pre-trained language model to encode descriptions generated via graph-to-text. GraphSculptor integrates these signals into a unified metric space and performs cluster-aware selection to preserve joint structural-semantic diversity. We further derive a theoretical bound on the loss gap between coreset and full-data pre-training, offering theoretical motivation for our selection formulation. Extensive experiments demonstrate that GraphSculptor effectively sculpts the dataset: a 10% coreset achieves 99.6% of full-data performance while reducing pre-training time by nearly 90%, offering a scalable solution for data-efficient graph pre-training.
[AI-152] Are we Doomed to an AI Race? Why Self-Interest Could Drive Countries Towards a Moratorium on Superintelligence
【速读】:该论文试图解决的问题是:在地缘政治超级大国之间,如何通过博弈论分析来论证对人工超智能(Artificial Superintelligence, ASI)实施禁令(moratorium)是否符合国家的自身利益。传统观点普遍认为技术领先地位优先于风险管控,但本文挑战这一认知,旨在阐明在特定条件下,主动暂停ASI研发反而是一种理性且符合国家利益的战略选择。解决方案的关键在于构建一个基于博弈论的模型,量化技术优势收益与失控风险之间的权衡关系,并证明当失控成本足够高时,各国均会倾向于采取自我约束的禁令策略;此外,论文还提供了实证证据表明全球对ASI风险的认知正在上升,从而增强了这种理性禁令在当前国际格局下的可行性与稳定性。
链接: https://arxiv.org/abs/2605.01297
作者: Edward Roussel,Lode Lauwaert,Torben Swoboda,Grant Ramsey,Risto Uuk,Leonard Dung
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, 3 figures
Abstract:This paper uses game theory to argue that, contrary to the prevailing view, a moratorium on Artificial Superintelligence (ASI) can be in a state’s self-interest. By formalizing trategic interactions between geopolitical superpowers, we model the trade-off between the benefits of technological supremacy and the catastrophic risks of uncontrolled ASI. The analysis reveals that as the perceived cost of loss of control increases sufficiently relative to other parameters, it becomes in each state’s self-interest to impose a moratorium. We further provide empirical evidence suggesting that the global perception of ASI risk is rising, making a stable, rational moratorium increasingly plausible in the current geopolitical landscape.
[AI-153] Autonomous Drift Learning in Data Streams: A Unified Perspective
【速读】:该论文旨在解决自主学习系统在非平稳环境中的适应性问题,即传统假设数据分布和模型行为恒定(stationarity)在现实场景中不成立的问题。其解决方案的关键在于提出一个三维分类框架,从时间流漂移(time stream drift)、数据流漂移(data stream drift)和模型流漂移(model stream drift)三个维度系统化地刻画非平稳性的本质:时间流漂移区分随机任意模式与结构化节奏动态;数据流漂移分离特征表示变化(representation drift)与语义变化(semantic drift);模型流漂移则通过顺序可塑性(sequential plasticity)、去中心化异质性(decentralized heterogeneity)和策略不稳定性(policy instability)揭示学习系统的内生演化机制。这一框架整合了漂移适应、持续学习与时间泛化等碎片化研究范式,为构建能够通过持续变化自主进化的智能系统提供了理论基础与研究路径。
链接: https://arxiv.org/abs/2605.01295
作者: Xiaoyu Yang,En Yu,Jie Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Survey Paper, 20 pages
Abstract:In the pursuit of autonomous learning systems, the foundational assumption of stationarity, the premise that data distributions and model behaviors remain constant, is fundamentally untenable. Historically, the research community has addressed non-stationary environments almost exclusively under the scope of concept drift, focusing primarily on temporal shifts in streams. However, as learning systems become increasingly autonomous and complex, merely adapting to temporal non-stationarity is no longer sufficient. Evolving beyond this traditional perspective, we propose a novel, three-dimensional taxonomy that systematizes the field based on the operational state of the system. First, time stream drift distinguishes between stochastic arbitrary patterns and structural rhythmic dynamics. Second, data stream drift disentangles shifts in feature representations, identified as representation drift, from changes in underlying semantics, recognized as semantic drift. Third, model stream drift characterizes the internal endogenous divergence of learning systems through the lenses of sequential plasticity, decentralized heterogeneity, and policy instability. Based on this framework, we systematically review 193 representative studies and identify key open challenges. By bridging the fragmented paradigms of drift adaptation, continual learning, and temporal generalization, this survey outlines a roadmap for building self-evolving intelligent systems capable of learning autonomously through continuous change.
[AI-154] Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agent ic Tasks ICML2026
【速读】:该论文旨在解决基础模型驱动的智能体在长周期规划中因纯提示推理的瞬时性而导致的性能瓶颈问题。现有技能归纳方法虽通过将经验提炼为无状态参数化脚本缓解此问题,但无法捕捉动态环境中执行所需的条件逻辑。其解决方案的关键在于提出神经符号技能归纳(Neuro-Symbolic Skill Induction, NSI)框架,该框架将交互轨迹提升为模块化、逻辑 grounded 的程序,通过显式控制流与动态变量绑定的合成,使智能体能够自主识别“何时”和“为何”采取行动,从而实现从少量示例中高效归纳技能并灵活适应未见目标的能力。
链接: https://arxiv.org/abs/2605.01293
作者: Jie-Jing Shao,Haiyan Yin,Yueming Lyu,Xingrui Yu,Lan-Zhe Guo,Ivor Tsang,James Kwok,Yu-Feng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICML 2026
Abstract:Foundation model-driven agents often struggle with long-horizon planning due to the transient nature of purely prompting-based reasoning. While existing skill induction methods mitigate this by distilling experience into state-blind parameterized scripts, they fail to capture the conditional logic required for robust execution in dynamic environments. In this paper, we propose Neuro-Symbolic Skill Induction (NSI), a framework that lifts interaction traces into modular, \textitlogic-grounded programs. By synthesizing explicit control flows and dynamic variable binding, NSI empowers agents to discover \textitwhen and \textitwhy to act. This paradigm enables the efficient generalization, allowing agents to induce skills from few-shot examples and flexibly adapt to unseen goals. Experiments on a series of agentic tasks demonstrate that NSI consistently outperforms state-of-the-art baselines, empowering agents to self-evolve into architects of logic-grounded skills.
[AI-155] Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations Not Just Heuristics
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理服务中长期依赖通用启发式算法所带来的性能不可靠问题。当前主流服务系统如vLLM和SGLang仍沿用经典分布式计算中的请求调度策略(如最短队列优先或轮询)、FIFO调度以及LRU缓存淘汰机制,这些方法未能充分考虑LLM推理特有的结构特性,包括动态增长的键值缓存(Key-Value Cache, KV cache)内存、预填充(prefill)与解码(decode)阶段的不对称性、输出长度未知性及持续批处理约束。论文指出,解决方案的关键在于构建能够刻画上述特性的数学模型,并在此基础上设计具有理论保证的算法,从而实现跨多样化工作负载的可证明性能保障,而非依赖在特定场景下有效但缺乏稳定性的启发式策略。
链接: https://arxiv.org/abs/2605.01280
作者: Zijie Zhou
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:This position paper argues that LLM inference serving has outgrown generic heuristics and now demands mathematical optimization and algorithmic foundations. Despite rapid advances in serving systems such as vLLM and SGLang, their algorithmic cores remain largely unchanged from classical distributed computing: request routing uses join-shortest-queue or round-robin, scheduling defaults to FIFO, and KV cache eviction follows LRU. These general-purpose policies ignore the distinctive structure of LLM inference–dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints. We contend that the field must develop mathematical models capturing these characteristics, enabling the design of algorithms with provable performance guarantees across diverse workloads, rather than heuristics that may succeed in some scenarios but fail unpredictably in others. Emerging work at the intersection of operations research and ML systems demonstrates that principled methods can match or exceed heuristic performance while providing theoretical guarantees. We call on the community to recognize algorithmic design for LLM serving as a research frontier.
[AI-156] Valley3: Scaling Omni Foundation Models for E-commerce
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)在电商场景中对文本、图像、视频和音频等多模态信息理解与推理能力不足的问题,尤其是在短视频等复杂交互场景下缺乏原生多语言音频处理能力的瓶颈。解决方案的关键在于提出一个四阶段的“全栈式”电商持续预训练流程,逐步赋予模型音频理解、跨模态指令遵循、电商领域知识以及长上下文推理能力;同时通过后训练优化引入可控的多级思维模式(非思考模式与三种不同深度的思考层级),平衡简单任务的推理效率与复杂任务的深度分析需求,并集成代理式搜索功能以主动调用工具获取任务相关电商信息,从而构建出具备统一理解与推理能力的全模态电商大模型 Valley3。
链接: https://arxiv.org/abs/2605.01278
作者: Zeyu Chen,Guanghao Zhou,Qixiang Yin,Ziwang Zhao,Huanjin Yao,Pengjiu Xia,Min Yang,Cen Chen,Minghui Qiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning modes, enabling one non-thinking mode and three distinct levels of thinking, thereby balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.
[AI-157] Uncertainty-Aware Trip Purpose Inference from GPS Trajectories via POI Semantic Zones and Pareto Calibration
【速读】:该论文旨在解决从大规模GPS轨迹数据中自动标注出行目的(trip purpose)的难题,核心挑战包括个体层面缺乏真实标签、GPS噪声导致的空间不确定性、POI覆盖不全以及不同出行目的间的行为差异。解决方案的关键在于提出一种弱监督框架,融合邻里级POI语义区域与距离加权空间似然,区分强制性(mandatory)与非强制性(non-mandatory)活动的推理策略,并采用多阶段帕累托优化,在无需人工标注的前提下同时最小化出行分布与家庭出行调查统计之间的分布差异(Jensen-Shannon散度)并最大化推断可靠性。该方法在洛杉矶8100万条停留点上验证,显著降低了活动类型、起始时间和持续时间的JSD指标,为交通需求建模和政策分析提供了可扩展且具备不确定性感知的语义化移动数据路径。
链接: https://arxiv.org/abs/2605.01257
作者: Bo Yang,Haoxuan Ma,Yifan Liu,Zhiyuan Zhang,Chris Stanford,Morgan Sun,Jiaqi Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale GPS trajectory data offer rich observations of human mobility, yet assigning trip purposes to detected stops remains challenging due to the absence of individual-level ground truth, spatial uncertainty from GPS noise and incomplete points of interest (POIs) coverage, and fundamental behavioral differences across trip purposes. We propose a weakly supervised framework integrating neighborhood-level POI semantic zones with distance-weighted spatial likelihoods, differentiated inference strategies for mandatory and non-mandatory activities, and a multi-phase Pareto optimization that jointly minimizes distributional divergence from household travel survey statistics and maximizes inference reliability without requiring annotated labels. Evaluated on over 81 million staypoints in Los Angeles, the framework reduces activity type frequency Jensen-Shannon distance (JSD) by 23%, start time JSD by 48%, and duration JSD by 12% respectively relative to a comparable baseline. The proposed approach provides a scalable and uncertainty-aware path from raw GPS trajectories to semantically annotated mobility data for travel demand modeling and transportation policy analysis.
[AI-158] EO-Gym: A Multimodal Interactive Environment for Earth Observation Agents
【速读】:该论文旨在解决地球观测(Earth Observation, EO)分析中交互性不足的问题,即现有EO基准测试将复杂的多轮、多模态分析过程简化为固定输入的单次任务,难以反映真实场景下的动态推理需求。其解决方案的关键在于构建了一个名为EO-Gym的可控可执行框架,该框架以Gymnasium风格定义本地地理空间工作区,集成超过660,000个多模态文件(按地理位置、时间和传感器类型索引),并提供35个面向EO任务的专用工具,覆盖六类典型分析流程。通过此环境,作者进一步构建了EO-Gym-Data基准数据集(包含9,078条轨迹和34,604步推理步骤),并基于公开EO数据集及Landsat与Sentinel-2影像进行评估,验证了交互式多模态推理对模型能力的更高要求,并提出EO-Gym-4B作为基线模型,在Pass@3指标上从0.49提升至0.74,表明该框架能有效推动EO代理在时空规划和跨模态感知方面的研究进展。
链接: https://arxiv.org/abs/2605.01250
作者: Sai Ma,Zhuang Li,Sichao Li,Xinyue Xu,Ruibiao Zhu,Tony Boston,John A. Taylor
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Earth Observation (EO) analysis is inherently interactive: resolving uncertainty often requires expanding the region of interest, retrieving historical observations, and switching across sensors such as optical and Synthetic Aperture Radar. However, most EO benchmarks collapse this process into fixed-input, single-turn tasks. To address this gap, we present EO-Gym, a controlled executable framework for multimodal, tool-using EO agents that formulates EO analysis as a Gymnasium-style local geospatial workspace backed by more than 660k multimodal files indexed by location, time, and sensor type, with 35 EO-specialized tools spanning six task families. Built on this environment, we construct EO-Gym-Data, a benchmark of 9,078 trajectories and 34,604 reasoning steps, and grounded in eight public EO datasets together with Landsat and Sentinel-2 imagery. Evaluating 10 open and closed VLMs shows that strong general-purpose models still struggle with interactive EO reasoning, especially on temporal and cross-modal workflows. As a reference baseline, EO-Gym-4B, obtained by fine-tuning Qwen3-VL-4B-Instruct on EO-Gym-Data, improves overall Pass@3 from 0.49 to 0.74 under the main evaluation setting. O-Gym provides a reproducible environment for interactive EO agents, operationalizing EO as an evidence-gathering problem that requires planning across geospatial, temporal, and sensing modality.
[AI-159] Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
【速读】:该论文旨在解决大规模脑影像(fMRI)表示学习中,区域感知掩码策略与混合序列建模机制对预训练效果影响不明确的问题。其核心解决方案是提出Rhamba框架,通过结合解剖学引导的区域感知掩码(region-aware masking)与混合注意力-状态空间模型(Attention-Mamba)架构,实现更精准、可解释且高效的fMRI表征学习。关键创新在于:1)设计三种空间特异性递增的掩码策略(Any、Majority、Pure)以探索不同区域信息利用方式;2)构建两种混合编码器-解码器结构(Attention-Mamba和Mamba-Attention),平衡局部与全局依赖建模能力,从而在下游精神疾病分类任务中显著提升性能并增强模型可解释性。
链接: https://arxiv.org/abs/2605.01240
作者: Ruthwik Reddy Doodipala,Pankaj Pandey,Pratheek Eranki,Carolina Torres-Rojas,Manob Jyoti Saikia,Ranganatha Sitaram
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-supervised pretraining is promising for large-scale neuroimaging, yet the impact of region-aware masking and hybrid sequence modeling remains underexplored. In this work, we introduce Rhamba, a region-aware pretraining framework that integrates anatomically guided masking with hybrid Attention-Mamba architectures for resting state functional magnetic resonance imaging (fMRI) analysis. Models were pretrained on the ABIDE dataset using region-aligned patch embeddings and three masking strategies (Any, Majority, and Pure) with increasing spatial specificity. We evaluated four architectural variants: a Mamba only model, an Alternate architecture with interleaved Mamba and Attention blocks, and two hybrid encoder-decoder configurations (Attention-Mamba (AM) and Mamba-Attention (MA)). The pretrained models were fine-tuned on downstream classification tasks using the COBRE and ADHD-200 datasets for schizophrenia and attention-deficit/hyperactivity disorder discrimination. We employed Integrated Gradients, an explainable AI method, to identify the brain regions contributing to model predictions. Masking strategy strongly influenced reconstruction behavior, with reconstruction loss following a consistent ordering (Any Majority Pure). However, this trend did not directly translate into downstream performance, where differences were modest and dataset-dependent. The hybrid architecture with the MA configuration achieved the highest average AUROC across both datasets, and Rhamba outperformed state-of-the-art methods in comparative evaluation. Region-wise analysis showed that peak performance depends on the interaction between masking strategy and architecture rather than a single dominant configuration. Overall, Rhamba offers a flexible framework for balancing interpretability, scalability, and performance in large-scale fMRI representation learning.
[AI-160] MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention
【速读】:该论文旨在解决当前数字音乐服务在情绪调节与心理压力缓解中面临的两大局限:一是依赖静态用户偏好,无法实时适应用户的即时心理状态;二是直接将脑电图(EEG)信号映射到音乐生成存在配对数据稀缺和可解释性不足的问题。其解决方案的关键在于提出一个闭环式实时系统——MindMelody,该系统通过引入“情绪介导的语义桥梁”实现从EEG到个性化音乐干预的动态转化:首先利用混合Transformer-图神经网络(GNN)解码实时EEG信号为全局效价-唤醒状态和局部时间情感轨迹;随后借助带检索增强生成(RAG)的大型语言模型(LLM)制定结构化干预计划;再通过新型分层EEG控制器将全局情感前缀与局部时序引导注入预训练音乐基座模型,实现细粒度可控音频合成;最后结合连续反馈机制根据用户动态EEG变化实时调整生成参数,从而显著提升控制一致性、情感匹配度及用户感知帮助性。
链接: https://arxiv.org/abs/2605.01235
作者: Yimeng Zhang,Yueru Sun,Haoyu Gu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users’ instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user’s evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.
[AI-161] Zero-Shot Signal Temporal Logic Planning with Disjunctive Branch Selection in Dynamic Semantic Maps
【速读】:该论文旨在解决信号时序逻辑(Signal Temporal Logic, STL)规划在动态环境中的零样本泛化难题,即如何在不重新训练模型的前提下,为不同地图配置(variable-map environments)生成满足STL任务规范的可行轨迹。其解决方案的关键在于:1)引入一种地图条件化的Transformer架构,结合轻量级启发式策略,有效处理复杂析取(disjunctive, OR)子公式;2)利用传递强化学习(Transitive Reinforcement Learning, TRL)确保分解后的子任务在时间上具有一致性和逻辑连贯性,从而实现对多样障碍布局下STL规范的高效、鲁棒执行。
链接: https://arxiv.org/abs/2605.01222
作者: Bowen Ye,Ancheng Hou,Junyue Huang,Ruijia Liu,Xiang Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Signal Temporal Logic (STL) offers verifiable task specifications and is crucial for safety-critical control. Yet STL planning remains challenging: exact optimization-based methods are often too slow, and learning-based methods struggle to generalize across varying environments. We propose a zero-shot STL planning solver for variable-map environments that generates feasible trajectories without retraining. By integrating a map-conditioned Transformer architecture with a lightweight heuristic, our approach effectively handles complex disjunctive (OR) subformulas. Furthermore, we leverage Transitive Reinforcement Learning (TRL) to ensure consistent temporal grounding and logical coherence across decomposed sub-tasks. Experiments on dynamic semantic maps with diverse obstacle layouts demonstrate consistent gains, highlighting the framework’s superior zero-shot generalization to changing environments and broad STL coverage.
[AI-162] Agent ic AI Systems Should Be Designed as Marginal Token Allocators
【速读】:该论文试图解决当前生成式 AI (Generative AI) 系统设计与评估中缺乏统一经济视角的问题,即各组件(如路由、代理决策、服务栈和训练流水线)被孤立设计,导致整体资源分配效率低下。其解决方案的关键在于将整个系统视为一个“边际 token 分配经济”(marginal token allocation economy),强调所有层级均需满足相同的最优条件:边际收益等于边际成本加上延迟成本和风险成本。这一框架揭示了局部优化为何会导致全局资源错配,并预测了常见故障模式(如过度路由、过度委托、验证不足等),进而提出以 token-aware 评估、自主性定价、拥塞定价的服务机制及风险调整的强化学习预算作为具体研究方向。
链接: https://arxiv.org/abs/2605.01214
作者: Siqi Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:This position paper argues that agentic AI systems should be designed and evaluated as \emphmarginal token allocation economies rather than as text generators priced by the unit. We follow a single request – a developer asking a coding agent to fix a failing test – through four economic layers that today are designed in isolation: a router that decides which model answers, an agent that decides whether to plan, act, verify, or defer, a serving stack that decides how to produce each token, and a training pipeline that decides whether the trace is worth learning from. We show that all four layers are solving the \emphsame first-order condition – marginal benefit equals marginal cost plus latency cost plus risk cost – with different index sets and different prices. The framing is deliberately minimal: we do not propose a complete theory of AI economics. But adopting marginal token allocation as the shared accounting object explains why systems that locally minimize tokens globally misallocate them, predicts a small set of recurring failure modes (over-routing, over-delegation, under-verification, serving congestion, stale rollouts, cache misuse), and points to a concrete research agenda in token-aware evaluation, autonomy pricing, congestion-priced serving, and risk-adjusted RL budgeting.
[AI-163] Faithful Mobile GUI Agents with Guided Advantage Estimator
【速读】:该论文旨在解决基于视觉-语言模型(Vision-Language Model, VLM)的图形用户界面(GUI)代理在交互中表现出不忠实行为的问题,即代理倾向于依赖记忆中的捷径而非基于屏幕显示证据或用户指令进行决策。解决方案的关键在于提出一种以“忠实性优先”(faithfulness-first)的框架——Faithful-Agent,其核心是采用两阶段流水线:第一阶段为面向忠实性的监督微调(SFT),使代理在面对输入证据扰动时采取回避行为;第二阶段引入引导优势估计器(GuAE),这是一种基于锚点且自适应方差的优势调节机制,结合GRPO优化策略,有效缓解稀疏奖励下低方差轨迹组的优势崩溃问题,并通过思想-动作一致性奖励进一步提升忠实性指标(Trap SR从13.88%提升至80.21%),同时保持对复杂指令的良好遵循能力。
链接: https://arxiv.org/abs/2605.01208
作者: Haowen Hu,Pengzhou Cheng,Zheng Wu,Lingzhong Dong,Gongshen Liu,Zhuosheng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language model based graphical user interface (GUI) agents have shown strong interaction capabilities. However, they often behave unfaithfully, relying on memorized shortcuts rather than grounding actions in displayed screen evidence or user instructions. To address this, we propose Faithful-Agent, a faithfulness-first framework that reformulates GUI interaction to prioritize evidence groundedness and internal consistency. Faithful-Agent employs a two-stage pipeline: (i) a faithfulness-oriented SFT stage to instill abstainment behaviors under evidence perturbations; (ii) an RFT stage that further amplifies faithfulness by introducing the guided advantage estimator (GuAE), an anchor-based and variance-adaptive advantage tempering mechanism built upon GRPO. GuAE prevents advantage collapse in low-variance rollout groups under sparse GUI rewards, and with a thought-action consistency reward, Faithful-Agent (Stage II) elevates the Trap SR from 13.88% to 80.21% relative to the baseline, while preserving robust general instruction-following performance.
[AI-164] NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
【速读】:该论文旨在解决临床人工智能(AI)应用中因高性能模型的“黑箱”或“灰箱”特性而导致的专业级可解释性不足问题,即模型缺乏本体论基础(ontological grounding)和叙事透明性(narrative transparency)。其解决方案的关键在于提出一个神经符号系统 NEURON,该系统通过融合 SNOMED CT 本体驱动的结构化表示与机器学习模型,将原始数据映射至医学术语体系,从而提升预测可靠性;同时引入基于检索增强生成(Retrieval-Augmented Generation, RAG)的大型语言模型(LLM)层,将 SHAP 特征归因与患者特异性临床笔记整合为连贯的自然语言解释,显著增强了人类对模型决策的理解与信任。
链接: https://arxiv.org/abs/2605.01189
作者: Anuradha Chandrasekaran,Dimitrios Zikos,Mutlu Mete,Alan Pang,Brady D. Lund,Kewei Sha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical AI adoption is hindered by the black-box/grey-box nature of high-performing models, which lack the ontological grounding and narrative transparency required for professional-level explainability. We present NEURON, a neuro-symbolic system designed to enhance both predictive reliability and clinical interpretability. NEURON integrates SNOMED CT ontology-informed structural representations with machine learning models to bridge the gap between raw data and medical nomenclature. To facilitate human-aligned interaction, the system utilizes a Retrieval-Augmented Generation (RAG) grounded LLM layer to synthesize SHAP feature attributions and patient-specific clinical notes into coherent, natural-language explanations. Validated on the MIMIC-IV dataset for Acute Heart Failure mortality prediction, NEURON improved the AUC from 0.74-0.77 to 0.84-0.88 and significantly outperformed raw SHAP visualizations in human-aligned metrics (0.85 vs. 0.50). Our results demonstrate that NEURON offers a robust, scalable engineering solution for deploying trustworthy, human-centered connected health applications.
[AI-165] Minimizing Collateral Damage in Activation Steering
【速读】:该论文旨在解决激活控制(activation steering)中因标准干预方法(如向量加法)导致的“旁系损伤”(collateral damage)问题,即在调整目标特征方向时,无意中改变了其他非目标特征方向上的激活对齐,从而引发模型在无关任务上的性能下降。解决方案的关键在于将激活控制建模为一个约束优化问题,通过引入基于激活经验二阶矩矩阵(empirical second-moment matrix)的加权项来最小化预期平方旁系损伤,从而实现对不同特征方向扰动成本的非均匀建模,相较于传统各向同性惩罚策略,显著提升了控制精度并降低了对模型整体性能的影响。
链接: https://arxiv.org/abs/2605.01167
作者: Tam Nguyen,Tu Anh Nguyen,Sina Alemohammad,Richard G. Baraniuk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Activation steering is a method for controlling Large Language Model (LLM) behavior by intervening in its internal representations to increase the alignment with a specific target feature direction. However, standard interventions, such as vector addition, often cause ``collateral damage", defined as unintended changes in the alignment of activations along other non-target feature directions. This damage occurs because standard methods implicitly assume the isotropy of non-target features. In this work, we provide a mathematical formalization of collateral damage and introduce a principled framework that models steering as a constrained optimization problem. Our method finds a new activation that minimizes the expected squared collateral change weighted by the empirical second-moment matrix of activations. This weighting encodes the nonuniform cost of the perturbation in different feature directions, in contrast to isotropic approaches that penalize changes uniformly in all feature directions. By accounting for the empirical second-moment of activations, our approach achieves more precise control while reducing the degradation of model performance on unrelated tasks.
[AI-166] LLM s Should Not Yet Be Credited with Decision Explanation
【速读】:该论文试图解决的问题是:当前研究中存在将大型语言模型(Large Language Models, LLMs)误认为具备决策解释能力的现象,这可能导致对“解释性进展”定义的过早泛化,从而削弱人类决策建模中真正解释力的标准。解决方案的关键在于提出一个“桥接标准”(bridge standard)用于判定LLMs是否应被赋予决策解释信用,该标准要求更强的主张必须明确解释目标、区分弱于其自身的理性化替代方案、采用针对目标的过程敏感或干预敏感的验证方法,并限定适用范围。这一标准旨在保留LLMs在行为预测、理由生成和假设提出方面的价值,同时避免对其解释能力做出不恰当的夸大,最终通过“信用校准原则”——即LLMs仅应获得其证据所能支持的最强声明——推动其从说服性的决策叙述者转变为更可靠的解释发现、测试与传播工具。
链接: https://arxiv.org/abs/2605.01164
作者: Wenshuo Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This position paper argues that LLMs should not yet be credited with decision explanation. This matters because recent work increasingly treats accurate behavioral prediction, plausible rationales, and outcome-conditioned reasoning traces as evidence that LLMs explain why people decide as they do, risking a premature redefinition of what counts as explanatory progress in human decision modeling. We first distinguish three claims with different evidential burdens: decision prediction, rationale generation, and decision explanation. We then argue that the evidence most commonly offered for LLM-based decision accounts directly supports the first two claims, and sometimes explanatory hypothesis generation, but does not distinguish decision explanation from prediction-supportive rationalization. Next, we propose a bridge standard for decision-explanation credit: stronger claims should specify explanatory targets, discriminate against weaker rationalizer alternatives, use target-appropriate process- or intervention-sensitive validation, and bound their scope. We then situate this standard against competing views and related literatures, clarifying why it preserves the value of LLMs as predictors, narrators, and hypothesis generators while resisting premature explanatory credit. We conclude with a principle of credit calibration: LLMs should be credited for the strongest claim their evidence warrants, and no stronger; if adopted, this principle can help turn LLMs from persuasive narrators of decisions into more reliable instruments for discovering, testing, and communicating explanations of human behavior.
[AI-167] he Productivity-Reliability Paradox: Specification-Driven Governance for AI-Augmented Software Development
【速读】:该论文试图解决生成式 AI(Generative AI)在软件开发中引发的“生产力-可靠性悖论”(Productivity-Reliability Paradox, PRP)问题,即AI编码助手在提升任务完成数量的同时,却导致代码审查时间显著延长、交付效率未见改善甚至下降的现象。其解决方案的关键在于识别出“规范治理”(Specification Governance)是制约AI辅助软件可靠性的核心约束,而非模型能力本身;为此,论文提出基于交易成本经济学的 Specification Governance Model(SGM),并通过 Spec Kit 和 TDAD 两个实例验证其有效性,强调通过结构化规范制定与治理机制来缓解非确定性代码生成带来的系统性风险。
链接: https://arxiv.org/abs/2605.01160
作者: Sabry E. Farrag
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 30 pages, 4 tables, 1 figure, 71 references
Abstract:Since 2022, AI-powered coding assistants have produced contradictory evidence: controlled studies report 20-56% productivity gains on well-scoped tasks, while the most rigorous RCT documents a 19% slowdown for experienced developers, and telemetry across 10,000+ developers shows 98% more pull requests but 91% longer review times with flat delivery metrics. This paper argues these findings constitute the Productivity-Reliability Paradox (PRP): a systematic phenomenon emerging from non-deterministic code generators and insufficient specification discipline. Through a multivocal literature review of 67 sources (2022-2026), this paper: (1) formally defines the PRP with three moderating variables (task abstraction, codebase maturity, developer experience) and two amplifying mechanisms (code review bottleneck, context window constraint); (2) proposes the AI-Augmented Methodology Taxonomy (AAMT), classifying six methodologies under three AI integration tiers; (3) introduces the Specification Governance Model (SGM), grounded in Transaction Cost Economics, with a practical governance decision guide; and (4) evaluates Spec Kit and TDAD as SGM instantiations via a four-month pilot study. Specification discipline, not model capability, is the binding constraint on AI-assisted software dependability.
[AI-168] Multi-Perspective Transformers in ARC-AGI-2 Challenge
【速读】:该论文旨在解决机器在面对人类直觉型视觉谜题时的泛化能力、符号意义理解能力以及规则灵活应用能力的评估问题,其核心挑战在于模型如何从少量示例中学习并适应不同情境下的规则变化。解决方案的关键在于使用TinyLM作为基础模型,并引入测试时训练(Test-Time Training, TTT)和专家乘积(Products of Experts, POE)进行额外微调,从而提升模型在ARC-AGI-2基准上的表现,实现96.1%的训练集准确率和21.7%的评测集准确率。
链接: https://arxiv.org/abs/2605.01154
作者: Caleb Talley,Vedant Tibrewal,Seun Adekunle,Weiwen Dong,Xinyu Wu,Fariha Sheikh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:ARC-AGI-2 is a benchmark of human-intuitive visual puzzles that measures a machine’s ability to generalize from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts. In this paper, we discuss our approach to solving the ARC-AGI-2 puzzles with TinyLM, with additional fine-tuning at test time, including Test-Time-Training (TTT) and Products of Experts (POE). Our model achieves 96.1% accuracy on the training set and 21.7% accuracy on the evaluation set.
[AI-169] Position: Safety and Fairness in Agent ic AI Depend on Interaction Topology Not on Model Scale or Alignment
【速读】:该论文试图解决的问题是:当前主流假设认为,单个大语言模型(Large Language Models, LLMs)的安全属性可以组合成多智能体系统的安全行为,但这一假设在实际中存在根本性错误。论文指出,在代理型人工智能(Agentic AI)场景下,系统安全性并非由模型权重决定,而是由交互拓扑结构(interaction topology)主导,尤其在顺序决策或并行投票等机制中,信息流和决策耦合结构显著影响最终结果。解决方案的关键在于:将 agentic AI 视为一个动力系统(dynamical system),而非孤立对齐组件的集合,并将交互拓扑作为安全评估与监管的核心对象,要求系统在部署前展示对不同架构变体的鲁棒性。
链接: https://arxiv.org/abs/2605.01147
作者: Tanav Singh Bajaj,Nikhil Singh,Karan Anand,Eishkaran Singh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures. Position paper
Abstract:As large language models are increasingly deployed as interacting agents in high-stakes decisions, the AI safety community assumes that safety properties of individual models will compose into safe multi-agent behavior. This position paper argues that this assumption is fundamentally mistaken. In agentic AI, safety is determined by interaction topology, not model weights. When agents deliberate sequentially or aggregate via parallel voting with a judge, the structure of information flow and decision coupling dominates outcomes. Evidence across model families and scales reveals three persistent topology-driven pathologies: ordering instability, where system behavior depends primarily on agent sequence; information cascades, where early judgments propagate regardless of correctness; and functional collapse, where systems satisfy fairness metrics while abandoning meaningful risk discrimination. Contrary to intuition, scaling to more capable models strengthens these effects by increasing consensus formation and reducing the challenge of initial decisions. These failure modes are invisible to model-centric evaluation and alignment procedures. We argue that agentic AI must be treated as a dynamical system rather than a collection of aligned components. Interaction topology must become a primary target of safety evaluation and regulation, with systems required to demonstrate robustness across architectural variations before deployment.
[AI-170] A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM -Powered Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在自主执行任务过程中因过度自治而引入的新攻击面问题,尤其是通过直接提示注入、间接内容攻击及多轮交互升级策略等手段对代理行为进行操纵的风险。现有防御方法主要依赖提示层过滤和规则型护栏,难以应对风险在交互序列中逐步累积的情况。其解决方案的关键在于提出一种低延迟的欺诈检测层,该层不依赖单个提示的恶意性判断,而是基于结构化的运行时特征(包括提示特性、会话动态、工具使用、执行上下文及欺诈启发信号)对交互轨迹建模,从而识别潜在的对抗性交互模式;该检测层可采用轻量级模型实现,具备实时部署能力,实验表明其速度比基于LLM的检测器快9倍以上,验证了交互级行为检测应成为LLM驱动智能体部署阶段的核心防御组件。
链接: https://arxiv.org/abs/2605.01143
作者: Sheldon Yu,Yingcheng Sun,Hanqing Guo,Julian McAuley,Qianqian Tong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-powered agents demonstrate strong capabilities in autonomous task execution, tool use, and multi-step reasoning. However, their increasing autonomy also introduces a new attack surface: adversarial interactions can manipulate agent behavior through direct prompt injection, indirect content attacks, and multi-turn escalation strategies. Existing defense strategies focus on prompt-level filtering and rule-based guardrails, which are often insufficient when risk emerges gradually across interaction sequences. In this work, we propose a complementary defense mechanism: a low-latency fraud detection layer for detecting adversarial interaction patterns in LLM-powered agents. Instead of determining whether a single prompt is malicious, our approach models risk over interaction trajectories using structured runtime features derived from prompt characteristics, session dynamics, tool usage, execution context, and fraud-inspired signals. The detection layer can be implemented using lightweight models leading to low-latency real-time deployments. To evaluate the framework, we construct a synthetic corpus of 12,000 multi-turn agent interactions generated from parameterized templates that simulate realistic agentic workflows. Using 42 structured features and an XGBoost classifier, our detector achieves over 9 times faster than LLM-based detectors. Through the experiment and ablation studies, our work suggests that interaction-level behavioral detection should become a core component of deployment-time defense for LLM-powered agents.
[AI-171] o Use AI as Dice of Possibilities with Timing Computation
【速读】:该论文旨在解决当前以名词(noun-based)为核心建模范式的局限性,即这种范式无法有效表征未来作为开放时间维度的能力,从而制约了人工智能(AI)在模拟人类思维语法方面的潜力。其解决方案的关键在于提出一种基于动词(verb-based)的新范式,并精确定义了“时序计算”(timing computation)与“可能性”(possibility),使AI能够从纵向电子健康记录(EHR)数据中自动识别具有临床意义的患者轨迹,并进行反事实时序推断。该方法完全基于数据驱动,无需预先领域知识,在机器学习文献中首次实现了此类能力。
链接: https://arxiv.org/abs/2605.01134
作者: Jia Li,Vipin Kumar,Rui Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The dominant noun-based modeling paradigm has fundamentally constrained AI development, precluding any adequate representation of the future as an open temporal dimension. This paper introduces a verb-based paradigm, together with precise definitions of \emphtiming computation and \emphpossibility, that enables AI to function as an effective instrument for realizing the grammar of our thought. Applied to longitudinal EHR data from 3,276 breast cancer patients, the framework empirically demonstrates: (1) automatic discovery of clinically significant patient trajectories, and (2) counterfactual timing deduction. Both results are purely data-driven, require no prior domain knowledge, and, to our knowledge, represent the first such demonstrations in the machine learning literature. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.01134 [cs.AI] (or arXiv:2605.01134v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.01134 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-172] Forag er: a lightweight testbed for continual learning with partial observability in RL
【速读】:该论文旨在解决持续强化学习(Continual Reinforcement Learning, CRL)中因环境非平稳性和部分可观测性导致的模型遗忘问题(loss of plasticity),并强调现有研究对部分可观测性的忽视。其核心挑战在于如何在保持长期学习能力的同时,有效应对动态变化且信息不完整的环境。解决方案的关键在于引入一个轻量级、具有恒定内存占用的部分可观测CRL环境Forager,通过实验证明:尽管当前CRL算法仍存在性能退化,但利用状态构建(state construction)策略能显著提升适应能力,尤其在生成无限新任务流的变体中凸显了现有方法的局限性。
链接: https://arxiv.org/abs/2605.01131
作者: Steven Tang,Xinze Xiong,Anna Hakhverdyan,Andrew Patterson,Jacob Adkins,Jiamin He,Esraa Elelimy,Parham Mohammad Panahi,Martha White,Adam White
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 11 figures
Abstract:In continual reinforcement learning (CRL), good performance requires never-ending learning, acting, and exploration in a big, partially observable world. Most CRL experiments have focused on loss of plasticity – the inability to keep learning – in one-off experiments where some unobservable non-stationarity is added to classic fully observable MDPs. Further, these experiments rarely consider the role of partial observability and the importance of CRL agents that use memory or recurrence. One potential reason for this focus on mitigating loss of plasticity without considering partial observability is that many partially-observable CRL environments are prohibitively expensive. In this paper, we introduce Forager, a light-weight partially-observable CRL environment with a constant memory footprint. We provide a set of experiments and sample tasks demonstrating that Forager is challenging for current CRL agents and yet also allows for in-depth study of those agents. We demonstrate that agents exhibit loss of plasticity, proposed mitigations can help, but that most useful is to leverage state construction. We conclude with a variant of Forager that generates an unending stream of new tasks to learn that clearly highlights the limitations of current CRL agents.
[AI-173] Iterative Finetuning is Mostly Idempotent
【速读】:该论文旨在探究模型在持续自我迭代训练过程中,其初始行为倾向(如谄媚倾向或对齐偏差)是否会因使用自身输出作为训练数据而被放大。研究通过多轮微调实验,对比了监督微调(Supervised Fine-Tuning, SFT)、合成文档微调(Synthetic Document Fine-Tuning, SDF)和直接偏好优化(Direct Preference Optimization, DPO)三种设置下行为特性的演化规律。关键发现是:在非强化学习(non-RL)的微调方法中,特性放大极为罕见且高度依赖数据量,几乎不会意外发生;而在DPO设置中,若模型在每轮训练中持续使用自身输出进行偏好优化,则特性可稳定放大;但若每轮重新初始化模型,则放大现象消失。因此,解决方案的关键在于限制持续后训练(continual post-training)阶段,这可能是抑制有害行为放大的有效策略。此外,放大与连贯性之间的权衡关系也构成了天然的防御机制。
链接: https://arxiv.org/abs/2605.01130
作者: Zephaniah Roe,Jack Sanderson,Dang Nguyen,Julian Huang,Todd Nief,Aryan Shrivastava,Chenhao Tan,Ari Holtzman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:If a model has some behavioral tendency, such as sycophancy or misalignment, and it is trained on its own outputs, will the tendency be amplified in the next generation of models? We study this question by training a series of models where each model is finetuned on data generated by its predecessor, and the initial model is seeded with some persona or belief. We test three settings: supervised finetuning (SFT) on instruct models, synthetic document finetuning (SDF) on base models, and direct preference optimization (DPO). In the SFT and SDF settings, traits mostly decay or remain constant so that further finetuning cycles do nothing. In rare cases when amplification occurs, it generally comes at the cost of coherence. In the DPO setting, trait amplification can reliably occur when a model is continually trained with a preference for its own outputs, but vanishes when models are reinitialized at each cycle. Overall, our results suggest that amplification most likely comes from continual post-training, and limiting this stage may be an effective defense. For non-RL finetuning, trait amplification is rare and very sensitive to data quantity, making it significantly less likely to occur accidentally. Finally, the amplification-coherence tradeoff serves as a natural deterrent against trait amplification.
[AI-174] PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLM s ACL-2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在教育场景中生成自动反馈时,如何在保持内容诊断准确性的同时,将其输出风格与特定教师的评语语气和结构对齐的问题。解决方案的关键在于提出一种名为PERSA的强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)流程:它通过监督微调获取教授示范、基于成对偏好构建奖励模型,并采用近端策略优化(Proximal Policy Optimization, PPO)进行训练;同时,受Transformer内部机制启发,PERSA采用参数高效微调策略,仅更新顶层Transformer块及其前馈投影层,从而最小化全局参数漂移并提升风格可控性。实验表明,该方法在多个代码反馈基准(APPS、PyFiXV、CodeReviewQA)上显著提升了风格一致性(如SAC达96.2%),同时维持100%的内容正确率,实现了“说什么”与“怎么说”的双重个性化。
链接: https://arxiv.org/abs/2605.01123
作者: Ravi Ranjan,Utkarsh Grover,Xiaomin Lin,Agoritsa Polyzou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures, 7 tables, accepted to conference ACL-2026, BEA
Abstract:Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLMs style with a specific instructors tone while maintaining diagnostic correctness remains challenging. We ask how can we update an LLM for automated feedback generation to align with a target instructors style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professors grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter efficient fine-tuning. It updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness, for example on APPS, it boosts Style Alignment Score (SAC) to 96.2% (from 34.8% for Base) with Correctness Accuracy (CA) up to 100% on Llama-3, and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone and structure).
[AI-175] New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search
【速读】:该论文致力于解决组合数学中的Zarankiewicz问题,即确定在不包含完全二部子图 Ks,t 的前提下,二部图 Gm,n 可能拥有的最大边数 Z(m,n,s,t)。这一问题在极值图论中具有重要意义,但精确解长期难以获得。本文首次精确求解了三个新的Zarankiewicz数:Z(11,21,3,3)=116、Z(11,22,3,3)=121 和 Z(12,22,3,3)=132,并为41个其他参数提供了更优的下界,其中多个结果逼近已知上界。解决方案的关键在于提出并实现了一种基于大语言模型(Large Language Models, LLMs)的进化算法框架 OpenEvolve,该框架通过定制化的奖励信号迭代优化生成数学构造的算法,从而高效发现高质量的极值图结构。实验表明,该方法计算成本极低(每组参数不到30美元),具备可重复性和易访问性,展示了LLM引导的进化搜索在数学研究中的强大潜力。
链接: https://arxiv.org/abs/2605.01120
作者: Jay Bhan,Nicole Nobili,Srinivasan Raghuraman,Patrick Langer
机构: 未知
类目: Artificial Intelligence (cs.AI); Combinatorics (math.CO)
备注: *Jay Bhan and Nicole Nobili contributed equally to this work as first authors, and their order was determined via coin flip
Abstract:The Zarankiewicz number \textbfZ(m, n, s, t) is the maximum number of edges in a bipartite graph G_m, n such that there is no complete K_s, t bipartite subgraph. We determine for the first time the exact values of three Zarankiewicz numbers: \textbfZ(11, 21, 3, 3)=116 , \textbfZ(11, 22, 3, 3)=121 , and \textbfZ(12, 22, 3, 3)=132 . We further establish lower bounds for 41 more Zarankiewicz numbers, including several that are within one edge of the best known upper bound, and we match the established value in four more closed cases. Our results are obtained using OpenEvolve, an open-source evolutionary algorithm based on Large Language Models (LLMs) that iteratively improves algorithms for generating mathematical constructions by optimizing a reward signal which we tailored for this specific problem. These findings provide new extremal graph constructions and demonstrate the potential of LLM-guided evolutionary search to contribute to mathematical research. In addition to presenting the resulting constructions, we report the generation algorithms produced, describe the relevant implementation details, and provide our computational costs. Our costs are remarkably low, at less than \ 30 for each Zarankiewicz parameter combination, showing that LLM-guided evolutionary search can be an inexpensive, reproducible, and accessible tool for discovering new combinatorial constructions.
[AI-176] owards Multi-Agent Autonomous Reasoning in Hydrodynamics
【速读】:该论文旨在解决单智能体系统(Single-agent System, SAS)在大语言模型(LLM)驱动的科学工作流中因上下文窗口有限而导致的“上下文饱和”问题,即随着工具调用规范和观测轨迹积累,每个决策可用的有效上下文不断减少,进而影响端到端的可靠性。其解决方案的关键在于提出一种基于图结构的多智能体系统(Multi-agent System, MAS)原型,通过分层执行图(Layer Execution Graph, LEG)协调多个专业化智能体:规划器智能体根据自然语言路由启发式构建查询特定的执行拓扑,而非硬编码控制逻辑;专家智能体受限于工具白名单并承担互补的数据类别角色;层间聚合智能体将并行输出融合为简洁摘要,报告智能体最终合成响应,并全程记录工具调用的溯源信息以保障可审计性。此架构有效缓解了单体架构下的上下文瓶颈,实验证明其在多种复杂度查询下保持高精度与鲁棒性。
链接: https://arxiv.org/abs/2605.01102
作者: Jinpai Zhao,Albert Cerrone,Joannes Westerink,Clint Dawson
机构: 未知
类目: Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Single-agent systems (SAS) have become the default pattern for LLM-driven scientific workflows, but routing planning, tool use, and synthesis through a single context window comes with a well-known cost: as tool specifications and observational traces accumulate, the effective context available for each decision shrinks, and end-to-end reliability suffers. We present a multi-agent system (MAS) prototype for hydrodynamics in which specialized agents are coordinated through a Layer Execution Graph (LEG). A planner agent constructs query-specific execution topologies from natural-language routing heuristics that capture domain knowledge without hard-coding it as rigid control logic; specialist agents operate under strict tool allowlists and occupy complementary data-class roles. Between layers, consolidator agents fuse parallel outputs into concise briefs, and a reporter agent synthesizes the final response, while the runtime logs provenance for every tool invocation to support auditability. All benchmarks, ablations, and stress tests use Claude Sonnet~4.6 as the backbone model for both specialist and general-purpose agents. Evaluated on 37 queries spanning six complexity categories, the prototype achieves 93.6% factual precision with a 100% pass rate. Accuracy remains above 90% across runs from single-threaded to five independent parallel tracks, and under simulated loss of individual data sources the system degrades gracefully, still returning substantive partial answers. Together, these results suggest that planner-guided, graph-structured multi-agent orchestration can meaningfully alleviate the context-saturation bottlenecks that constrain monolithic single-agent architectures.
[AI-177] A Knowledge-Driven LLM -Based Decision-Support System for Explainable Defect Analysis and Mitigation Guidance in Laser Powder Bed Fusion
【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)中激光粉末床熔融(Laser Powder Bed Fusion, LPBF)工艺缺陷诊断与治理缺乏可解释性、系统性和知识驱动支持的问题。解决方案的关键在于构建一个基于本体(ontology)的知识驱动型大语言模型(Large Language Model, LLM)决策支持系统,将27类已知LPBF缺陷按层次化类别和因果关系组织成结构化知识库,并结合模糊自然语言查询、文献支持的缺陷解释以及过程知识编码的缓解策略生成能力;同时引入基于基础模型(foundation models)的多模态图像评估模块,实现描述符引导下的微观缺陷图像语义对齐分析,从而提升缺陷识别与治理建议的一致性、可解释性与实用性。
链接: https://arxiv.org/abs/2605.01100
作者: Basit Mahmud Shahriar,Md Habibor Rahman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 15 figures
Abstract:This work presents a knowledge-driven decision-support system that integrates structured defect knowledge with LLM-based reasoning to provide explainable defect diagnosis and mitigation guidance in manufacturing, using LPBF as a representative, safety-critical case study. The proposed ontology-integrated LLM-based decision support system for LPBF defect analysis and mitigation guidance is built on a knowledge base containing 27 known LPBF defect types organized into hierarchical categories and causal relationships. The developed system supports fuzzy natural language queries for systematic knowledge retrieval, literature-supported explanation of defects, and guidance on defect causes and mitigation strategies derived from encoded process knowledge. Furthermore, a multimodal image-assessment module based on foundation models enables descriptor-guided interpretation of representative microscopic defect images through semantic alignment scoring. The proposed framework was evaluated through qualitative comparisons with general-purpose vision-language models, an ablation study, and an inter-rater reliability analysis. Evaluation on the literature-derived dataset showed that the fully integrated configuration outperformed the other three evaluated system configurations, achieving a macro-average F1 score of 0.808. Additionally, inter-rater reliability analysis using Cohen’s kappa indicated substantial agreement between the model outputs and the literature-derived reference labels. These findings suggest that ontology-guided knowledge representation can improve the consistency, interpretability, and practical usefulness of LLM-assisted LPBF defect analysis.
[AI-178] A Sentence Relation-Based Approach to Sanitizing Malicious Instructions
【速读】:该论文旨在解决生成式 AI(Generative AI)在检索增强生成(Retrieval-Augmented Generation, RAG)和工具集成大语言模型(Large Language Model, LLM)代理中,因依赖外部文本源而引入的恶意指令注入攻击问题。此类攻击可触发模型产生非预期行为,现有基于LLM的检测方法和训练型防御策略存在易受优化攻击和泛化能力差的缺陷。解决方案的关键在于提出SONAR框架,其通过构建用户查询与外部数据之间的句级关系图,利用自然语言推理(Natural Language Inference, NLI)中的蕴含(entailment)与矛盾(contradiction)得分作为边权重,识别偏离核心任务的句子,并采用基于连通性的剪枝策略移除被标记的注入种子及其相关邻接节点,同时保留良性上下文,从而实现对恶意内容的有效净化。
链接: https://arxiv.org/abs/2605.01078
作者: Soumil Datta,Melissa Umble,Daniel S. Brown,Guanhong Tao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented generation and tool-integrated LLM agents increasingly depend on external textual sources. This reliance broadens the available attack surface, allowing adversaries to insert malicious instructions that trigger unintended model behaviors. Current defensive measures often utilize LLM-based detectors to filter such content, but these approaches remain vulnerable to optimization-based attacks. Additionally, training-based methods frequently fail to generalize to novel data distributions. To resolve these issues, we introduce SONAR, a prompt sanitization framework that identifies and removes injected content using metrics from natural language inference. Specifically, SONAR constructs a sentence-level relational graph across the user query and external data. By using entailment and contradiction scores as edge weights, the system identifies sentences that deviate from the core task. It then employs connectivity-driven pruning to eliminate flagged injection seeds and their related neighbors while maintaining benign context. Rigorous evaluations across several models and datasets show that SONAR reduces the attack success rate to nearly zero, significantly outperforming nine established baseline defenses.
[AI-179] SCION: Size-aware Policy Orchestration for Nonstationary Object Caches (Long Paper Version)
【速读】:该论文旨在解决云和边缘服务中对象缓存(object cache)在异构、非平稳且吞吐受限的生产工作负载下,如何实现高效缓存策略选择的问题。现有简单非机器学习(non-ML)策略如SIEVE和S3-FIFO已构成强基线,因此任何学习型方法必须具备低开销、对数据漂移鲁棒,并能与最优专家策略竞争。解决方案的关键在于提出SCION框架——一种轻量级策略编排机制,通过计算部署在关键路径外的极小工作负载指纹(tiny workload fingerprint),从一组可部署缓存策略(包括GDSF、S3-FIFO、SIEVE、LHD、W-TinyLFU-AV和DynamicAdaptiveClimb)中动态选择最优策略;其原型AUTO利用对象大小、缓存性、重用性和缓存容量的短前缀统计信息,结合离线训练的线性选择器进行决策,实现了在多数工作负载上优于SIEVE的缓存命中率(cacheable-only object miss ratio),同时保持接近最佳单一专家策略的平均性能,并支持显式的OMR/BMR权衡选择。
链接: https://arxiv.org/abs/2605.01055
作者: Qizhi Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures, 26 tables. Code repository: this https URL
Abstract:Object caches underpin cloud and edge services, but production workloads are heterogeneous, nonstationary, and throughput-constrained. Recent simple non-ML policies such as SIEVE and S3-FIFO set a strong baseline, so any learned method must be overhead-aware, robust under drift, and competitive with strong experts. We present SCION, a lightweight policy-orchestration framework that selects among a small set of deployable cache policies using a tiny workload fingerprint computed off the critical path. Our prototype, AUTO, uses short-prefix statistics of object size, cacheability, reuse, and cache size, then applies an offline-trained linear selector to choose among GDSF, S3-FIFO, SIEVE, LHD, W-TinyLFU-AV, and DynamicAdaptiveClimb; a simpler SCION-P90 variant uses only a p90 threshold. In a CPU-only, trace-driven evaluation on 30 public object-cache traces and a separate HR-Cache simulator subset, AUTO improves cacheable-only object miss ratio over SIEVE on a majority of workloads, stays close to the best single expert on average, enables explicit OMR/BMR tradeoff selection, and remains competitive on byte miss ratio. Under a fast-policy budget, AUTO-fast achieves lower cost than the best fixed fast policy. SCION reduces regime-mismatch risk while keeping the hot path unchanged.
[AI-180] Value Functions for Temporal Logic: Optimal Policies and Safety Filters
【速读】:该论文旨在解决在无折扣无限时域设定下,基于Q函数的贪婪策略在处理复杂任务(如嵌套Until、Globally和Globally-Until等时序逻辑(Temporal Logic, TL)规范)时可能出现的病理行为问题——即即使价值函数达到最优,贪婪策略仍可能无限期推迟任务完成。解决方案的关键在于:利用近期关于时序逻辑价值函数分解为子价值函数图的研究成果,构造依赖状态历史的非马尔可夫策略(non-Markovian policies),从而避免上述路径依赖的延迟行为,并证明这些策略在嵌套Until、Globally和Globally-Until规范下的定量鲁棒性评分(quantitative robustness score)上是最优的。此外,论文还展示了Q函数如何作为安全过滤器,扩展了先前仅适用于简单避障或到达-避障任务的结果,以支持更复杂的TL规范。
链接: https://arxiv.org/abs/2605.01051
作者: Oswin So,William Sharpless,Sylvia Herbert,Chuchu Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Optimization and Control (math.OC)
备注:
Abstract:While Bellman equations for basic reach, avoid, and reach-avoid problems are well studied, the relationship between value optimality and policy optimality becomes subtle in the undiscounted infinite-horizon setting, particularly for more complicated tasks. Greedily maximizing the Q-function can produce policies that indefinitely defer task completion for reach-avoid problems, or equivalently, Until specifications, even when the value function is optimal. Building upon recent results decomposing the value function for temporal logic (TL) into a graph of constituent value functions, we construct non-Markovian policies based on state history that avoid this pathology and prove their optimality with respect to the quantitative robustness score for nested Until, Globally, and Globally-Until specifications. We further show how the Q function can serve as a safety filter for complex TL specifications, extending prior results beyond simple avoid or reach-avoid tasks.
[AI-181] Certified Purity for Cognitive Workflow Executors: From Static Analysis to Cryptographic Attestation
【速读】:该论文旨在解决认知工作流系统中治理执行的不安全性问题,即如何将原本依赖运行时约定的治理机制转变为具有结构化能力边界的强制性保障。此前的三层治理架构虽能证明治理完整性、溯源完整性和不可规避性(ungoverned effects),但其有效性依赖于纯模块约束——即步骤执行器不得产生副作用,而该约束通过模块导入图分析实现,易受BEAM虚拟机上的对抗性绕过攻击。解决方案的关键在于提出一种认证纯净性架构(certified purity architecture),包含四大核心机制:(1) 限制WebAssembly编译目标以结构性移除副作用指令;(2) 纯净性证书(purity certificates),通过密码学签名绑定执行器二进制与其导入分类;(3) 运行时验证门控机制,在执行器进入治理流水线前拒绝未认证实体;(4) 基于远程证明的可移植治理凭证,支持跨组织验证。论文形式化证明了四项定理,确保结构纯净性、消除五类BEAM绕过路径、证书完整性及门控完备性,且整体保证基于显式的可信计算基(Trusted Computing Base)。
链接: https://arxiv.org/abs/2605.01037
作者: Alan L. McCann
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 23 pages, 4 figures, 8 tables. Companion proofs: this https URL . Project: this https URL
Abstract:We present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention into a structural capability boundary. A prior three-layer governance architecture proves governance completeness, provenance completeness, and the impossibility of ungoverned effects, conditional on the pure module constraint: that step executors cannot perform effects. That constraint was enforced by module import graph analysis, which is insufficient against adversarial bypass on the BEAM virtual machine. This paper closes the gap through four mechanisms: (1) a restricted WebAssembly compilation target where effect-producing instructions are structurally absent; (2) purity certificates, cryptographically signed proofs binding executor binaries to their import classifications; (3) a runtime verification gate that rejects uncertified executors before they enter the governance pipeline; and (4) portable governance credentials via remote attestation for cross-organizational verification. We prove four theorems: structural purity by construction, bypass elimination for all five BEAM bypass classes, certificate integrity, and gate completeness. The guarantee holds relative to an explicit Trusted Computing Base. Evaluation on four implemented executors shows verification latency of 39–42 us, full plan cycle under 400 us, runtime overhead under 0.4% of a 100 ms HTTP request, and zero determinism divergences across repeated invocations.
[AI-182] Algebraic Semantics of Governed Execution: Monoidal Categories Effect Algebras and Coterminous Boundaries
【速读】:该论文旨在解决程序执行过程中治理(governance)的建模与形式化问题,即如何在保证安全性、透明性和适当性的前提下,实现程序行为的可验证控制与组合性。其核心挑战在于将治理机制嵌入到计算模型中,使其不仅具备逻辑一致性,还能与程序表达能力严格对应。解决方案的关键在于提出了一种代数语义框架——GovernanceAlgebra,通过三个公理(安全性、透明性、适当性)构建一个对称单子范畴(symmetric monoidal category),并利用交互树(interaction trees)和参数化共归纳(parameterized coinduction)进行形式化实现。该框架确保所有张量组合均保持治理不变,并通过代数效应系统约束处理程序(handler)只能构造出符合治理要求的形式;进一步地,通过能力索引的组合包(capability-indexed composition)与双重保证定理(dual guarantee theorem),实现了运行时治理与静态能力边界的同步验证。最终,该模型确立了“同界边界”(coterminous boundary):所有可通过基本态射构造器表达的程序皆受治理,且每个受治理程序均可由这类构造器映射得到,从而在保留图灵完备性的同时排除未经中介的I/O操作,使治理成为程序表达的本质属性。
链接: https://arxiv.org/abs/2605.01032
作者: Alan L. McCann
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注: 26 pages, 1 figure, 1 table. Companion proofs: this https URL . Project: this https URL
Abstract:We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibility. The framework, mechanized in 32 Rocq modules (~12,000 lines, 454 theorems, 0 admitted), is built on interaction trees and parameterized coinduction. A three-axiom GovernanceAlgebra record (safety, transparency, properness) induces a symmetric monoidal category with verified pentagon, triangle, and hexagon coherence, where every tensor composition preserves governance. An algebraic effect system constrains the handler algebra so that only governance-preserving handlers can be constructed in the safe fragment; programs in the empty capability set provably emit only observability directives. Capability-indexed composition bundles programs with machine-checked capability bounds, and a dual guarantee theorem establishes that within_caps and gov_safe hold simultaneously under all composition operators. The capstone result is the coterminous boundary: within our formal model, every program expressible via the four primitive morphism constructors is governed under interpretation, and every governed program is the image of such a program. Turing completeness is preserved inside governance; unmediated I/O is excluded from the governed fragment. Governance denial is modeled as safe coinductive divergence. The governance algebra is parametric: any system instantiating the three axioms inherits all derived properties, including convergence, compositional closure, and goal preservation. Extracted OCaml runs as a NIF in the BEAM runtime, with property-based testing (70,000+ random inputs, zero disagreements) confirming behavioral equivalence between the specification and the runtime interpreter.
[AI-183] Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation Expressive Minimality and Decidability Boundaries
【速读】:该论文旨在解决生成式 AI (Generative AI) 工作流架构中效应级治理(effect-level governance)与内部计算表达能力之间潜在冲突的问题,即如何在不牺牲程序内部计算灵活性的前提下实现对副作用(如内存访问、外部调用和大语言模型(LLM)查询)的有效控制。解决方案的关键在于形式化定义了一个治理算子 G,该算子通过 Interaction Trees 在 Rocq 8.19 中建模并介入所有效应性指令,并证明了七个核心性质:包括受治理的图灵完备性(P1)、受治理的预言机表达能力(P2)、治理谓词的可判定边界(P3)、允许执行下的目标保持性(P4)、原始能力的表达最小性(P5)、结构治理严格优于内容过滤的非对称性(P6),以及语义透明性(P7)——即在治理允许的所有执行路径上,受治理解释与未治理解释在观测层面等价(仅差治理独有事件)。这些结果表明,治理与计算表达能力是正交维度:治理仅约束程序的效应边界,而对内部计算语义保持透明。
链接: https://arxiv.org/abs/2605.01030
作者: Alan L. McCann
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注: 15 pages. Companion proofs: this https URL . Project: this https URL
Abstract:We present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be imposed without reducing internal computational expressivity. Using Interaction Trees in Rocq 8.19, we define a governance operator G that mediates all effectful directives, including memory access, external calls, and oracle (LLM) queries. Our development compiles with 0 admitted lemmas and consists of 36 modules, ~12,000 lines of Rocq, and 454 theorems. We establishseven properties: (P1) governed Turing completeness, (P2) governed oracle expressivity, (P3) a decidability boundary in which governance predicates are total and closed under Boolean composition while semantic program properties remain non-trivial and undecidable by governance, (P4) goal preservation for permitted executions, (P5) expressive minimality of primitive capabilities (compute, memory, reasoning, external call, observability), (P6) subsumption asymmetry showing structural governance strictly subsumes content-level filtering, and (P7) semantic transparency: on all executions where governance permits, the governed interpretation is observationally equivalent (modulo governance-only events) to the ungoverned interpretation. Together, these results show that governance and computational expressivity are orthogonal dimensions: governance constrains the effect boundary of programs while remaining semantically transparent to internal computation.
[AI-184] Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning
【速读】:该论文旨在解决现有自监督学习方法在处理多源生物信号时忽视时序方向性动态的问题,即这些方法通常将来自身体不同部位的生物信号视为可互换的视图,而未考虑它们之间由生理过程驱动的有序时间关系。例如,心电图(ECG)与光电容积脉搏波描记法(PPG)之间存在明确的时序结构:ECG记录心脏电激活,PPG则反映因血管动力学延迟后的外周脉搏。解决方案的关键在于提出xMAE框架,通过引入跨模态掩码重建任务并强制时序顺序约束,使模型在预训练阶段学习到具有生理意义的时间结构,从而提升下游任务性能。实验表明,该方法在19项任务中的15项上优于单模态和多模态基线,并具备跨设备、体位和采集条件的良好泛化能力。
链接: https://arxiv.org/abs/2605.00973
作者: Hao Zhou,Simon A. Lee,Cyrus Tanade,Keum San Chun,Juhyeon Lee,Migyeong Gwak,Megha Thukral,Justin Sung,Eugene Hwang,Mehrab Bin Morshed,Li Zhu,Viswam Nathan,Md Mahbubur Rahman,Subramaniam Venkatraman,Sharanya Arcot Desai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Proceedings of the 43rd International Conference on Machine Learning
Abstract:Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process. However, most existing self supervised learning methods treat these signals as interchangeable views, overlooking the directional temporal dynamics that link them. A canonical example is the relationship between electrocardiography (ECG), which captures the electrical activation initiating each heartbeat, and photoplethysmography (PPG), which records the resulting peripheral pulse delayed by vascular dynamics. To capture this structured relationship, we introduce xMAE, a biosignal pretraining framework that leverages masked cross modal reconstruction across temporally ordered biosignals as a training time constraint to encourage physiologically meaningful timing structure in the learned representations. We show that pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis suggests that the ECG PPG timing structure is reflected in the learned PPG representations. More broadly, xMAE demonstrates the effectiveness of incorporating temporal structure into multimodal pretraining when signals observe different stages of a shared underlying process. Code is available at this https URL.
[AI-185] Ablation Study of Multimodal Perception Language Grounding and Control for Human-Robot Interaction in an Object Detection and Grasping Task
【速读】:该论文旨在解决多模态人-机器人交互系统中各核心模块对端到端性能影响不明确的问题,尤其关注语言模型(language model)、感知系统(perception system)和控制器(controller)这三个关键组件的独立贡献与协同效应。解决方案的关键在于设计了一个受控的消融实验(controlled ablation study),在统一实验协议下系统性地比较三种语言模型、五种感知配置和三种控制器,并进一步对最优候选组合进行因子分析(factorial study),从而量化各模块对执行时间与成功率的影响,识别未来工程优化中最可能带来显著收益的方向。
链接: https://arxiv.org/abs/2605.00963
作者: Zi Tian,Guanting Shen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:This manuscript extends our previous multimodal human-robot interaction system by introducing a controlled ablation study of the three modules that most strongly influence end-to-end performance: the large language model used for action extraction, the perception system used for visual grounding, and the controller used for motion execution. The goal is not to redesign the full pipeline, but to isolate the contribution of each component under a common experimental protocol and then evaluate the best combinations end-to-end. We therefore compare three language models, five perception configurations, and three controllers, followed by a second-stage factorial study over the best candidates. The resulting analysis is intended to clarify which choices primarily affect execution time, which primarily affect success rate, and where the largest engineering gains are likely to come from in future revisions of the system.
[AI-186] E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中存在文档级成员推理攻击(Document-level Membership Inference Attack, MIA)的问题,即在黑盒环境下,攻击者仅通过查询响应交互即可推断某候选文档是否被纳入RAG的知识库,从而泄露敏感内容和知识库覆盖范围。现有方法要么依赖语义相似性等软信号导致区分度低,要么使用显式探测查询易被检测或拒绝。本文提出E-MIA,其核心创新在于将目标文档中的可验证硬证据(如细粒度细节、专有名词/技术术语、定义性陈述、元数据线索及因果/约束关系)转化为包含四种客观评分题型(填空题/FB、单选题/SC、多选题/MC、判断题/T/F)的“考试”,并以多个针对性问题的综合得分作为成员身份判别信号,实现了高区分度、隐蔽性强且无需显式探测的攻击效果。
链接: https://arxiv.org/abs/2605.00955
作者: Zelin Guan,Shengda Zhuo,Zeyan Li,Jinchun He,Wangjie Qiu,Zhiming Zheng,Shuqiang Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) equips large language models (LLMs) with external evidence by retrieving documents at inference time, but it also turns the retrieval corpusinto a sensitive asset. Under a black-box setting, an adversary given a candidate document can infer whether it has been ingested into the RAG knowledge base (i.e., document-level membership inference) solely from query response interactions, thereby leaking corpus coverage and the existence of sensitive topics. Existing RAG MIA methods either rely on soft signals such as semantic similarity, which often yield overlapping member/non-member score distributions and unstable thresholds, or employ explicit confirmation probes whose intent is conspicuous and thus prone to refusal and detection. We propose E-MIA, which converts verifiable hard evidence in the target document (e.g., fine-grained details, proper nouns/technical terms, definitional statements, metadata cues, and causal/constraint relations) into an exam with four objectively gradable question types (FB/SC/MC/T/F), and uses the aggregated exam score across multiple evidence targeted questions as the membership signal. Experiments across multiple datasets and diverse RAG configurations demonstrate that E-MIA improves member/non-member separability in stringent settings while preserving natural, stealthy queries, and we further analyze the impact of question composition and exam length on attack effectiveness.
[AI-187] Graph Rewiring in GNNs to Mitigate Over-Squashing and Over-Smoothing: A Survey IJCAI2026
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理图结构数据时面临的两个关键问题:过压缩(over-squashing)和过平滑(over-smoothing)。过压缩指远距离节点信息在传播过程中被过度压缩,导致信息丢失;过平滑则表现为多次消息传递后节点表示趋于相似,丧失区分能力。这两个问题均源于消息传递机制与输入图拓扑之间的相互作用,最终损害信息流动并限制GNN性能。论文提出通过图重布线(graph rewiring)技术作为解决方案,即主动调整图的拓扑结构以优化信息传播路径,从而缓解上述现象。其核心在于设计能够增强信息流效率的拓扑重构策略,在理论分析与实践实现之间取得平衡,并权衡不同方法在性能提升与计算开销之间的 trade-off。
链接: https://arxiv.org/abs/2605.00951
作者: Hugo Attali,Nathalie Pernelle,Davide Buscaldi,Fragkiskos D. Malliaros
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI 2026), Survey Track
Abstract:Graph Neural Networks are powerful models for learning from graph-structured data, yet their effectiveness is often limited by two critical challenges: over-squashing, where information from distant nodes is excessively compressed, and over-smoothing, where repeated propagation makes node representations indistinguishable. Both phenomena stem from the interaction between message passing and the input topology, ultimately degrading information flow and limiting the performance of GNNs. In this survey, we examine graph rewiring techniques, a class of methods designed to modify the graph topology to enhance information propagation in GNNs. We provide a comprehensive review of state-of-the-art rewiring approaches, delving into their theoretical underpinnings, practical implementations, and performance trade-offs.
[AI-188] Interpretable experiential learning based on state history and global feedback
【速读】:该论文旨在解决资源受限环境下强化学习(Reinforcement Learning, RL)问题,即在计算资源有限的情况下实现高效、可解释的策略学习。其解决方案的关键在于提出一种基于状态历史与全局反馈的可解释经验学习模型,该模型通过构建状态集合间的转移图来表示行为策略,其中每个转移边附带效用值(utility)和证据计数(evidence count),从而在保证性能的同时提升模型的可解释性与适应性。
链接: https://arxiv.org/abs/2605.00940
作者: Anton Kolonin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 figures
Abstract:A new interpretable experiential learning model based on state history and global feedback is presented. It is capable of learning a behavioral model represented by a transition graph between sets of states, with transitions attributed with utility and evidence count. This model is expected to be suitable for solving reinforcement learning problem in resource-constrained environments. The model was thoroughly evaluated on the OpenAI Gym Atari Breakout benchmark, demonstrating performance comparable to some known neural network-based solutions.
[AI-189] From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
【速读】:该论文旨在解决传统幻觉检测方法在应对“顽固性幻觉”(Stubborn Hallucinations)时失效的问题,即大语言模型(LLM)在高置信度下产生错误信息的情形。其解决方案的关键在于提出一种几何感知的检测机制——嵌入扰动梯度敏感性(Embedding-Perturbed Gradient Sensitivity, EPGS),该方法通过向输入嵌入(embedding)添加高斯噪声并测量梯度幅值的突增来捕捉参数空间中损失函数的尖锐程度(sharpness),从而区分稳定知识(位于平坦极小值区域)与依赖脆弱记忆的不稳定幻觉(位于尖锐极小值区域)。EPGS作为海森矩阵谱(Hessian spectrum)的有效代理,显著优于基于熵和表征的基线方法,为高置信度事实性错误提供了鲁棒的检测信号。
链接: https://arxiv.org/abs/2605.00939
作者: Yee Zhing Liew,Andrew Huey Ping Tan,Anwar P.P Abdul Majeed
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures, 8 tables
Abstract:Traditional hallucination detection fails on “Stubborn Hallucinations” – errors where LLMs are confidently wrong. We propose a geometric solution: Embedding-Perturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient proxy for the Hessian spectrum, differentiating stable knowledge from unstable memorization. Our experiments show that EPGS significantly outperforms entropy-based and representation-based baselines, providing a robust signal for detecting high-confidence factual errors.
[AI-190] Fusing Urban Structure and Semantics: A Conditional Diffusion Model for Cross-City OD Matrix Generation
【速读】:该论文旨在解决城市通勤OD(Origin-Destination)矩阵生成中因个体意图、地理约束与社会动态共同作用导致的模式异质性问题,从而提升模型在不同城市间的泛化能力。解决方案的关键在于提出一种结构增强型扩散模型SEDAN,其核心创新是将城市建模为带属性的图结构:每个区域作为节点并附带人口统计学和兴趣点特征,通勤流作为加权边;通过融合邻接矩阵与距离矩阵来显式编码空间结构信息——邻接矩阵引导注意力机制强化相邻区域间交互,距离矩阵则作为扩散条件以捕捉空间接近性和出行阻力;同时利用基于图Transformer的节点交互建模潜在出行需求,实现语义信息与空间约束的协同建模,从而生成兼具行为合理性与地理一致性的OD矩阵。
链接: https://arxiv.org/abs/2605.00938
作者: Bin Chen,Zhuoya Meng,Fang Yang,Runkang Guo,Jingtao Ding,Yin Zhang,Chuan Ai,Zhengqiu Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate modeling of commuting flows is important for urban governance, traffic planning, and resource allocation. However, the combined influence of individual intentions, geographic constraints, and social dynamics leads to considerable heterogeneity in commuting patterns, making it difficult to develop generation models that generalize across cities. To address this issue, we propose SEDAN, a Structure-Enhanced Diffusion model conditioned on Attributed Nodes for generalizable OD matrix generation. SEDAN models a city as an attributed graph. Each region is treated as a node with demographic and point-of-interest features, and commuting flows are modeled as weighted edges. Adjacency and distance matrices are incorporated to characterize spatial structure. Based on this representation, we design a fusion mechanism within SEDAN to jointly model semantic information and spatial information. Regional semantic attributes are used to model latent travel demand through graph-transformer-based node interactions, while spatial structure is injected into the generation process as explicit constraints. The adjacency matrix guides attention weights to strengthen interactions between neighboring regions. Meanwhile, the distance matrix serves as a diffusion condition to capture spatial proximity and travel impedance. The fusion of urban semantics and spatial constraints enables SEDAN to generate OD matrices that are both behaviorally plausible and geographically coherent. Experiments on real-world OD datasets from U.S. cities show that SEDAN achieves a 7.38% improvement in RMSE over the state-of-the-art baseline, WEDAN. It also remains robust across heterogeneous urban scenarios and varying structural patterns. Our work provides an effective and generalizable solution for commuting OD matrix generation. The code is available at this https URL.
[AI-191] EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems
【速读】:该论文旨在解决云服务系统中异常检测与定位(Anomaly Detection and Localization, ADL)问题,特别是针对现有方法主要依赖指标(metric)和日志(log)数据而忽视事件(event)数据的局限性。解决方案的关键在于提出首个开源的基于事件的ADL框架EventADL,其核心创新包括:首先通过离线训练阶段学习两种模式——事件语义模式(Event Semantic Patterns, ESPs),刻画系统实体间正常交互关系;以及事件频率模式(Event Frequency Patterns, EFPs),刻画已知ESP的正常出现频率;其次在在线检测阶段,利用上述模式识别偏离正常行为的异常事件;最后构建干预图(Intervention Graph)以建模近期系统交互与异常之间的因果关系,实现自动根因定位。该方案无需标签数据即可高效运行,并提供可解释的异常及其根因,实验表明其在真实云服务系统中实现了至少90%的F1分数和100%的前3位根因定位准确率。
链接: https://arxiv.org/abs/2605.00936
作者: Luan Pham,Victor Nicolet,Joey Dodds,Hui Guan,Daniel Kroening
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to the FSE’26 Conference - Research Track
Abstract:Anomaly detection and localization (ADL) is critical for maintaining reliability and availability in cloud systems. Recent ADL developments focus on metric and log data, leaving event data unexplored. To address this gap, we propose EventADL, the first open-box event-based ADL framework for cloud-based service systems. To motivate the design of our framework, we conduct a systematic analysis on 520 real-world incidents, and provide insights into how anomalies and their root causes manifest through event data. EventADL has three phases: offline training, online anomaly detection, and root cause localization. During the training phase, EventADL first learns Event Semantic Patterns (ESPs), which capture normal interactions between system entities using historical event data, and then learns Event Frequency Patterns (EFPs), which capture the normal frequency of known ESPs. In the online anomaly detection phase, any data in the event stream that deviates significantly from either pattern is identified as anomalous. For localization, EventADL constructs an Intervention Graph that models the relationships between recent system interactions and the detected anomalies for automatic root cause localization. The framework is designed to operate efficiently with unlabeled data and to produce interpretable anomalies with their corresponding root causes. Our evaluation on three real cloud service systems and two real-world incidents demonstrates that EventADL outperforms existing methods, achieving F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization.
[AI-192] CGM-JEPA: Learning Consistent Continuous Glucose Monitor Representations via Predictive Self-Supervised Pretraining
【速读】:该论文旨在解决连续葡萄糖监测(CGM)在大规模人群部署中面临的两个核心问题:一是同一生理状态(如胰岛素抵抗、β细胞功能障碍)可通过多种模态(CGM时序数据、静脉OGTT、Glucodensity总结)呈现,导致单一视图表示在模态或场景迁移时无法有效泛化;二是现有基线方法在不同迁移场景下表现不一致。解决方案的关键在于提出CGM-JEPA框架——一种基于自监督预训练的方法,通过预测被掩码的潜在表示(而不是原始值)来实现对特定视图的抽象,从而捕捉更高层次的时间和分布结构。进一步引入X-CGM-JEPA,增加掩码Glucodensity跨视图目标以补充分布信息,显著提升了跨模态迁移性能与子群体公平性,在多个临床验证场景中均优于所有基线模型。
链接: https://arxiv.org/abs/2605.00933
作者: Hada Melino Muhammad,Zechen Li,Flora Salim,Ahmed A. Metwally
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continuous Glucose Monitoring (CGM) can detect early metabolic subphenotypes (insulin resistance, IR; \beta -cell dysfunction), but population-scale deployment faces two coupled problems. First, the same physiological state appears through multiple views (CGM time series, venous OGTT, Glucodensity summaries), so single-view representations fail to transfer when deployment shifts the modality or setting. Second, baselines perform inconsistently across these shifts. Both problems point to one remedy: representations that abstract away from any single view to capture higher-level temporal and distributional structure. We propose CGM-JEPA, a self-supervised pretraining framework which predicts masked latent representations rather than raw values, yielding abstraction that transfers across modalities. X-CGM-JEPA adds a masked Glucodensity cross-view objective for complementary distributional information. We pretrain on \sim 389k unlabeled CGM readings from 228 subjects and evaluate on two clinical cohorts ( N=27 and N=17 public-release subsets) across three regimes (cohort generalization, venous-to-CGM transfer, home CGM) under 20-iteration \times 2-fold cross-validation. X-CGM-JEPA ranks first or second on AUROC for both endpoints across all three regimes while no baseline does, exceeding the strongest baseline by up to +6.5 pp in cohort generalization and +3.6 pp in venous-to-CGM transfer (paired Wilcoxon, p0.001 ). Under modality shift, it matches mean AUROC while redistributing toward weaker subgroups (ethnicity AUROC gap shrinks 25-54%); on sparse in-domain venous data, the distributional view lifts label-aware clustering (ARI +39% , NMI +40% ). Code and weights: this https URL
[AI-193] Code World Model Preparedness Report
【速读】:该论文旨在解决生成式 AI(Generative AI)在代码生成与推理领域可能引发的前沿风险评估问题,尤其关注其在潜在灾难性风险场景下的表现。解决方案的关键在于通过预发布测试对 Code World Model (CWM) 在前沿人工智能框架识别出的高风险领域进行系统性评估,并量化其行为偏离对齐目标的可能性;结果显示 CWM 的风险水平未超出当前 AI 生态系统的现有风险范畴,因此决定以开放权重模型形式发布,以促进安全可控的模型演进与研究。
链接: https://arxiv.org/abs/2605.00932
作者: Daniel Song,Peter Ney,Cristina Menghini,Faizan Ahmad,Aidan Boyd,Nathaniel Li,Ziwen Han,Jean-Christophe Testud,Saisuke Okabayashi,Maeve Ryan,Jinpeng Miao,Hamza Kwisaba,Felix Binder,Spencer Whitman,Jim Gust,Esteban Arcaute,Dhaval Kapil,Jacob Kahn,Ayaz Minhas,Tristan Goodman,Lauren Deason,Alexander Vaughan,Shengjia Zhao,Summer Yue
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 25 pages, 3 figures
Abstract:This report documents the preparedness assessment of Code World Model (CWM), a model for code generation and reasoning about code from Meta. We conducted pre-release testing across domains identified in our Frontier AI Framework as potentially presenting catastrophic risks, and also evaluated the model’s misaligned propensities. Our assessment found that CWM does not pose additional frontier risks beyond those present in the current AI ecosystem. We therefore release it as an open-weight model.
[AI-194] PhaseNet: Phase-Aware Frequency-Domain Anomaly Detection for Industrial Control Systems via Phase Coherence Graphs
【速读】:该论文旨在解决工业控制系统(ICS)中多变量时间序列异常检测的问题,特别是针对现有方法仅依赖时域幅值信息而忽略频域相位信息所导致的检测性能局限。其解决方案的关键在于提出PhaseNet++,一种基于短时傅里叶变换(STFT)的频域自编码器,同时保留幅度和相位谱;并通过引入相位相干性指数(Phase Coherence Index, PCI),借鉴神经科学中的相位锁定值(Phase Locking Value)思想,将跨频段的成对相位一致性建模为连续邻接矩阵,引导图注意力网络优先在相位同步传感器间传播信息;此外,结合传感器令牌Transformer编码器与双头解码器,实现幅度与相位的联合重建,从而显著提升异常检测精度(F1-score达90.98%)。
链接: https://arxiv.org/abs/2605.00929
作者: Raviteja Bommireddy,Varshith Bandaru,Lohith Pakala,Pradeep Kumar B
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure
Abstract:Multivariate time series anomaly detection in ICS has attracted growing attention due to the increasing threat of cyber-physical attacks on critical infrastructure. State-of-the-art methods model inter-sensor relationships from raw time-domain amplitude values, using graph neural networks, Transformers. However, these methods discard the phase spectrum produced by time frequency transformations, We argue that phase information constitutes a complementary and previously overlooked detection modality for ICS anomaly detection. We present PhaseNet++, a frequency-domain autoencoder that operates on the Short-Time Fourier Transform (STFT) of sliding sensor windows, retaining both magnitude and phase spectra. A Phase Coherence Index (PCI), inspired by the Phase Locking Value from neuroscience, summarizes pairwise phase consistency across frequency bins into a continuous adjacency matrix. This matrix guides a graph attention network that propagates information preferentially among phase-synchronized sensors. A sensor-token Transformer encoder captures system-wide structure, and a dual-head decoder reconstructs magnitude and phase jointly via circular and coherence-aware objectives. Evaluated on the Secure Water Treatment (SWaT) benchmark, PhaseNet++ achieves an F1-score of 90.98%, ROC-AUC of 95.66%, and average precision of 91.51%. Ablation studies show that the phase-aware front-end and PCI graph module together add only 264,816 parameters, demonstrating that the phase inductive bias is lightweight. While the absolute F1-score is second best than that of all recent raw-value methods evaluated under different protocols, we position this work as the first systematic study of phase-domain anomaly detection for ICS.
[AI-195] StyleShield: Exposing the Frag ility of AIGC Detectors through Continuous Controllable Style Transfer
【速读】:该论文旨在解决生成式 AI (Generative AI) 内容检测器在高风险场景(如学术诚信审查)中可靠性不足的问题,其核心矛盾在于:随着语言模型性能提升,AI生成文本与人类写作之间的统计边界逐渐模糊,而商业利益驱动下的检测服务与“去AI化”工具形成闭环链条,将内容质量评估异化为对来源的判断。解决方案的关键是提出 StyleShield,首个基于流匹配(flow matching)的条件文本风格迁移框架,直接在连续的 token 嵌入空间中操作,采用 DiT(Diffusion Transformer)主干网络并结合零初始化交叉注意力适配器(cross-attention adapters),以冻结的 Qwen-7B 表示作为条件;推理时引入 SDEdit 思想于文本嵌入空间,仅用一个参数 γ 实现对抗检测能力与语义保留之间的平滑控制,实验表明其在多领域中文基准上可实现 94.6% 的规避率(针对训练检测器)和 ≥99% 的规避率(针对三个未见检测器),同时保持 0.928 的语义相似度。此外,作者还设计 RateAudit 文档级调度算法,证明检测率可被任意设定,从而质疑基于分数的评估体系的有效性。
链接: https://arxiv.org/abs/2605.00924
作者: Guantian Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures. Code and model weights will be released upon acceptance
Abstract:AI-generated content (AIGC) detectors are increasingly deployed in high-stakes settings such as academic integrity screening, yet their reliability rests on a fundamental paradox: as language models are trained on human-written corpora, the statistical boundary between AI and human writing will inevitably dissolve as models improve. Commercial incentives have further distorted this landscape – detection services and “de-AIification” tools often operate within the same supply chain, replacing evaluation of content quality with judgment of content origin. We present StyleShield, the first flow matching framework for conditional text style transfer, operating directly in continuous token embedding space via a DiT backbone with zero-initialized cross-attention adapters conditioned on frozen Qwen-7B representations. At inference, we adapt the SDEdit paradigm from image synthesis to text embeddings, with a single parameter gamma providing smooth continuous control over the evasion-preservation trade-off. On a multi-domain Chinese benchmark, StyleShield achieves 94.6% evasion against the training detector and =99% against three unseen detectors, maintaining 0.928 semantic similarity. We further introduce RateAudit, a document-level scheduling algorithm that demonstrates detection-rate verdicts can be set to arbitrary values, directly questioning the reliability of score-based evaluation.
[AI-196] o Vibe Research or Not to Vibe Research? Generative AI in Qualitative Research
【速读】:该论文旨在解决当前定性研究领域中关于生成式 AI(Generative AI)是否适用于定性研究的争议问题。其解决方案的关键在于系统梳理了生成式 AI 在定性研究中的适用性讨论,并指出研究者的哲学立场(小-q,即实证主义或后实证主义,与大-Q,即非实证主义)是决定是否使用生成式 AI 的核心标准之一;此外,研究技能、伦理考量及个人偏好等因素也共同影响研究人员对是否采用 AI 工具的决策。
链接: https://arxiv.org/abs/2605.00922
作者: Katja Karhu,Kari Smolander,Jussi Kasurinen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures. Accepted to VibeX 2026: 1st International Workshop on Vibe Coding and Vibe Researching
Abstract:There has been intense debate among qualitative researchers about whether generative AI is suitable for qualitative research. In this paper, we summarize the broader ongoing discussion of generative AI in qualitative research and its implications for software engineering researchers. The qualitative research approach, small-q (positivist or post-positivist) or Big Q (non-positivist), is among the major criteria for determining whether generative AI can be used in qualitative research. In addition to research philosophy and research approach, skills, ethics, and personal preferences also play a role in researchers’ decisions about whether to use AI in qualitative research.
[AI-197] Accelerating battery research with an AI interface between FINALES and Kadi4Mat
【速读】:该论文旨在解决钠离子电池(Sodium-ion battery)在形成过程(formation process)中耗时过长的问题,该问题直接影响电池的循环寿命和终端性能(End Of Life, EOL performance)。研究目标是优化形成工艺,在缩短时间的同时最大化EOL性能,且通过减少实验次数来降低资源消耗并加速发现。解决方案的关键在于构建一个跨生态系统的互操作框架,整合FINALES与Kadi RDM平台:其中FINALES负责在POLiS MAP上规划和执行实验,而基于多目标批量贝叶斯优化(multi-objective batched Bayesian optimization)的主动学习代理(active-learning agent)在Kadi4Mat中指导实验选择,从而高效探索参数空间并逼近帕累托前沿(Pareto front)。此方法实现了自动化系统与人工流程之间的协同分布式协作,为电池研究中的数据驱动优化提供了可迁移的范式。
链接: https://arxiv.org/abs/2605.00909
作者: Giovanna Tosato(1),Leon Merker(1 and 2 and 3),Monika Vogler(3),Michael Selzer(1),Arnd Koeppe(1) ((1) Karlsruhe Institute of Technology, (2) Helmholtz Institute Ulm, (3) Technical University of Munich)
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
备注: Main manuscript: 21 pages, 9 figures. Supporting material: 3 pages, 5 figures. Submitted to “Batteries Supercaps”, currently under revision
Abstract:The time-consuming formation process critically impacts the longevity of sodium-ion coin cells and End Of Life (EOL) performance. This study aims to optimize formation protocols for duration efficiency, targeting high-performance outcomes while minimizing the number of experiments to reduce resource consumption and accelerate discovery. Specifically, we consider two potentially competing objectives: minimizing formation time and maximizing EOL performance. Beyond this application focus, we also present a methodological contribution: a framework designed to enable interoperability between the FINALES and Kadi RDM ecosystems, which we employ to tackle our optimization problem. In this setup, the FINALES framework orchestrates experiment planning and execution on the POLiS MAP, while an active-learning agent implemented within Kadi4Mat guides experiment selection, using multi-objective batched Bayesian optimization to efficiently explore the parameter space. This interoperability enhancement enables coordinated, distributed collaboration across automated systems and human-operated workflows, bridging multiple research centers. Using this approach, we iteratively explore the trade-off between formation time and EOL performance and identify candidate solutions approximating the Pareto front. The resulting workflow demonstrates the capability of interoperable infrastructures to facilitate data-driven optimization in battery research, and establishes a transferable framework applicable to diverse materials science and engineering optimization tasks.
[AI-198] he Oracles Fingerprint: Correlated AI Forecasting Errors and the Limits of Bias Transmission
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在作为预测工具时可能引发的“认知同质化”(epistemic monoculture)问题,即多个独立开发的LLM因共享相似的错误模式而削弱集体智慧的基础——个体预测误差的独立性。解决方案的关键在于通过三项实证研究揭示:尽管GPT-4o、Claude和Gemini由不同机构独立开发,其预测误差高度相关(平均成对相关系数r = 0.77),表明存在系统性偏差的共通来源;进一步发现人类群体预测并未被LLM引导形成新的偏差,反而在LLM出现前已具备与之相似的偏倚模式(r = 0.87),且LLM引入后这种相似性反而下降(r = -0.28),说明当前的“认知同质化”虽已存在(即AI系统间偏差趋同),但尚未显著影响人类决策,提示风险尚未激活,但仍需警惕其潜在放大效应。
链接: https://arxiv.org/abs/2605.00844
作者: Theodor Spiro
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 23 pages, 3 figures, 5 tables
Abstract:When large language models (LLMs) are consulted as forecasting tools, the independence of individual errors – the foundation of collective intelligence – may collapse. We test three conditions necessary for this “epistemic monoculture” to emerge. In Study 1, we show that GPT-4o, Claude, and Gemini exhibit highly correlated forecasting errors on 568 resolved binary prediction questions (mean pairwise error correlation r = 0.77, p 0.001; r = 0.78 excluding likely-leaked questions), despite being developed independently by different organizations. In Study 2, we test whether this correlated bias has propagated into human crowd forecasts, using a within-question design that tracks community prediction shifts across the ChatGPT launch boundary (November 2022). We find that community forecasts move in the direction predicted by LLMs (r = 0.20, p = 0.007), but this shift is fully explained by rational updating toward ground truth. In Study 3, we examine whether the category-level pattern of human forecasting errors increasingly resembles the LLM bias fingerprint. We find the opposite: pre-ChatGPT human biases already strongly resembled the LLM pattern (r = 0.87), while post-ChatGPT the resemblance weakened (r = -0.28). Together, these findings reveal an epistemic monoculture that is built but not yet activated: three nominally independent AI systems share the same failure modes, amplifying precisely the biases humans already hold.
[AI-199] Generative-AI and the transformation of workforce. A job postings-driven analysis
【速读】:该论文旨在解决生成式AI(Generative AI)如何重塑全球劳动力市场中的岗位要求、技能构成及行业动态这一核心问题。其解决方案的关键在于构建了一个大规模、多源的数据集(涵盖2018–2025年超过15万条英文职位招聘信息),并采用融合词法提取、语义框架分析、主题建模(BERTopic、LDA、KMeans)与时间序列预测(ARIMA)的综合分析框架,量化不同技能维度(如AI_Data、Routine、Soft_Meta、Domain_Specific和Leadership)的变化趋势及其跨行业关联性,并通过Sentence-transformer嵌入与余弦相似度计算“框架指数”以区分增强型(augmentation-oriented)与自动化导向(automation-oriented)的话语模式,从而揭示生成式AI在职场中作为增强工具而非替代力量的结构性演变路径。
链接: https://arxiv.org/abs/2605.00843
作者: Diana Maria Popa,Simona-Vasilica Oprea,Adela Bâra
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper investigates how generative-artificial intelligence AI is reshaping job requirements, skill compositions and sectoral dynamics across global labor markets. It examines the evolving frequency and framing of AI-related competencies in job postings, exploring whether generative-AI functions primarily as an augmentative or substitutive force in the workplace. A large-scale, multi-source corpus of over 150,000 English-language job postings 2018-2025 is compiled from twelve open-access datasets and one public API. The analytical framework integrates lexical skill extraction, semantic framing, topic modeling, BERTopic, LDA, KMeans, and time-series forecasting ARIMA. Skill mentions are categorized into five dimensions: AI_Data, Routine, Soft_Meta, Domain_Specific and Leadership, while cross sectoral analyses and correlation matrices quantify interdependencies between competencies. Sentence-transformer embeddings and cosine similarity are used to compute a Framing Index, distinguishing augmentation- versus automation-oriented discourse. Investigating job postings, our research contributes a replicable, data driven methodology for mapping the diffusion of AI related skills across industries and time. Results reveal a sharp post-2021 increase in AI-related skill mentions: prompt engineering, fine-tuning and model validation, accompanied by a decline in routine tasks: data entry and manual coding. Forecasts suggest sustained growth in AI_Data and Soft_Meta skills through 2025, signaling a structural convergence toward hybrid human-AI expertise as a new foundation of employability.
[AI-200] Understanding Emergent Misalignment via Feature Superposition Geometry ACL2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中“涌现错位”(emergent misalignment)问题,即在对窄域、非有害任务进行微调时,意外诱发有害行为的现象。其核心解决方案基于特征超叠加(feature superposition)的几何视角:由于模型中的特征以重叠表示编码,微调过程中放大目标特征的同时,也会因相似性无意中增强邻近的有害特征。关键创新在于通过稀疏自编码器(sparse autoencoders, SAEs)识别与错位诱导数据及有害行为相关的特征,并验证这些特征在嵌入空间中几何上更接近;进而提出一种基于几何感知的训练样本过滤方法,剔除最接近有毒特征的样本,使错位程度降低34.5%,显著优于随机删除,且效果接近或优于LLM-as-a-judge的筛选策略。
链接: https://arxiv.org/abs/2605.00842
作者: Gouki Minegishi,Hiroki Furuta,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL2026
Abstract:Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To uncover the reason behind this phenomenon, we propose a geometric account based on the geometry of feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful features in accordance with their similarity. We give a simple gradient-level derivation of this effect and empirically test it in multiple LLMs (Gemma-2 2B/9B/27B, LLaMA-3.1 8B, GPT-OSS 20B). Using sparse autoencoders (SAEs), we identify features tied to misalignment-inducing data and to harmful behaviors, and show that they are geometrically closer to each other than features derived from non-inducing data. This trend generalizes across domains (e.g., health, career, legal advice). Finally, we show that a geometry-aware approach, filtering training samples closest to toxic features, reduces misalignment by 34.5%, substantially outperforming random removal and achieving comparable or slightly lower misalignment than LLM-as-a-judge-based filtering. Our study links emergent misalignment to feature superposition, providing a basis for understanding and mitigating this phenomenon.
[AI-201] AI Agents for Sustainable SMEs: A Green ESG Assessment Framework
【速读】:该论文旨在解决欧洲中小企业(SMEs)在环境、社会和治理(ESG)绩效评估中缺乏高效、可扩展且准确的自动化工具的问题。传统方法依赖人工评分与专家判断,难以满足大规模应用需求,限制了ESG管理的有效性和政策干预的及时性。解决方案的关键在于构建一个基于AI代理(AI agent)的自动化框架,利用n8n自动化平台集成大型语言模型(LLMs),将专家验证的ESG基准分数应用于实际数据,实现对中小企业的ESG分类与情境化建议生成,从而显著提升评估的一致性与可操作性,支撑欧洲绿色协议(European Green Deal)目标下的可持续发展实践。
链接: https://arxiv.org/abs/2605.00841
作者: Viet Trinh,Tan Nguyen,Minh-Huyen Phan,Quan Luu
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:
Abstract:This study presents a novel, AI-driven framework for assessing Environmental, Social, and Governance (ESG) performance in European small and medium-sized enterprises (SMEs). An initial phase established expert-validated ESG baseline scores from a subset of the Flash Eurobarometer FL549 survey data. In the second phase, a scalable AI agent system, built on the n8n automation platform, applied these baselines to perform automated ESG classification and generate contextual recommendations using large language models (LLMs). The results demonstrate the AI system’s high consistency with human-derived outputs, thereby supporting more effective monitoring and intervention strategies aligned with the European Green Deal.
[AI-202] 2026 Roadmap on Artificial Intelligence and Machine Learning for Smart Manufacturing
【速读】:该论文旨在解决人工智能(AI)与机器学习(ML)在智能制造中部署时面临的诸多关键挑战,包括工业大数据的复杂性、有效数据管理、异构传感与控制系统集成,以及高风险工业环境中对可信、可解释和可靠运行的需求。其解决方案的关键在于提出一个三部分的路线图:首先梳理AI在智能制造中的基础与发展趋势;其次聚焦AI已推动突破的核心领域,如工业大数据分析、先进感知、自主系统、增材制造、数字孪生、机器人技术、供应链优化及可持续制造;最后探索非传统ML方法的新前沿,例如物理信息驱动的AI、生成式AI(Generative AI)、语义AI、可解释AI(Explainable AI)、RAMS(可靠性、可用性、可维护性和安全性)、以数据为中心的计量学、大语言模型(LLMs)及基础模型,从而为方法创新、系统集成策略和工业应用落地提供明确方向,助力智能制造实现可靠、可持续且可扩展的发展。
链接: https://arxiv.org/abs/2605.00839
作者: Jay Lee,Hanqi Su,Marco Macchi,Adalberto Polenghi,Wei Wu,Zhiheng Zhao,George Q.Huang,Kiva Allgood,Devendra Jain,Benedikt Gieger,Vibhor Pandhare,Soumyabrata Bhattacharjee,Ram Mohril,Lingbao Kong,Qiyuan Wang,Xinlan Tang,Sungjong Kim,Chan Hee Park,Byeng D. Youn,Guo Dong Goh,Xi Huang,Wai Yee Yeong,Yung C Shin,He Zhang,Zitong Wang,Fei Tao,Jagjit Singh Srai,Satyandra K. Gupta,Byung Gun Joung,Albin John,John W. Sutherland,Sang Won Lee,Olga Fink,Vinay Sharma,Faez Ahmed,Wei Chen,Mark Fuge,Arild Waaler,Martin G. Skjæveland,Dimitris Kyritsis,Wei Chen,VispiNevile Karkaria,Yi-Ping Chen,Ying-Kuan Tsai,Joseph Cohen,Xun Huan,Jing Lin,Liangwei Zhang,Gregory W. Vogl,Aaron W. Cornelius,Xiaodong Jia,Dai-Yan Ji,Takanobu Minami,Ruoxin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper has been accepted for publication in the Journal Machine Learning: Engineering
Abstract:The evolution of artificial intelligence (AI) and machine learning (ML) is reshaping smart manufacturing by providing new capabilities for efficiency, adaptability, and autonomy across industrial value chains. However, the deployment of AI and ML in industrial settings still faces critical challenges, including the complexity of industrial big data, effective data management, integration with heterogeneous sensing and control systems, and the demand for trustworthy, explainable, and reliable operation in high-stakes industrial environments. In this roadmap, we present a comprehensive perspective on the foundations, applications, and emerging directions of AI and ML in smart manufacturing. It is structured in three parts. The first highlights the foundations and trends that frame the evolution of AI in smart manufacturing. The second focuses on key topics where AI is already enabling advances, including industrial big data analytics, advanced sensing and perception, autonomous systems, additive and laser-based manufacturing, digital twins, robotics, supply chain and logistics optimization, and sustainable manufacturing. The third section explores non-traditional ML approaches that are opening new frontiers, such as physics-informed AI, generative AI, semantic AI, advanced digital twins, explainable AI, RAMS, data-centric metrology, LLMs, and foundation models for highly connected and complex manufacturing systems. By identifying both opportunities and remaining barriers across these areas, this roadmap outlines the advances needed in methods, integration strategies, and industrial adoption. We hope this roadmap will serve as a guide for researchers, engineers, and practitioners to accelerate innovation, align academic and industrial priorities, and ensure that AI-driven smart manufacturing delivers reliable, sustainable, and scalable impact for the future of manufacturing ecosystems.
[AI-203] Agent opic: A Generative AI Agent Workflow for Explainable Topic Modeling
【速读】:该论文旨在解决传统主题建模方法(如LDA和BERTopic)在话题分配与分组过程中缺乏透明性的问题,即用户难以理解模型如何得出特定话题及其结构。解决方案的关键在于提出一种基于代理(Agent-based)的工作流——Agentopic,其通过多个协作代理(agents)完成话题识别、验证、层次聚类及自然语言解释等步骤,从而实现可解释的主题建模。该设计使用户能够追溯话题分配的推理过程,在保持高准确率的同时显著提升模型的可解释性,例如在BBC数据集上达到F1-score 0.95,优于LDA并接近BERTopic。
链接: https://arxiv.org/abs/2605.00833
作者: Brice Valentin Kok-Shun,Johnny Chan,Gabrielle Peko,David Sundaram
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures
Abstract:Agentopic is a novel agent-based workflow for explainable topic modeling that leverages the reasoning capabilities of Large Language Models (LLMs). Existing topic modeling approaches such as Latent Dirichlet Allocation (LDA) and BERTopic often lack transparency on how topics are assigned or grouped. Agentopic addresses this by using multiple agents that collaboratively perform topic identification, validation, hierarchical grouping, and natural language explanation. This design enables users to trace the reasoning behind topic assignments, enhancing interpretability without sacrificing accuracy. When seeded with topics from the British Broadcasting Corporation (BBC) dataset, Agentopic achieves an F1-score of 0.95, matching GPT-4.1, improving on LDA (0.93), and close to BERTopic (0.98). We used Agentopic to augment the BBC dataset with generated explanations to improve the dataset’s richness and context. The unseeded Agentopic generated 2045 semantically coherent topics organized across six hierarchical levels, vastly enriching the original five-category structure. By embedding explainability throughout the workflow, Agentopic offers an interpretable alternative to black-box models, making it particularly valuable for crucial applications like finance and healthcare.
[AI-204] GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)推理服务中因长序列任务易受软硬件故障影响而导致的高成本失败、资源浪费及用户体验下降问题,其核心挑战在于状态化键值(Key-Value, KV)缓存随序列长度增长而变得庞大且脆弱,成为分布式服务系统中的关键瓶颈。解决方案的关键是提出GhostServe,通过在主机内存中对流式KV缓存进行擦除编码(erasure coding),生成并存储奇偶校验分片(parity shards),从而在设备故障时实现快速重建丢失的KV缓存,避免昂贵的全量重新计算或状态复制,显著降低检查点延迟和恢复延迟,提升LLM服务的可用性与成本效益。
链接: https://arxiv.org/abs/2605.00831
作者: Shakya Jayakody,Youpeng Zhao,Chinmay Dhanraj Nehate,Jun Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: MLSys 2026
Abstract:The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software faults, leading to costly job failures, wasted resources, and degraded user experience. The stateful key-value (KV) cache, which grows with the sequence length, presents a central challenge as it is a critical and vulnerable component in distributed serving systems. In this work, we propose GhostServe, a novel checkpointing solution to facilitate fault-tolerant LLM serving. Specifically, GhostServe protects the streaming KV cache in the shadow by applying erasure coding to generate and store the parity shards in host memory. In the event of device failures, GhostServe enables fast reconstruction of the lost KV cache, allowing the inference process to resume seamlessly without costly full recomputation or state replication. Evaluations demonstrate that GhostServe reduces checkpointing latency by up to 2.7x and recovery latency by 2.1x for a single batch, and 1.2x median response latency compared to existing methods, in the presence of system failures, paving the way for high-availability and cost-effective LLM serving at scale.
[AI-205] Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在通过工具调用协议(如Model Context Protocol, MCP)与外部系统交互时,因每次会话均需重新推理所有工具调用而导致的高Token消耗问题,尤其是在重复执行相同任务时效率低下。解决方案的关键在于提出MCP Workflow Engine——一种原生支持MCP的编排层,它将决策智能(intelligence,即决定做什么)与执行逻辑(execution,即具体如何做)解耦:代理仅需一次推理生成一个声明式工作流蓝图(JSON文档),其中包含参数化模板、循环、并行分支和数据管道等结构;后续执行仅需调用单一run_workflow工具,无论蓝图复杂度如何,token消耗恒定为一次调用量。该方案通过形式化MCP中介者架构(MCP Mediator architectural pattern)实现跨服务器协同,并在生产级Kubernetes配置管理数据库(CMDB)同步任务中验证其有效性,显著降低token成本(>99%)、提升执行速度(<45秒完成超1200节点图谱)并确保运行时零代理参与的确定性与幂等性。
链接: https://arxiv.org/abs/2605.00827
作者: Abhinav Singh Parmar
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 16 pages, 5 figures
Abstract:Large Language Model (LLM) agents increasingly interact with external systems through tool-calling protocols such as the Model Context Protocol (MCP). In prevailing architectures, the agent must reason about every tool invocation in every session, consuming tokens proportional to the number of actions performed–even when the task has been solved before. We present the MCP Workflow Engine, a novel MCP-native orchestration layer that decouples intelligence (deciding what to do) from execution (carrying it out). An agent reasons once to produce a declarative workflow blueprint–a JSON document specifying a directed sequence of MCP tool calls with parameterized templates, loops, parallel branches, and data piping. Subsequent executions are triggered by a single run_workflow tool call, consuming one invocation’s worth of tokens regardless of the blueprint’s internal complexity. We formalize the MCP Mediator architectural pattern–an MCP server that simultaneously acts as a client to downstream MCP servers–and implement it in TypeScript against the MCP SDK. We evaluate the engine on a production-scale Kubernetes CMDB synchronization task spanning 67 orchestrated steps across 2 MCP servers, 38 namespaces, 13 worker nodes, and 22 distinct resource types. The engine reduces per-execution token cost by over 99%, completes the full cluster graph–comprising 1,200+ nodes and 2,800+ relationships across 20 relationship types–in under 45 seconds, and achieves deterministic, idempotent execution with zero agent involvement at run time.
[AI-206] he Topology of Multimodal Fusion: Why Current Architectures Fail at Creative Cognition
【速读】:该论文旨在解决当前多模态人工智能(Multimodal AI)架构中存在的结构性局限问题,该局限源于拓扑而非参数层面——即现有模型如对比对齐(CLIP)、交叉注意力融合(GPT-4V/Gemini)和扩散生成模型普遍遵循“模态可分性”的几何先验,形成所谓的“接触拓扑”(contact topology),从而限制了跨模态深度融合与创造性协同的能力。解决方案的关键在于提出一种基于哲学、认知科学与数学三重支柱的“十字形框架”(cruciform framework),其中以“象”(xiang,操作性图式)为核心节点,实现言说(saying)与展示(showing)的互渗,并通过双重“化裁”(huacai)机制驱动创生性转化(chuanghua)与制度化固化(huacai)的双层动态演化;同时借助纤维丛(fiber bundle)与杨-Mills曲率的形式化建模,将该结构映射为可计算的拓扑语义空间,并设计神经微分方程(Neural ODEs)结合拓扑正则化(topological regularization)的UOO实现路径,辅以ANALOGY-MM与META-TOP等基准测试体系,系统验证跨文明拓扑同构性,最终构建具有明确终止条件的分阶段实验路线以确保理论可证伪性。
链接: https://arxiv.org/abs/2604.04465
作者: Xiujiang Tan(Guangzhou Academy of Fine Arts, Guangzhou, China)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Expanded 11 technical improvements; 5 reference corrections; Appendix B pseudocode added. ~43 pages, 5 figures. Chinese philosophical terms romanized. Companion monograph available separately
Abstract:This paper identifies a structural limitation in current multimodal AI architectures that is topological rather than parametric. Contrastive alignment (CLIP), cross-attention fusion (GPT-4V/Gemini), and diffusion-based generation share a common geometric prior – modal separability – which we term contact topology. The argument rests on three pillars with philosophy as the generative center. The philosophical pillar reinterprets Wittgenstein’s saying/showing distinction as a problem rather than a conclusion: where Wittgenstein chose silence, the Chinese craft epistemology tradition responded with xiang (operative schema) – the third state emerging when saying and showing interpenetrate. A cruciform framework (dao/qi x saying/showing) positions xiang at the intersection, executing dual huacai (transformation-and-cutting) along both axes. This generates a dual-layer dynamics: chuanghua (creative transformation as spontaneous event) and huacai (its institutionalization into repeatable form). The cognitive science pillar reinterprets DMN/ECN/SN tripartite co-activation through the pathological mirror: overlap isomorphism vs. superimposition collapse in a 2D parameter space (coupling intensity x regulatory capacity). The mathematical pillar formalizes these via fiber bundles and Yang-Mills curvature, with the cruciform structure mapped to fiber bundle language. We propose UOO implementation via Neural ODEs with topological regularization, the ANALOGY-MM benchmark with error-type-ratio metric, and the META-TOP three-tier benchmark testing cross-civilizational topological isomorphism across seven archetypes. A phased experimental roadmap with explicit termination criteria ensures clean exit if falsified.
[AI-207] Autonomous QA Agent : A Retrieval-Augmented Framework for Reliable Selenium Script Generation
【速读】:该论文旨在解决软件测试中将需求转化为可执行测试脚本时存在的手动化和易出错问题,特别是大型语言模型(Large Language Models, LLMs)在生成 Selenium 脚本时容易虚构不存在的 UI 元素(即“幻觉”问题)。其解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的自主 QA 智能体,通过将项目特定文档(如 Markdown、PDF、HTML)嵌入向量数据库,并在生成脚本前检索相关上下文信息,从而将脚本生成过程锚定在实际的 DOM 结构上,有效抑制了 LLM 的幻觉现象。实验表明,该方法在 20 个电商测试场景中实现了 100% 的语法有效性与 90% 的执行成功率,显著优于标准 LLM 方法(30%)。
链接: https://arxiv.org/abs/2601.06034
作者: Dudekula Kasim Vali
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 figures, 3 tables
Abstract:Software testing is critical in the software development lifecycle, yet translating requirements into executable test scripts remains manual and error-prone. While Large Language Models (LLMs) can generate code, they often hallucinate non-existent UI elements. We present the Autonomous QA Agent, a Retrieval-Augmented Generation (RAG) system that grounds Selenium script generation in project-specific documentation and HTML structure. By ingesting diverse formats (Markdown, PDF, HTML) into a vector database, our system retrieves relevant context before generation. Evaluation on 20 e-commerce test scenarios shows our RAG approach achieves 100% (20/20) syntax validity and 90% (18/20, 95% CI: [85%, 95%], p 0.001) execution success, compared to 30% for standard LLM generation. While our evaluation is limited to a single domain, our method significantly reduces hallucinations by grounding generation in actual DOM structure, demonstrating RAG’s potential for automated UI testing.
[AI-208] A second-order method on the Stiefel manifold via Newtonunicodex2013Schulz
【速读】:该论文旨在解决在Stiefel流形上优化问题中,传统无收缩(retraction-free)方法多为一阶方法、难以满足高精度要求的问题。其解决方案的关键在于提出一种无需引入收缩映射的二阶方法,该方法通过将更新方向分解为两个分量实现:一是沿约束函数等值面切空间的分量,用于降低目标函数值;二是沿同一等值面法空间的分量,用于减少不可行性。其中,法向分量由Newton–Schulz迭代构造,该迭代本质上是正交化固定点迭代,并被证明在Stiefel流形上沿法空间移动;而切向分量则通过引入Newton–Schulz的修正牛顿方程来求解。理论分析表明该方法具有局部二次(或不精确版本的超线性)收敛性,数值实验验证了其在正交Procrustes问题、主成分分析和真实数据独立成分分析中的优越性能。
链接: https://arxiv.org/abs/2605.02838
作者: Xinhui Xiong,Bin Gao,P.-A. Absil
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: 25 pages, 4 figures
Abstract:Retraction-free approaches offer attractive low-cost alternatives to Riemannian methods on the Stiefel manifold, but they are often first-order, which may limit the efficiency under high-accuracy requirements. To this end, we propose a second-order method landing on the Stiefel manifold without invoking retractions, which is proved to enjoy local quadratic (or superlinear for its inexact variant) convergence. The update consists of the sum of (i) a component tangent to the level set of the constraint-defining function that aims to reduce the objective and (ii) a component normal to the same level set that reduces the infeasibility. Specifically, we construct the normal component via Newton \unicodex2013 Schulz, a fixed-point iteration for orthogonalization. Moreover, we establish a geometric connection between the Newton \unicodex2013 Schulz iteration and Stiefel manifolds, in which Newton \unicodex2013 Schulz moves along the normal space. For the tangent component, we formulate a modified Newton equation that incorporates Newton \unicodex2013 Schulz. Numerical experiments on the orthogonal Procrustes problem, principal component analysis, and real-data independent component analysis illustrate that the proposed method performs better than the existing methods.
[AI-209] Entanglement is Half the Story: Post-Selection vs. Partial Traces
【速读】:该论文旨在解决传统经典张量网络与量子张量网络在机器学习建模中能力差异的问题,特别是如何通过统一框架实现两者的平滑过渡与协同优化。解决方案的关键在于提出一种混合张量网络架构,该架构利用量子计算机对经典张量网络进行推理,并引入“后选择(post-selection)”作为核心机制——其强度决定了量子约束在张量网络中的施加程度。进一步地,作者定义了一个新的超参数来控制从混合模型向纯量子张量网络的过渡过程,该超参数可训练地分配有限的后选择资源,从而提升量子机器学习模型的实际性能。
链接: https://arxiv.org/abs/2605.02385
作者: Gustav J L Jäger,Krzysztof Bieniasz,Martin B Plenio,Hans-Martin Rieser
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures
Abstract:While tensor networks have their traditional application in simulating quantum systems, in the recent decade they have gathered interest as machine learning models. We combine the experience from both fields and derive how quantum constraints placed on a tensor network manifest a change in capabilities. To this end, we employ a method of inference of classical tensor networks on a quantum computer to define a hybrid architecture. This hybrid tensor network is a practical unified framework for it’s classical and quantum tensor network edge cases. We identify post-selection as the important property on which this interpolation hinges. The amount of post-selection corresponds to the level to which quantum constraints are enforced on the tensor network. On this basis, we propose a new hyperparameter which controls the transition between the hybrid and the quantum tensor network. In the comparison of classical and quantum tensor networks it complements the bond dimension. Quantum machine learning is improved by using the hyperparameter to allocate the practically limited post-selection to the quantum model in a trainable manner.
[AI-210] rees and Graphs with Non Log-concave Dominating Set Sequence via AI Tools
【速读】:该论文旨在解决图论中支配集序列(dominating set sequence)的对数凹性(log-concavity)问题,即验证或反例说明此类序列是否满足对数凹性这一重要组合性质。研究发现,存在非对数凹的图与树的例子,这些反例通过基于Transformer架构的强化学习软件PatternBoost生成;同时,作者通过类比Bautista-Ramos关于独立集序列的构造方法,证明了对于任意正整数 $ m $,均存在一棵树使得其支配集序列在至少 $ m $ 个索引处不满足对数凹性。关键解决方案在于利用机器学习工具自动搜索反例,并结合结构化图类(如毛虫图,caterpillar graphs)的分析,揭示了特定图类中支配集序列仍保持对数凹性的条件,从而深化了对支配集序列结构性质的理解。
链接: https://arxiv.org/abs/2605.02193
作者: Alina Du,Steven Heilman,Greta Panova
机构: 未知
类目: Combinatorics (math.CO); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
备注: 21 pages, 8 figures
Abstract:We give new examples of graphs and trees with dominating set sequences that are not log-concave. These examples were generated by PatternBoost, a transformer-based reinforcement learning software developed by Charton-Ellenberg-Wagner-Williamson. We also show: for any positive integer m , there exists a tree whose dominating set sequence is not log-concave for at least m indices by modifying a similar construction of Bautista-Ramos for the independent set sequence. We show that a large class of caterpillar graphs has log-concave dominating set sequences. A continuous analogue of the sequence is also log-concave for all graphs.
[AI-211] he Causal Description Gap: Information-Theoretic Separations Across Pearls Hierarchy
【速读】:该论文旨在解决因果推理中不同层级查询(观测、干预、反事实)之间的信息复杂度差异问题,具体量化了在已知低层级因果答案的前提下,指定更高级别因果答案所需额外的比特数。其核心解决方案是通过引入“查询类描述长度”(query-class description length),即由结构因果模型(Structural Causal Model, SCM)诱导的查询回答预言机的柯尔莫哥洛夫复杂度(Kolmogorov complexity),系统刻画了因果层级间的压缩效率边界。关键创新在于构造了一类二值无环SCM,其观测分布具有常数描述长度,而单变量干预答案预言机的描述长度为Θ(n²),从而实现观测到干预的二次级描述长度分离;进一步地,通过度敏感上界证明有限门结构(finite-gate-schema)SCM的观测-干预差距最多为O(nd log(en/d) + n log n),表明该二次构造在高密度情形下是阶最优的,且在固定ε精度的总变差近似下仍保持分离性。这一框架揭示了因果层级间的信息增益本质来源于残余高阶歧义性的对数增长。
链接: https://arxiv.org/abs/2605.02177
作者: Seyed Morteza Emadi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:
Abstract:Pearl’s causal hierarchy shows that observational, interventional, and counterfactual queries are qualitatively distinct. We ask a quantitative version of this question: how many additional bits are needed to specify higher-rung causal answers once lower-rung answers are known? We formalize this via query-class description length, the Kolmogorov complexity of the answer oracle induced by an SCM for a class of queries. Our main construction gives binary acyclic SCMs whose observational distribution has constant description length, while the single-variable interventional answer oracle has description length \Theta(n^2) . A degree-sensitive upper bound shows that finite-gate-schema SCMs of indegree d have observational-interventional gap at most O(nd \log(en/d) + n \log n) , making the quadratic construction order-optimal in the dense regime and a rooted-tree construction order-optimal for bounded indegree. The quadratic separation persists under \varepsilon -accurate total-variation descriptions for every fixed \varepsilon 1/4 . At the next rung, the full hard-do interventional oracle can still leave a \Theta(n) counterfactual description gap. A general ambiguity-to-bits theorem and Shannon analogue show that these gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms.
[AI-212] Context-Aware Wireless Token Communication via Joint Token Masking and Detection
【速读】:该论文旨在解决传统无线通信系统在处理基于token的语言驱动应用时效率低下的问题,即现有方案忽略token间的上下文依赖关系并采用均匀的资源分配策略,导致在信道劣化条件下无线资源利用不充分。其解决方案的关键在于提出一种上下文感知的token通信框架,通过共享的掩码语言模型(Masked Language Model, MLM)作为收发端共有的上下文先验,在接收端基于贝叶斯框架融合信道似然与MLM提供的上下文先验实现鲁棒的token检测,并在发送端设计上下文感知的token掩蔽策略,仅传输可被可靠推断的token,从而将有限功率集中于更具信息量的token上,最终形成统一的收发协同优化机制,显著提升重建性能。
链接: https://arxiv.org/abs/2605.02123
作者: Junyong Shin,Joohyuk Park,Yongjeong Oh,Jihong Park,Jinho Choi,Yo-Seb Jeon
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing use of token-based representations in language-driven applications has motivated wireless token communication, where tokens are treated as fundamental units for transmission. However, conventional communication systems overlook dependencies among tokens and allocate transmission resources uniformly, leading to inefficient use of limited wireless resources under channel impairments. In this paper, we propose a context-aware token communication framework that leverages a masked language model (MLM) as a shared contextual model between the transmitter (Tx) and receiver (Rx). At the Rx, we develop a context-aware token detection method that integrates channel likelihoods with MLM-based contextual priors under a Bayesian formulation, enabling robust token inference over noisy channels. At the Tx, we propose a context-aware token masking strategy that selectively omits tokens that can be reliably inferred at the Rx, allowing the available power budget to be concentrated on more informative tokens. These components are jointly designed through a shared MLM, establishing a unified Tx-Rx framework for efficient token transmission and detection. Simulation results demonstrate that the proposed framework significantly improves reconstruction performance compared to conventional and existing token communication schemes, achieving up to 1.77X and 1.63X performance gains on the Europarl corpus and WikiText-103 datasets, respectively.
[AI-213] Discover Fast Power Allocation Solution for Multi-Target Tracking via AlphaEvolve Evolution
【速读】:该论文旨在解决雷达资源分配中的高效性与计算复杂性难题,尤其针对多目标跟踪场景下实时调度、鲁棒泛化和低数据依赖性的需求。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)引导的进化搜索方法(AlphaEvolve),通过将高维雷达状态编码为物理启发式特征,演化出一个紧凑且可解释的评分函数,并借助确定性约束满足变换将其转化为可行的功率分配方案,从而在保持近优跟踪精度(平均性能损失仅1.51%)的同时实现超过三个数量级的速度提升。
链接: https://arxiv.org/abs/2605.01794
作者: Zhenkang Hou,Wenqiang Pu,Junkun Yan,Rui Zhou,Hongwei Liu
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Efficient radar resource allocation is a fundamental yet computationally challenging problem, as optimal solutions typically require iterative optimization with high complexity. Motivated by the need for real-time scheduling, robust generalization, and low data dependency, this paper proposes a novel paradigm that leverages large language model (LLM)-guided evolutionary search (AlphaEvolve) to autonomously discover a closed-form power allocation solution for multi-target tracking. The approach encodes high-dimensional radar states into physically inspired features, then evolves a compact and interpretable scoring function, which is transformed to feasible power allocations via a deterministic constraint-satisfying transformation. Extensive experiments demonstrate that the discovered closed-form solution achieves near-optimal tracking accuracy (average relative performance loss of only 1.51% ), reliable generalization across diverse scenarios and target counts, and over three orders of magnitude speedup compared to conventional iterative solvers. These results highlight the potential of LLM-guided symbolic search to revolutionize not only radar resource management but also broader classes of engineering optimization problems.
[AI-214] Data driven approach for Outdoor Channel Prediction in 5G and Beyond
【速读】:该论文旨在解决5G及 beyond 无线通信网络中传统信道估计方法存在的计算复杂度高和通信开销大的问题。传统方法依赖周期性发送导频信号进行信道估计并反馈至基站(Base Station, BS),导致系统资源消耗较大。为此,论文提出一种基于数据驱动的信道估计方案,其关键在于利用射线追踪(Ray Tracing)生成大量训练数据,并结合机器学习模型(包括线性回归、支持向量回归和决策树回归)建立用户位置与信道系数之间的映射关系。实验表明,在7GHz频段下,线性回归模型表现最优,平均绝对误差(MAE)为 7.5155×10−5,均方根误差(RMSE)为 9.2861×10−5,验证了该方法在降低复杂度的同时仍能实现高精度信道估计的可行性,具备部署为数字孪生(Digital Twin)技术的基础潜力。
链接: https://arxiv.org/abs/2605.01777
作者: A. Sathi Babu,V. Udaya Sankar,Vishnu Ram OV
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures, conference paper
Abstract:An evolution of Wireless Communications towards 5G and beyond provides improved user experience in terms of quality of services. Understanding and estimating Channel information plays crucial role in providing better user experience. Traditional methods of channel estimation involves periodically sending pilots (known signals), estimating channel and send back estimated channel information to the BS which increases computational complexity and communication complexity. Hence, we focus on data driven approach for channel estimation. This work can be deployed as Digital twin in 5G and beyond wireless networks. In this work, we explore a channel estimation mechanism at 7GHz frequency band for a given user location. This work involves data generation using Ray tracing mechanism and Machine learning model training that contains feature variables such as transmitter location, user location and target variable as channel coefficient . We explored Linear Regression, Support Vector Regression and Decision Tree Regression. We found via simulations that Linear Regression performs (with MAE of \mathbf7.5155\times10^-5 and RMSE of \mathbf9.2861\times10^-5 ) better than Support Vector Regression and Decision Tree Regression.
[AI-215] Missingness-aware Data Imputation via AI-powered Bayesian Generative Modeling
【速读】:该论文旨在解决现代数据科学中缺失数据插补(Missing Data Imputation)的核心挑战,尤其是在需要量化不确定性的情境下。传统方法通常仅提供点估计或隐式处理缺失机制,难以准确反映插补结果的不确定性。解决方案的关键在于提出MissBGM,一种基于贝叶斯生成建模(Bayesian Generative Modeling)的插补方法,它显式且联合建模数据生成机制与缺失机制,从而在插补结果上提供严格的后验不确定性估计,而非单一点估计。该方法通过交替更新缺失值、模型参数和潜在变量的随机优化框架实现高效计算,并在理论和实证层面均证明其收敛性与优越性能。
链接: https://arxiv.org/abs/2605.01676
作者: Qiao Liu
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Missing data imputation remains a fundamental challenge in modern data science, especially when uncertainty quantification is essential. In this work, we propose MissBGM, an AI-powered missing data imputation method via Bayesian generative modeling that bridges the expressive flexibility of neural networks with the statistical rigor of Bayesian inference. Unlike existing methods that often focus on point estimates or treat the missingness mechanism implicitly, MissBGM explicitly and jointly models the data-generating and missingness mechanisms, providing principled posterior uncertainty over imputations rather than a single point estimate. We develop a stochastic optimization framework with alternating updates among missing values, model parameters, and latent variables until convergence. Our theoretical analysis shows that estimates of missing values from MissBGM converge consistently under mild assumptions. Empirically, we demonstrate that MissBGM achieves superior performance over traditional imputers and recent neural network-based methods across extensive experimental settings. These results establish MissBGM as a principled and scalable solution for modern missing data imputation. The code for MissBGM is open sourced at this https URL.
[AI-216] From Cortical Synchronous Rhythm to Brain Inspired Learning Mechanism: An Oscillatory Spiking Neural Network with Time-Delayed Coordination
【速读】:该论文旨在解决如何在神经网络中模拟人类认知过程中基于脉冲动力学的协同同步机制,以实现高效的信息处理与高级认知功能。其核心问题是现有深度学习模型难以捕捉脑内信息编码中同时依赖发放率和精确突触时序的特性,且缺乏对分布式神经电路中自组织同步动态的建模能力。解决方案的关键在于提出一种基于脉冲同步的神经网络(S2-Net),通过微观层面的脉冲神经元动力学与宏观层面振荡同步机制之间的迭代上下文交互,使认知级神经同步性得以涌现;其中,利用时间延迟同步形式建模部分和瞬时同步状态,并结合有限记忆窗口累积历史脉冲活动,从而实现对异质神经放电的顶层调控,最终以节奏性时序作为控制机制提升信息处理效率。
链接: https://arxiv.org/abs/2605.01656
作者: Tingting Dan,Guorong Wu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 6 figures
Abstract:Human cognition emerges from coordinated spiking dynamics in distributed neural circuits, where information is encoded via both firing rates and precise spike timing determined by brain rhythms. Inspired by this notion, we propose a brain-inspired learning primitive in which cognition-level neural synchrony emerges through iterative bottom-up and top-down interactions between micro-scale dynamics of spiking neurons and a macro-scale mechanism of oscillatory synchronization. Specifically, we model each parcel (e.g., a cortical region or an image pixel) in the target system as a spiking neuron embedded in a predefined connectivity scaffold. Low-level information is encoded in a spatiotemporal domain, where neurons are selectively grouped and fire spontaneously over time through self-organized dynamics. In the bottom-up route, oscillatory synchronization is formed from past spiking activity accumulated over a finite memory window. Since brain dynamics operate in a regime of partial and transient synchronization rather than global phase locking, we model oscillatory coordination using a time-delayed synchronization formulation, which enables a top-down modulation of heterogeneous neural spiking for a large-scale distributed system. Together, we devise a spiking-by-synchronization neural network (S2-Net) that uses rhythmic timing as a control mechanism for efficient information processing. Promising results have been achieved across a broad range of tasks, including neural activity decoding, energy-efficient signal processing, temporal binding and semantic reasoning.
[AI-217] MU-SHOT-Fi: Self-Supervised Multi-User Wi-Fi Sensing with Source-free Unsupervised Domain Adaptation
【速读】:该论文旨在解决基于WiFi信道状态信息(CSI)的人类活动识别(HAR)中深度学习模型在跨环境、多用户场景下泛化能力差的问题,尤其在源域标签数据受限(如隐私限制)时难以进行有效迁移。其核心挑战在于:多用户活动导致CSI信号纠缠、域偏移显著,且缺乏目标域标注数据。解决方案的关键在于提出一种无源域监督的无监督域适应框架MU-SHOT-Fi,通过两个创新机制实现稳定适配:一是采用占用加权的信息最大化策略,在不干扰主导类别的情况下对可能被占用的时间槽施加多样性正则化,防止模型坍塌;二是引入二值旋转预测作为空间自监督任务,利用CSI的频时结构学习域不变特征。此外,在单用户场景下进一步提出SU-SHOT-Fi,结合对比预测编码以增强时间一致性建模。实验表明,该方法在多个跨域设置下均能有效恢复多用户精确活动分类性能,同时保持高精度占用估计并避免向主导类别偏移。
链接: https://arxiv.org/abs/2605.01369
作者: Ahmed Y. Radwan,Hina Tabassum
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Deep learning has been widely adopted for WiFi CSI-based human activity recognition (HAR) due to its ability to learn spatio-temporal features in a privacy-preserving and cost-effective manner. However, DL-based models generalize poorly across environments, a challenge amplified in multi-user settings where overlapping activities cause CSI entanglement and domain shifts. Practical deployments often limit access to labeled source data due to privacy constraints, motivating source-free adaptation using only unlabeled target-domain CSI and a pre-trained source model. In this paper, we propose MU-SHOT-Fi, a source-free unsupervised domain adaptation framework for single- and multi-user Wi-Fi sensing. MU-SHOT-Fi employs permutation-invariant set prediction with Hungarian matching during source training, followed by frozen-classifier backbone adaptation in the target domain. To enable stable adaptation without labels, we introduce occupancy-weighted information maximization that prevents model collapse by focusing diversity regularization on likely-occupied slots while excluding the dominant class from marginal entropy. Additionally, we employ binary rotation prediction as spatial self-supervision that exploits CSI frequency-time structure to learn domain-invariant features. For single-user scenarios, we introduce SU-SHOT-Fi by replacing occupancy weighting with standard information maximization and incorporating contrastive predictive coding to exploit temporal consistency. Extensive experiments on the WiMANS and Widar 3.0 datasets across cross-environment, cross-frequency, cross-orientation, and combined domain shifts demonstrate that MU-SHOT-Fi effectively recovers multi-user exact-activity classification performance under large domain shifts while maintaining accurate occupancy estimation and preventing collapse toward dominant classes.
[AI-218] Spectral- and Energy-efficient Multi-BS Multi-RIS Pinching-antenna Systems: A GNN-based Approach
【速读】:该论文旨在解决多基站(multi-BS)与多可重构智能表面(multi-RIS)辅助的针状天线(PA)系统中联合优化问题,即如何在满足互PA间距、功率预算及单位模相移约束条件下,通过协同下行传输最大化系统和速率(SR)与能量效率(EE)。其关键解决方案是提出一种三阶段图神经网络(GNN),该模型融合异构与同构图表示结构,并以端到端无监督方式训练,从而有效处理PA位置部署、RIS相位配置、发射波束赋形及基站-用户设备(BS-UE)关联等高度耦合的混合变量优化问题。数值结果表明,该方案在性能、泛化能力和推理速度方面均显著优于现有基准方法。
链接: https://arxiv.org/abs/2605.01307
作者: Changpeng He,Yang Lu,Wei Chen,Bo Ai,Arumugam Nallanathan,Zhiguo Ding
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:This paper investigates coordinated downlink transmission in a multi-base station (multi-BS) multi-reconfigurable intelligent surface (multi-RIS)-assisted pinching-antenna (PA) system, where each user equipment (UE) is associated with a single BS and each BS is equipped with movable PAs deployed on parallel waveguides. We formulate sum rate (SR) and energy efficiency (EE) maximization problems by jointly optimizing PA placement, RIS phase shifts, transmit beamforming, and BS-UE association under constraints of inter-PA spacing, power budget, and unit-modulus phase shift. To address the resulting highly coupled mixed-variable problem, we propose a three-stage graph neural network (GNN) that integrates heterogeneous and homogeneous graph representations and is trained end-to-end in an unsupervised manner. Extensive numerical results demonstrate that the proposed three-stage GNN consistently outperforms representative system and learning baselines, generalizes well to unseen numbers of UEs, RISs, and BSs, and maintains millisecond-level inference time. Besides, the results validate the effectiveness of the proposed design from both system and architectural perspectives. Moreover, PAs are shown to enhance SR and EE, and the performance gain is enlarged with increasing number of PAs.
[AI-219] A Target-Free Harmonization Method for MRI
【速读】:该论文旨在解决多中心磁共振成像(MRI)中存在的域偏移(domain shift)问题,即由于扫描参数、序列或设备差异导致同一受试者在不同机构获取的图像外观不一致,从而影响深度学习模型在跨机构数据上的泛化性能。传统图像调和方法通常依赖于目标域数据进行训练或测试,这会引发患者隐私泄露风险并限制其在临床环境中的部署。本文提出的解决方案——TgtFreeHarmony,关键在于无需访问目标域数据即可实现图像调和:通过基于解耦生成器构建的MRI域风格流形,利用贝叶斯优化搜索最接近目标域风格的图像表示,该优化过程由一个在目标域数据上训练的下游任务模型(如脑组织分割)的性能引导,从而实现源域图像向目标域风格的有效迁移,同时保留生物结构信息,为保护数据隐私的临床级图像调和提供了新范式。
链接: https://arxiv.org/abs/2605.01282
作者: Minjun Kim(1),Dong Ju Mun(1),Hwihun Jeong(2),Hangyeol Park(1),Haechang Lee(1),Se Young Chun(1),Jongho Lee(1) ((1) Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea, (2) Department of Psychiatry, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注: 37 pages, 10 figures
Abstract:In MRI, variations in scan parameters, sequence, or hardware can lead to discrepancies in image appearance, even for the same subject. These inconsistencies, known as domain shifts, can hinder image analysis and degrade the performance of deep learning models trained on data from specific target domains. MRI image harmonization aims to address these issues by aligning source domain images to the target domain images while preserving biological information such as anatomical structures. However, most existing harmonization approaches require access to both source and target domain data in training or test time. This dependence induces data sharing between institutions, raising concerns about patient privacy and substantially limiting the harmonization approaches that can be practically deployed in clinical settings. To overcome these limitations, we introduce TgtFreeHarmony, the harmonization framework tailored for target-free scenarios, eliminating the need for target domain data and any data sharing, enabling privacy-preserving harmonization directly within the source institution. Our approach estimates the target domain style by searching the manifold of MRI domain style constructed via a disentanglement-based generator using Bayesian optimization guided by the performance of a downstream task model, which is trained on target domain data. We evaluated our method on the brain tissue segmentation task across multiple institutes and demonstrated that it effectively harmonizes source images into target images, leading to improved downstream task performance. By enabling harmonization without any access to target-domain data, TgtFreeHarmony establishes a new direction of harmonization preserving data privacy that can be realistically deployed within clinical environments.
[AI-220] Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models
【速读】:该论文旨在解决现有无线基础模型在信道状态信息(CSI)建模中因采用静态或一维位置编码而导致的外推能力和零样本泛化性能不足的问题。这些问题源于传统位置编码未能体现无线信道内在的物理特性,如相对衰减规律、三维时空频结构以及场景依赖性,从而限制了模型对不同天线规模、移动场景和频段的适应能力。解决方案的关键在于提出一种物理对齐的自适应3D-RoPE(Adaptive 3D-Rotary Positional Encoding)机制,其核心创新包括:一个可学习的轴解耦三维频率库,用于显式解耦多维相位依赖关系;以及一个轻量级的、基于通道条件的控制器,通过紧凑的全局CSI描述符动态调节先验信息,使位置编码从静态组件转变为具备相干感知能力的动态归纳偏置,从而有效捕捉异构信道物理特性并提升模型的泛化与外推性能。
链接: https://arxiv.org/abs/2605.00968
作者: Chenyu Zhang,Xinchen Lyu,Chenshan Ren,Shuhan Liu,Qimei Cui
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures
Abstract:Positional encoding plays a pivotal role in determin?ing the extrapolation and generalization performance of wireless foundation models for channel state information (CSI) modeling, latent characterization, and task-specific prediction. However, existing CSI models inherit static or one-dimensional positional priors from natural language and vision architectures, which fundamentally misalign with the intrinsic physics of wireless channels by lacking explicit relative decay, collapsing the 3D spatio-temporal-frequency structure, and remaining scenario?rigid. This paper proposes Adaptive 3D-RoPE, a physics-aligned rotary positional encoding that establishes the structural corner?stone for wireless foundation models. The framework integrates a learnable, axis-decoupled 3D frequency bank to explicitly disentangle multi-dimensional phase dependencies, coupled with a lightweight channel-conditioned controller that dynamically modulates the prior via compact global CSI descriptors. This sample-adaptive mechanism transforms positional encoding from a static transformer component into a dynamic, coherence-aware inductive bias to resolve heterogeneous channel physics. Extensive experiments across 100 datasets demonstrate the superiority of the proposed scheme in both scale extrapolation and zero-shot generalization. Compared to the state-of-the-art, our method achieves up to a 10.7 dB reduction in normalized mean square error (NMSE) under 8 times antenna scale extrapolation. Given the same CSI input scales, our method can also improve zero-shot NMSE by 1.07 dB across unseen mobility scenarios and 0.90 dB in low-frequency-to-millimeter-wave tasks.
[AI-221] Co-Generative De Novo Functional Protein Design
【速读】:该论文旨在解决从头功能蛋白设计(de novo functional protein design)中难以同时实现功能性和折叠性(foldability)的问题。现有方法要么采用直接的功能到序列映射,要么使用结构与序列解耦生成策略,但往往无法兼顾两者。其解决方案的关键在于提出了一种协同生成蛋白语言模型 CodeFP(Co-generative protein language model),该模型能够同时解码序列和结构标记(token),并通过引入功能局部结构来增强功能语义编码,从而改善扁平编码向结构标记的次优转换问题;同时,通过辅助功能监督机制缓解因“多对一”结构到标记映射导致的训练歧义,最终在功能性一致性和折叠性上显著优于最强基线模型。
链接: https://arxiv.org/abs/2605.00948
作者: Xinrui Chen,Yizhen Luo,Siqi Fan,Zaiqing Nie
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:
Abstract:De novo functional protein design aims to generate protein sequences that realize specified biochemical functions without relying on evolutionary templates, enabling broad applications in biotechnology and medicine. Existing approaches adopt either direct function-to-sequence mapping or decoupled structure-sequence generation strategies but often fail to achieve functionality and foldability simultaneously. To address this, we propose CodeFP, a Co-generative protein language model for de novo Functional Protein design that simultaneously decodes sequence and structure tokens, thereby enabling superior simultaneous realization of functionality and foldability. CodeFP utilizes functional local structures to enrich functional semantic encodings, overcoming the suboptimal translation of flat encodings into structure tokens, while introducing auxiliary functional supervision to alleviate training ambiguity stemming from the one-to-many structure-to-token mapping. Extensive experiments show that CodeFP consistently achieves average improvements of 6.1% in functional consistency and 3.2% in foldability over the strongest baseline.
[AI-222] CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation
【速读】:该论文旨在解决单细胞多组学数据(包括转录组、染色质可及性和表面蛋白组)与空间组学数据在统一表示空间中的整合难题,以及现有模型在基因扰动预测中因突变式token操作导致的分布外伪影问题。其核心解决方案是提出CellxPert——一个可扩展的多模态基础模型,通过联合编码scRNA-seq、ATAC-seq和CITE-seq数据,并直接嵌入MERFISH和成像质谱流式(imaging mass-cytometry)的空间视觉层,实现跨模态对齐;关键创新在于采用基于Metropolis-Hastings采样的扰动推理机制,利用模型自身的掩码条件分布作为提议核,在马尔可夫链过程中生成生物可解释的转录组状态轨迹,从而避免传统方法中因突变式token处理带来的偏差,显著提升细胞类型注释精度、扰动响应预测能力和多组学整合效率。
链接: https://arxiv.org/abs/2605.00930
作者: Andac Demir,Erik W. Anderson,Jeremy L. Jenkins,Srayanta Mukherjee
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:In this work, we introduce CellxPert, a scalable multimodal foundation model that unifies single-cell and spatial multi-omics within a common representation space. CellxPert jointly encodes transcriptomic (scRNA-seq), chromatin-accessibility (ATAC-seq), and surface-proteomic (CITE-seq) measurements, while directly incorporating MERFISH and imaging mass-cytometry data as 2D or 3D spatial-visual layers. CellxPert facilitates four key downstream tasks out of the box: (i) cell-type annotation across a broad ontology of 154 largely overlapping identities – the largest label space addressed to date and a stringent test of fine-grained discrimination, (ii) efficient fine-tuning using Low Rank Adaptation (LoRA), (iii) genome-wide transcriptomic response prediction to in-silico perturbations (ISP), and (iv) seamless multi-omic integration across various assays and platforms. Unlike current single-cell foundation models, which approximate gene perturbations by deleting or reordering tokenized gene expression ranks, CellxPert employs a Metropolis-Hastings sampler whose proposal kernel uses the model’s masked conditional distributions to transition to new transcriptomic states conditioned on the perturbed genes. This Markov-chain procedure mitigates out-of-distribution artifacts introduced by abrupt token manipulation and produces trajectories that are biologically interpretable. Evaluations on PBMC68K, Replogle Perturb-seq, Systema, and BMMC benchmarks show that CellxPert surpasses classical and state-of-the-art baselines in cell-type annotation, perturbation response prediction, and multi-omic integration.
[AI-223] ransfer Learning for Tonal Noise Prediction in VRF Units Using Thermodynamic and Vibration Signals
【速读】:该论文旨在解决变制冷剂流量(VRF)室外机组中由双转子压缩机产生的二次谐波(2f)噪声预测难题,该噪声作为主要低频噪声源,其幅值随环境热负荷和阀门开度剧烈波动,传统基于机理的模型难以准确评估。解决方案的关键在于提出一种基于域不变偏最小二乘法(Domain-invariant Partial Least Squares, Di-PLS)的无监督迁移学习方法,通过提取跨工况共性特征并最小化源域与目标域之间的分布差异,显著提升模型在新工况下的泛化能力;实验表明,基于加速度信号构建的Di-PLS模型预测误差始终控制在3 dB以内,优于基于热力学信号的模型,揭示了结构振动相较于热力学状态对声辐射具有更强且更直接的因果关联。
链接: https://arxiv.org/abs/2605.00895
作者: ZhiWei Su,Ding Wang,Yuan Guo,Yang Qiao,HongJun Cao
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The second-order harmonic (2f) component generated by twin-rotary compressor is a dominant low-frequency noise source of variable refrigerant flow (VRF) outdoor units, yet its amplitude fluctuates strongly with environmental thermal load and valve opening, making it difficult to assess accurately using conventional mechanism-based models. This paper proposes an unsupervised transfer learning method based on Domain-invariant Partial Least Squares (Di-PLS) to accurately predict 2f noise levels under new conditions using different signals. Prediction models utilizing thermodynamic signals and acceleration signals are constructed respectively, and the generalization performance of the proposed Di-PLS is systematically compared with traditional Partial Least Squares (PLS). Results demonstrate that Di-PLS significantly outperforms PLS by extracting cross-condition common features and minimizing the distribution discrepancy between the source and target domains. Specifically, the acceleration-based Di-PLS model achieves the best performance, maintaining prediction errors within 3 dB for all test cases. This superiority over thermodynamic-based models highlights a physical insight: while thermodynamic states drive dynamic changes, structural vibration possesses a stronger and more direct causal link to acoustic radiation.
[AI-224] An Algorithm for On-Sensor Agnostic Detection of Changes in Human Activity for Ultra-Low-Power Applications
【速读】:该论文旨在解决可穿戴设备在运行基于惯性测量单元(Inertial Measurement Units, IMUs)的人体活动识别(Human Activity Recognition, HAR)时存在的能源浪费问题,即系统对每个时间窗口持续执行分类任务,即使在长时间无活动变化的情况下也是如此。解决方案的关键在于提出一种轻量级的“变化检测门”(change-detection gate),该门采用非参数化动态模板匹配算法,每步仅需约16k FLOPs,无需离线训练且不依赖预定义的目标活动类别;当检测到真实活动变化时才触发完整的HAR网络,从而在真实监测场景中将计算负载降低超过67%。该方法在智能眼镜、智能手表和智能手机数据上均表现出高灵敏度(UCA-EHAR数据集达98%,WISDM数据集达97%)与良好特异性(分别为75%和76%),验证了其鲁棒性和跨设备适应能力。
链接: https://arxiv.org/abs/2605.00870
作者: Sara Rimoldi,Arianna De Vecchi,Hazem Hesham Yousef Shalby,Federica Villa
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to 2026 International Conference on Automatic Face and Gesture Recognition (FG)
Abstract:Wearable devices running Human Activity Recognition(HAR) on Inertial Measurement Units~(IMUs) waste energy by performing continuous classification for each window, even during long periods of unchanged activity. We address this with a lightweight change-detection gate: a non-parametric algorithm based on dynamic template matching that runs continuously at only approximately 16kFLOPs per step, requires no offline training, and does not need prior definition of target activity classes. The gate invokes the full HAR network only when it detects an activity change, reducing the computational load by over 67% in realistic monitoring settings. The algorithm is evaluated on smart glasses, smartwatch, and smartphone data, requiring only a brief device-specific calibration phase. The gate achieves 98% sensitivity on UCA-EHAR, ensuring no genuine activity transition is missed, while 75% specificity keeps unnecessary HAR invocations low. Results on WISDM are 97% sensitivity and 76% specificity, demonstrating robustness and flexibility to various settings.
[AI-225] Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
【速读】:该论文旨在解决文本到语音(Text-to-Speech, TTS)合成质量评估缺乏系统性指标的问题,提出以“声纹映射”(voice mapping)作为评估框架,通过量化声学特征来客观衡量TTS模型的语音自然度与表现力。其解决方案的关键在于引入三个核心声学指标:峰因子(crest factor)、频谱平衡(spectrum balance)和倒谱峰值显著度(cepstral peak prominence, CPP),并基于这些指标对六种代表性TTS模型进行对比分析,揭示不同模型在语音动态范围、软发音处理能力及人声自然度方面的差异,从而为TTS系统的优化提供可量化的依据。
链接: https://arxiv.org/abs/2605.00861
作者: Huanchen Cai,Sten Ternström
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:This study investigates voice mapping as an evaluation framework for text-to-speech (TTS) synthesis quality. The study analyzes six TTS models, including historical and recent ones. The metrics are crest factor, spectrum balance, and cepstral peak prominence (CPPs). We investigated 6 influential TTS models: Merlin, Tacotron 2, Transformer TTS, FastSpeech 2, Glow-TTS, and VITS. The results demonstrate that voice range serves as a primary indicator of model capability, with VITS showing the largest range among tested models. Glow-TTS exhibited superior performance in soft phonation, indicated by higher spectrum balance, despite limited voice range. The results showed that the CPPs values between 7-8 dB indicate natural voice quality, while with CPPs exceeding 10 dB, the speech tends to sound robotic. These findings underscore the need for voice mapping to evaluate vocal effort, and capture how TTS systems handle voice dynamic and expressiveness.
[AI-226] Foundation Model Guided Dual-Branch Co-Adaptation for Source-Free EEG Decoding
【速读】:该论文旨在解决源域不可获取条件下跨被试脑电(EEG)解码的泛化能力不足问题,即在无源数据访问的情况下,如何利用预训练模型实现更可靠的跨域迁移。现有方法受限于源预训练模型内部知识的局限性,导致伪标签质量低、误差累积严重,进而影响性能。其解决方案的关键在于提出FUSED框架,通过双分支协同适应机制(Foundation-guided Source-free EEG Decoding),将大规模EEG基础模型(EEG Foundation Model, FM)与轻量级专业模型(Specialist Model, SM)结合:一方面引入线性与原型视角的协同适应机制以生成跨分支伪标签,另一方面设计共识过滤机制和两阶段伪标签精炼策略抑制错误传播;最终通过互信息最大化校准FM决策边界,并实施从FM到SM的知识蒸馏,形成“校准-蒸馏”范式,从而实现高效、鲁棒且隐私保护的跨被试EEG解码。
链接: https://arxiv.org/abs/2605.00857
作者: Peiliang Gong,Han Zhang,Zhen Jiang,Chenyu Liu,Ziyu Jia,Xinliang Zhou,Daoqiang Zhang,Xiaoli Li
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Source-free domain adaptation (SFDA) provides a practical solution to cross-subject EEG decoding by adapting source-pretrained models to unlabeled target domains without accessing source data. However, existing SFDA methods rely solely on the limited internal knowledge of source-pretrained models, leading to inferior cross-domain generalization and unreliable pseudo-labels. Although EEG Foundation Models (FMs) pretrained on large-scale data exhibit strong generalizability, their potential in SFDA remains largely unexplored. To this end, we propose FUSED, a Foundation-guided Source-free EEG Decoding framework that integrates a large-scale FM with a compact Specialist Model (SM) via dual-branch co-adaptation. Specifically, we introduce a Co-adaptation mechanism equipping both branches with linear and prototype views, enabling cross-branch pseudo-label generation. Additionally, we design a Consensus Filtering Mechanism that exploits the FM’s inherent stability to identify high-quality samples, along with a Two-Stage Pseudo-Label Refinement scheme to suppress error accumulation through cross-branch arbitration. Finally, we calibrate the FM’s decision boundaries via mutual information maximization with the SM, followed by knowledge distillation from FM to SM, forming a principled calibrate-then-distill pipeline. To our knowledge, FUSED is the first work to leverage EEG FMs within the SFDA framework for cross-subject EEG decoding. Extensive experiments across three EEG paradigms, including motor imagery, emotion recognition, and SSVEP, demonstrate consistent state-of-the-art performance, validating the effectiveness of foundation-guided synergy for robust and privacy-preserving EEG decoding.
[AI-227] Earth System Foundation Model (ESFM): A unified framework for heterogeneous data integration and forecasting
【速读】:该论文旨在解决传统气象模型在跨任务泛化能力不足、数据异构性处理困难以及极端天气事件预测精度有限的问题。其核心解决方案是提出地球系统基础模型(Earth System Foundation Model, ESFM),通过引入多模态数据兼容的编码方案与训练协议,支持包括卫星遥感、地面站点等存在时空缺失值的数据统一建模;采用轴向注意力机制(axial attention)捕捉变量间复杂依赖关系,并结合个体变量分词策略实现训练过程中变量组合灵活调整,从而提升模型对无观测区域或压力层的预测能力;同时利用基于自适应层归一化的集成方法将确定性模型转化为概率性模型,增强不确定性量化能力。上述设计使ESFM在保持长期稳定性的同时,显著拓展了下游任务适配范围并提升了极端天气事件(如超强台风和突发平流层变暖)的定位与强度预测准确性。
链接: https://arxiv.org/abs/2605.00850
作者: Firat Ozdemir,Yun Cheng,Salman Mohebi,Fanny Lehmann,Simon Adamov,Zhenyi Zhang,Leonardo Trentini,Dana Grund,Oliver Fuhrer,Torsten Hoefler,Siddhartha Mishra,Sebastian Schemm,Benedikt Soja,Mathieu Salzmann
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: ESFM is available on this https URL . 48 pages, 29 figures, 18 tables
Abstract:Foundation models (FMs) for the Earth system learn statistical relationships between physical variables across massive datasets to enable versatile downstream applications through finetuning, separating them from task-specific weather models. Here, we introduce Earth System Foundation Model (ESFM), a fully open model building on the 3D Swin UNet backbone of the pioneering Aurora model. ESFM introduces extensions that increase functionality and foster adoption in climate sciences. First, the encoding scheme and training protocols have been extended to handle diverse datasets, including those containing missing values across all spatio-temporal dimensions such as satellite data, as well as station data, all under one backbone. Axial attention is introduced to capture inter-variable dependencies. As a result ESFM skillfully predicts variables in regions or on pressure levels where no data is present at the initial time, while preserving inter-variable relationships, for example between temperature, pressure, and humidity. Individual variable tokenization enables different sets of variables to be shuffled during training and simplifies the process of building extensions for new downstream tasks. Adaptive layer norm-based ensembles allow for a simple yet effective way to transform deterministic ESFM to a probabilistic FM. We present findings using dense gridded data (ERA5, CMIP6), regionally masked dense data, sparse gridded MODIS satellite data, and station data. Results demonstrate competitive or superior performance relative to state-of-the-art benchmarks. Case studies of Super Typhoon Doksuri (2023) and 2024 sudden stratospheric warming events show accurate positional and magnitude estimations of extreme weather. ESFM retains the strengths of previous foundation models, such as long-term stability, but facilitates application to a variety of downstream tasks.
机器学习
[LG-0] Unsupervised Machine Learning for Detecting Structural Anomalies in European Regional Statistics
链接: https://arxiv.org/abs/2605.02884
作者: Bogdan Oancea
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ensuring the coherence of regional socio-economic statistics is a central task for national statistical institutes. Traditional validation tools, such as range edits, ratio checks, or univariate outlier detection, are effective for identifying extreme values in individual series but are less suited for detecting unusual combinations of indicators in high-dimensional settings. This paper proposes an unsupervised machine learning framework for identifying structurally atypical regional profiles within Europe using publicly available Eurostat data. We construct a cross-sectional dataset of NUTS2 regions (2022) covering four key indicators: GDP per capita in PPS, unemployment rate, tertiary educational attainment, and population density. We apply and compare five anomaly detection techniques, univariate z-scores, Mahalanobis distance, Isolation Forest, Local Outlier Factor, and One-Class SVM, and classify a region as a structural anomaly if it is flagged by at least three of the five methods. The findings show that machine learning methods identify a consistent set of regions whose multivariate profiles diverge substantially from the EU-wide pattern. These include both highly developed metropolitan economies (Brussels, Vienna, Berlin, Prague) and regions with persistent socio-economic disadvantages (Central and Western Slovakia, Northern Hungary, Castilla-La Mancha, Extremadura), as well as Istanbul, whose profile differs markedly from EU capital regions. Importantly, these anomalies do not necessarily signal data quality issues; rather, they reflect meaningful structural divergence that warrants analytical or policy attention. The proposed framework is fully reproducible, scalable, and compatible with existing validation workflows, offering a flexible tool for early detection of unusual regional configurations within the European Statistical System.
[LG-1] rust but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
链接: https://arxiv.org/abs/2605.02853
作者: Arian Eamaz,Farhang Yeganegi,Mojtaba Soltanalian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Understanding whether deep neural networks are effectively optimized remains challenging, as training occurs in highly nonconvex landscapes and standard metrics provide limited visibility into layer-wise learning quality. This challenge is particularly acute for transformer-based language models, where training is expensive, models are often reused in frozen form, and poorly optimized layers can silently degrade performance. We propose a layer-wise peeling framework for monitoring training dynamics, in which each transformer layer is locally optimized against intermediate representations of the trained model. By constructing lightweight, layer-specific reference solutions and projecting layers onto multiple intermediate outputs via different permutations, we obtain achievable baselines that enable fine-grained diagnosis of under-optimized layers. Experiments on decoder-only transformer models show that these layer-wise reference bounds can match or even surpass the trained model at various stages of training, exposing inefficiencies that remain hidden in aggregate loss curves. We further demonstrate that this analysis remains effective under binarization and quantized settings, where training dynamics are particularly fragile. Across all numerical results, the proposed bounds consistently separate apparent convergence from effective optimality, highlighting optimization opportunities that are invisible when relying on training loss alone.
[LG-2] A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification
链接: https://arxiv.org/abs/2605.02836
作者: Sushovan Majhi,Atish Mitra,Žiga Virk,Pramita Bagchi
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:
Abstract:We introduce PLACE (Persistence-Landmark Analytic Classification Engine), a closed-form pipeline for classifying point clouds and graphs through their persistent-homology signatures. Three quantitative guarantees – a margin-based excess-risk rate, a closed-form descriptor-selection rule, and a per-prediction certificate – are derived from training labels alone, with no learned weights or held-out calibration. The embedding sums Mitra-Virk single-point coordinate functions over a sparse landmark grid; closed-form weights maximize a structural distortion constant \lambda(\nu) (a Lipschitz lower bound on \mathcalD_n under non-interference). (i) An O(kR/(\Delta\sqrtm_\min)) margin bound, driven by class-mean separation \Delta and embedding radius R , matched by a sample-starved minimax lower bound. (ii) The Mahalanobis margin under Ledoit-Wolf-shrunk covariance is the strongest closed-form descriptor selector on a heterogeneous 64-descriptor chemical-graph pool (mean Spearman \rho \approx +0.54 across 10 benchmarks, positive on 9 of 10); the isotropic surrogate \Delta/\sqrt\ell admits a closed-form selection-consistency rate on homogeneous (14-15 descriptor) protein/social pools. (iii) A training-time-decided certificate with no per-prediction overhead, in non-asymptotic Pinelis and asymptotic Gaussian plug-in forms. Empirically, PLACE is the strongest diagram-based method on Orbit5k and matches the strongest topology-based baseline within statistical noise on MUTAG and COX2. The remaining gaps fall into two diagnosable regimes: descriptor blindness on NCI1/NCI109, and pool-coverage limits elsewhere. Both radii exceed the firing threshold \hat\Delta/2 on every benchmark at our training-set sizes, dominated by the \sqrt\ell scaling of the multivariate-norm bound; the per-prediction certificate is constructive but not yet operational at these sizes.
[LG-3] Adaptive Interpolation-Synthesis for Motion In-Betweening on Keyframe-Based Animation SIGGRAPH2026
链接: https://arxiv.org/abs/2605.02742
作者: Anton Raël,Julien Boucher,Antoine Lhermitte
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted to SIGGRAPH 2026 Conference Papers
Abstract:Motion in-betweening is one of the most artistically demanding and time consuming stages of 3D animation, where the expressivity and rhythm of motion are defined. The level of creative control it requires makes it a major production bottleneck, underscoring the need for intelligent tools that assist animators in this process. Although recent deep learning approaches have achieved strong results in motion synthesis and in-betweening, they assume data characteristics, motion styles, and problem formulations that diverge from professional animation workflows. To bridge this gap, we propose a method explicitly aligned with the constraints of motion in-betweening for keyframe-based animation in production environments. At its core, the Adaptive Interpolation-Synthesis (AIS) layer mirrors the animator’s creative process by dynamically balancing learned interpolation and direct pose synthesis. In addition, a domain-based input keypose schedule reflects the distribution of production data, improving stylistic consistency and alignment between training and real-world usage. Our method achieves state-of-the-art performance on production data; when integrated into Autodesk Maya, it enables animators to complete in-betweening tasks with a 3.5x speedup.
[LG-4] Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLM s
链接: https://arxiv.org/abs/2605.02735
作者: Xin Zhang,Qiqi Tao,Jiawei Du,Moyun Liu,Joey Tianyi Zhou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent–visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
[LG-5] Federated Reinforcement Learning for Efficient Mobile Crowdsensing under Incomplete Information
链接: https://arxiv.org/abs/2605.02705
作者: Sumedh J. Dongare,Patrick Weber,Andrea Ortiz,Walid Saad,Oliver Hinz,Anja Klein
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Mobile crowdsensing (MCS) is a distributed sensing architecture that utilizes existing sensors on mobile units (MUs) to perform sensing tasks. A mobile crowdsensing platform (MCSP) publishes the sensing tasks and the MUs decide whether to participate in exchange for money. The MCS system is dynamic: the task requirements, the MUs’ availability, and their available resources change over time. The MUs aim to find an efficient task participation strategy to maximize their income while the MCSP focuses on maximizing the number of completed tasks. As optimal strategies require perfect non-causal information about the MCS system, which is unavailable in realistic scenarios, the main challenge is to find an efficient task participation strategy for the MUs under incomplete information. To this end, a novel fully decentralized federated deep reinforcement learning algorithm, FDRL-PPO, is proposed. FDRL-PPO enables every MU to learn its own task participation strategy based on its experiences, available resources, and preferences, without relying on perfect non-causal information about the MCS system. To replenish their batteries, the MUs rely on energy harvesting. As a result, their available energy varies over time, leading to varying availability and fragmented learning experiences. To mitigate these challenges, the proposed approach leverages federated learning, enabling MUs to collaboratively improve their models without sharing private raw data like their own experiences. By exchanging only learned models, MUs collectively compensate for individual limitations, and find more scalable, robust, and efficient task participation strategies. Comprehensive evaluations on both synthetic and real-world datasets show that FDRL-PPO consistently outperforms benchmark algorithms in terms of task completion ratio, fairness in task completion, energy consumption, and number of conflicting proposals.
[LG-6] MSMixer: Learned Multi-Scale Temporal Mixing with Complementary Linear Shortcut for Long-Term Time Series Forecasting
链接: https://arxiv.org/abs/2605.02689
作者: Ahmed Cherif
类目: Machine Learning (cs.LG)
*备注: 21 pages, 5 figures, 8 tables. Submitted to International Journal of Machine Learning and Cybernetics (Springer)
Abstract:Long-term time series forecasting requires models that simultaneously capture rapid oscillations, medium-range periodicities, and slowly evolving macro-trends from a fixed look-back window. Existing lightweight MLP-based models typically operate on a single temporal resolution, limiting their ability to explicitly model patterns at multiple scales. We propose MSMixer, a channel-independent multi-scale MLP architecture that addresses this limitation through three complementary innovations: (i) three parallel scale branches at down-sample factors 1x, 4x, 16x with independent MLP blocks, (ii) a learnable softmax gate that dynamically weighs branch outputs, and (iii) a DLinear complementary shortcut that provides full-window trend and seasonality context. MSMixer contains only 112K parameters at H=96 and runs at O(T) complexity. Evaluated on four ETT benchmarks with standard chronological splits and three random seeds, MSMixer achieves the lowest average MSE (0.357) among lightweight models, outperforming DLinear (0.386, -7.4%) and NLinear (0.365, -2.1%), winning 12 of 16 configurations. Against five Transformer-based baselines from the literature, MSMixer achieves best or second-best MSE in 9 of 16 configurations while using 5x fewer parameters than PatchTST. Ablation and sensitivity analyses confirm the complementary contributions of the multi-scale branches and the DLinear shortcut.
[LG-7] Spectral Model eXplainer: a chemically-grounded explainability framework for spectral-based machine learning models
链接: https://arxiv.org/abs/2605.02684
作者: Jose Vinicius Ribeiro,Rafael Figueira Goncalves,Fabio Luiz Melquiades,Sylvio Barbon Junior
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注:
Abstract:Spectral-based machine learning models have been increasingly deployed in chemometrics and spectroscopy, where predictive accuracy is as important as explainability. Current employed eXplainable Artificial Intelligence (XAI) methods are largely adapted from tabular or generic multivariate domains, assigning relevance to isolated spectral variables rather than to the chemically meaningful spectral zones. Widely adopted tools such as SHapley Additive exPlanations (SHAP), Permutation Feature Importance (PFI), and Variable Importance in Projection scores (VIP) were not designed for the physical continuity and high collinearity of spectral data, and their variable-level outputs require post-hoc aggregation to recover zone-level information. This study introduces the Spectral Model eXplainer (SMX), a post-hoc, global, model-agnostic XAI framework that explains spectral classifiers through expert-informed spectral zones. SMX summarizes each zone via PCA, defines quantile-based logical predicates, estimates predicate relevance with perturbation in stochastic subsamples, and aggregates bag-wise rankings in a directed weighted graph summarized by Local Reaching Centrality. A key component is threshold spectrum reconstruction, which back-projects predicate boundaries to the original spectral domain in natural measurement units, enabling direct visual comparison with measured spectra. The method was evaluated on eight real spectral datasets (six based on X-ray Fluorescence–XRF and two based on Gamma-ray Spectrometry) and one synthetic benchmark with known gr
[LG-8] CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation ICML2026
链接: https://arxiv.org/abs/2605.02657
作者: Ziyang Yu,Yi He,Wenbing Huang,Wen Yan,Yang Liu
类目: Machine Learning (cs.LG)
*备注: ICML 2026 poster
Abstract:Estimating free energy differences quantifies thermodynamic preferences in molecular interactions, which is central to chemistry and drug discovery. Despite fruitful progress, existing methods still face key limitations: classical computational approaches remain prohibitively expensive due to their reliance on extensive molecular dynamics simulations, while deep learning-based methods are constrained by either less-expressive generative models or input dimensions tied to a specific system, resulting in negligible generalization. To address these challenges, we propose CARD, a generative framework that employs a novel radix-based decomposition to bijectively convert 3D coordinates into mixed discrete-continuous sequences, enabling coarse-to-fine autoregressive modeling with enhanced expressiveness. Notably, the model corresponds to a distribution with zero free energy, serving as a proposal for absolute free energy computation of arbitrary systems without relying on alchemical pathways. Experiments across diverse tasks demonstrate that CARD matches the accuracy of classical computational methods on unseen systems with diverse topologies, while achieving an approximately 40-fold speedup in inference.
[LG-9] ARA: Agent ic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review
链接: https://arxiv.org/abs/2605.02651
作者: Kevin Riehl,Andres L. Marin,Nikofors Zacharof,Fan Wu,Patrick Langer,Robert Jakob,Anastasios Kouvelas,Georgios Fontaras,Michail A. Makridis
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:
Abstract:Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a paper, ARA extracts a directed workflow graph linking sources, methods, experiments, and outputs, then evaluates its reconstructability using structural and content-based scores for reproducibility assessments. Experiments on 213 ReScience C articles - the largest cross-domain benchmark of human-validated computational reproducibility studies considered to date - demonstrate ARA’s generalizability and consistent workflow reconstruction and assessment across LLMs, model temperatures, and scientific domains. ARA achieves ~61% accuracy on three benchmarks, and the highest accuracy reported on ReproBench (60.71% vs. 36.84%) and GoldStandardDB (61.68% vs. 43.56%), highlighting its potential to complement human review at scale and enabling next-generation peer review. Code and Data available: this https URL.
[LG-10] CNNs for Vis-NIR Chemometrics: From Contradiction to Conditional Design
链接: https://arxiv.org/abs/2605.02636
作者: Dário Passos
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注: 19 pages, 1 figure, review article
Abstract:Near-infrared (NIR; a.k.a.\ NIRS) deep-learning studies in chemometrics increasingly report mutually inconsistent conclusions regarding convolutional neural network (CNN) design, including small versus large kernels, shallow versus deep architectures, raw spectra versus preprocessing, and single-domain training versus transfer learning. As a result, the same architecture can appear superior in one study and inferior in another, creating a practical impasse for chemometric practitioners. In this review, we argue that these contradictions are not evidence of irreconcilable methods but a structurally expected consequence of uncontrolled moderating variables. Specifically, we trace recurring disagreements to (i) the indirect nature of Vis–NIR measurement in water-dominated matrices, (ii) mismatch between effective receptive field (ERF) and the width of informative spectral structure, and (iii) validation design (including split strategy, hyperparameter tuning budget, and exposure to deployment-like shifts) acting as a hidden hyperparameter that can dominate model ranking. Building on evidence from published chemometrics and spectroscopy studies, we propose a conditional design framework that links architecture and preprocessing choices to spectral physics, dataset regime, and intended deployment scenario. Overall, the proposed perspective moves DL Chemometrics from template-driven architecture selection toward reproducible, physics-aware, and deployment-aligned model comparison.
[LG-11] Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
链接: https://arxiv.org/abs/2605.02626
作者: Inoussa Mouiche
类目: Machine Learning (cs.LG)
*备注: 21 pages
Abstract:Preference optimization has become a central paradigm for aligning large language models with human feedback. Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback by directly optimizing pairwise preferences, removing the need for reward modeling and policy optimization. However, recent work shows that DPO exhibits a squeezing effect, where negative gradients applied to rejected responses concentrate probability mass on high-confidence predictions while suppressing alternative responses. This phenomenon arises even in simple softmax models and can lead to systematic probability collapse during training. We introduce Gradient-Gated Preference Optimization (Gate-DPO), a method that stabilizes training by modulating rejected gradients according to the model’s probability geometry. When updates target extremely low-probability responses, the gate attenuates harmful gradients while preserving standard optimization behavior. Gate-DPO addresses this optimization pathology without modifying the underlying preference objective and is complementary to existing methods such as extended SFT, IPO, and Cal-DPO. Experiments across multiple architectures and preference datasets show that Gate-DPO consistently reduces squeezing and improves chosen-response likelihood. Mass-dynamics analysis further reveals healthier optimization behavior, with improved preferred responses and reduced suppression of the overall distribution. Notably, smaller gated models can exhibit stronger chosen-response improvements than larger ungated models, suggesting that controlling gradient dynamics, rather than scale alone, is key to stable and efficient alignment.
[LG-12] Selective Prediction from Agreement: A Lipschitz-Consistent Version Space Approach
链接: https://arxiv.org/abs/2605.02611
作者: Mohamadsadegh Khosravani
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider selective classification with abstention in the fixed-pool (or transductive) setting, where the unlabeled pool is given beforehand and only a subset of points can be queried for labels. Our main insight is to view selective prediction through agreement: given queried labels and Lipschitz margin constraints in an embedding space, the version space of Lipschitz-consistent classification heads is well defined. We obtain upper and lower Lipschitz margin bounds that define, for each pool point, a set of certified valid labels containing the prediction of every head in the version space. The model therefore predicts only when the label is forced (i.e., all consistent heads agree), and abstains otherwise. We also propose a monotone submodular geometric proxy for budgeted querying, and show that a greedy algorithm retains the standard approximation factor.
[LG-13] Gradient-Discrepancy Acquisition for Pool-Based Active Learning
链接: https://arxiv.org/abs/2605.02609
作者: Mohamadsadegh Khosravani,Sandra Zilles
类目: Machine Learning (cs.LG)
*备注:
Abstract:The effectiveness of active learning hinges on the choice of the acquisition criterion by which a learning algorithm selects potentially informative data points whose label is subsequently queried. This paper proposes a novel gradient-based acquisition criterion, derived from a generalization bound introduced by Luo et al. (2022). This criterion can be applied in lieu of uncertainty measures in uncertainty sampling, or incorporated into diversity-based methods that consider the spread of sampled points in addition to the uncertainty of their labels. We provide a theoretical justification of the proposed acquisition criterion, and demonstrate its effectiveness in an empirical evaluation.
[LG-14] Isotropic Fourier Neural Operators
链接: https://arxiv.org/abs/2605.02597
作者: Michael F. Staddon
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fourier Neural Operators are deep learning models that learn mappings between function spaces and can be used to learn and solve partial differential equations (PDEs), in some cases significantly faster than traditional PDE solvers. Within the model are Fourier layers, which apply linear transformations directly to the Fourier modes, with parameters depending on the wave numbers. However, most physical systems are isotropic, with the results being independent of the coordinate system chosen, but the linear transformations do not necessarily respect these symmetries. We propose a modification to the linear transformations that ensures spatial symmetries are respected, called the Isotropic Fourier Neural Operator, which both improves model performance and reduces the number of parameters by up to a factor of 16 in 2D and 96 in 3D.
[LG-15] HARMES: A Multi-Modal Dataset for Wearable Human Activity Recognition with Motion Environmental Sensing and Sound
链接: https://arxiv.org/abs/2605.02596
作者: Robin Burchard,Pascal-André Brückner,Marius Bock,Juergen Gall,Kristof Van Laerhoven
类目: Machine Learning (cs.LG)
*备注:
Abstract:With each sensing modality exhibiting inherent strengths and limitations, multi-modal approaches for wearable Human Activity Recognition (HAR) are becoming increasingly relevant – particularly for recognizing Activities of Daily Living (ADLs), where individual modalities often produce ambiguous signals for similar or complex activities. This work introduces HARMES, a multi-modal wearable dataset combining three wrist-recorded modalities: motion sensing via an Inertial Measurement Unit (IMU), atmospheric environmental sensors (humidity, temperature, and pressure), and audio. Collected from 20 participants performing household activities in their own homes, HARMES totals over 80 hours of recorded data, with approximately three hours of labeled activity data per participant across 15 ADL classes. To the best of our knowledge, HARMES is the first dataset to combine this particular sensor trio, and it is nearly six times larger than the previously largest wrist-inertial-acoustic HAR dataset. In an extensive benchmark, we evaluate cross-subject generalization and conduct an ablation study revealing that modality contributions are activity-dependent and can provide complementary value, particularly for activities that are ambiguous from motion data alone. HARMES is freely available at Zenodo, alongside example code for loading the dataset and training models on GitHub.
[LG-16] Gradient Boosted Risk Scores
链接: https://arxiv.org/abs/2605.02593
作者: Costa Georgantas,Jonas Richiardi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Risk scores are an interpretable and actionable class of machine learning models with applications in medicine, insurance, and risk management. Unlike most computational methods, risk scores are designed to be computed by a human by attributing points to a data sample based on a limited set of criteria. The most common approaches for generating risk scores use linear regressions to estimate the effect of selected variables. We propose a simple and effective approach towards building compact and predictive risk scores. We provide an algorithm based on gradient boosting that is capable of modeling nonlinear effects, along with a C++ implementation with Python and R bindings. Through extensive empirical evaluation on twelve tabular datasets spanning regression, classification, and time-to-event tasks, we show that our method achieves competitive predictive performance while producing substantially more compact scores than regression-based alternatives, with 60% fewer rules for classification tasks and 16% fewer rules for time-to-event tasks on average, compared to AutoScore.
[LG-17] StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
链接: https://arxiv.org/abs/2605.02568
作者: Jaber Jaber,Osama Jaber
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 11 pages, 3 figures, 7 tables, 2 algorithms, 36 references. Memory-bounded indexer kernel for DeepSeek-V4 CSA via chunked partition-merge top-k. Code: this https URL
Abstract:DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex runs the same indexer to S=1,048,576 with 6.21 GB peak HBM, a 32x regime extension. Set-overlap recall against the materialize ground truth is bit-exact at small S where both fit; across three 5-point design-space sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 with min recall at least 0.9980 in every cell. The chunked driver composes with TileLang’s pipelined attention kernel: at S=262,144 with V4-Flash dimensions, the materialize indexer paired with TileLang attention OOMs while the chunked indexer paired with the same attention runs in 1.97 s at 18.56 GB peak. Our contribution targets the indexer step; we make no claim of a faster attention kernel or of real-checkpoint end-to-end behavior. Code: this https URL.
[LG-18] Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators
链接: https://arxiv.org/abs/2605.02528
作者: Christian Jestel,Nicolas Bach,Marvin Wiedemann,Jan Finke,Peter Detzner
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation. We cross-evaluate five navigation policies on 1000 seeded maps per generator across three training seeds. Results show a strongly asymmetric cross-generator transfer: a specialist trained on sparse layouts falls to 3.3% success on mazes, whereas a policy trained on the combined generator set achieves 91.5 +/- 1.1% mean success. We further demonstrate that A* path-planner subgoal inputs are the dominant factor for robustness, raising success from the 90.2 +/- 1.4% feedforward baseline to 98.9 +/- 0.4% and outperforming GRU recurrence, which only improves the reactive baseline. The DRL policies outperform a classical Carrot+A* controller, which matches their success only at low speeds (1.0 m/s) but collapses to 24.9% at 2.0 m/s. This highlights learned speed adaptation as the decisive advantage of the learned approach. Real-world experiments on a RoboMaster confirm sim-to-real transfer in a cluttered arena, while a maze-like layout exposes remaining failure modes that recurrence helps mitigate.
[LG-19] Physics-Informed Neural Learning for State Reconstruction and Parameter Identification in Coupled Greenhouse Climate Dynamics
链接: https://arxiv.org/abs/2605.02524
作者: Sani Biswas,Khursheed J. Ansari,Md. Nasim Akhtar
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures
Abstract:Physics-informed neural networks (PINNs) have recently emerged as a promising framework for integrating data-driven learning with physical knowledge. In this work, we propose a coupled PINN approach for the joint reconstruction of indoor temperature and humidity dynamics in greenhouse environments, together with simultaneous identification of key model parameters. The method incorporates a reduced-order physically motivated model into the learning process, enabling consistent estimation under sparse and noisy observations. The artificial intelligence contribution lies in the development of a coupled physics-informed neural learning framework that integrates governing dynamical constraints into neural network training, while the engineering application focuses on greenhouse climate state reconstruction and parameter identification. The proposed framework is evaluated on a controlled synthetic benchmark that mimics diurnal forcing conditions. Compared with a purely data-driven neural network baseline, the coupled PINN achieves improved reconstruction accuracy, reducing temperature and humidity errors while maintaining high coefficients of determination. The improvement is particularly pronounced in the humidity channel, where latent moisture dynamics are more difficult to infer from limited measurements. In addition to accurate state reconstruction, the method successfully recovers the dominant physical parameters governing the system dynamics, demonstrating its ability to learn interpretable representations beyond data interpolation. These results highlight the potential of physics-informed learning for greenhouse climate modeling and, more broadly, for data-scarce environmental systems. Comments: 12 pages, 5 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.02524 [cs.LG] (or arXiv:2605.02524v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.02524 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sani Biswas [view email] [v1] Mon, 4 May 2026 12:26:30 UTC (1,591 KB)
[LG-20] Evaluating Tabular Representation Learning for Network Intrusion Detection
链接: https://arxiv.org/abs/2605.02519
作者: Muhammad Usman Butt,Andreas Hotho,Daniel Schlör
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: IEEE International Conference on Cyber Security and Resilience (2026)
Abstract:Classic Network Intrusion Detection Systems (NIDS) often rely on manual feature engineering to extract meaningful patterns from network traffic data. However, this approach requires domain expertise and runs counter to the widely adopted principle of modern machine learning and neural networks: that models themselves should learn meaningful representations directly from data. We investigate whether tabular representation learning techniques can improve intrusion detection performance by automatically learning robust feature representations for NetFlow data. This paper presents a systematic evaluation of state-of-the-art representation learning methods on benchmark NetFlow datasets, comparing against traditional autoencoders and end-to-end transformer baselines. We evaluate learned representations using both supervised classifiers and unsupervised anomaly detectors, with comprehensive hyperparameter exploration for each combination. Our results reveal strong dataset-model dependency, with no single approach consistently dominating across all scenarios. For supervised classification, TabICL achieves the best performance on CIDDS, while autoencoders follow closely and tie with end-to-end transformer models for the best average rank across datasets. Supervised approaches substantially outperform unsupervised anomaly detection methods, where no single combination consistently dominates as optimal choices depend on the dataset. Cross-dataset transfer experiments demonstrate that learned representations can generalize across network environments with appropriate method and classifier selection. However, transfer performance varies substantially depending on the source-target dataset combination, indicating sensitivity to distributional differences between network environments.
[LG-21] MPCS: Neuroplastic Continual Learning via Multi-Component Plasticity and Topology-Aware EWC
链接: https://arxiv.org/abs/2605.02509
作者: Joern Hentsch
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Continual learning systems face a fundamental tension between plasticity – acquiring new knowledge – and stability – retaining prior knowledge. We introduce MPCS (Multi-Plasticity Continual System), a neuroplastic architecture that integrates eleven complementary mechanisms: task-driven neurogenesis, Fourier-encoded inputs, EWC regularization, meta-replay, mixed consolidation, hybrid gating, synapse pruning/regeneration, Hebbian updates, task similarity routing, adaptive growth control, and continuous neuron importance tracking. We evaluate MPCS on MEP-BENCH, a multi-track benchmark spanning 31 tasks across regression, classification, logic, and mixed domains, using a three-dimensional Pareto criterion over task performance (Perf), representation diversity (RD), and gradient conflict rate (GCR). Across 15 ablation configurations (3 seeds x 4 tracks x 2000 epochs), MPCS achieves a Normalized Efficiency Score of 94.2, placing it on the Pareto frontier among 9 of 14 gate-passing systems. Key findings: (i) Fourier encoding is the single most critical component (removal drops Perf by 30.7 pp and fails the MEP gate on 14% of tasks); (ii) global EWC degrades performance (NES = -4.2); topology-local EWC reduces this penalty (NES 90.5-91.8) but does not eliminate it; removing EWC entirely yields MPCS_EFFICIENT, the highest-Perf system – establishing a monotone relationship in the high task-similarity regime (s_bar ~= 0.95): global EWC topology EWC no EWC; (iii) the Pareto status assessment is predictive: removing the two Pareto-dominated components (EWC + Hebbian) jointly yields MPCS_EFFICIENT, which improves Perf by 0.6 pp at 4.7x lower compute cost (127 vs. 602 min), validating the Pareto frontier as an actionable model-compression guide.
[LG-22] Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning
链接: https://arxiv.org/abs/2605.02435
作者: Mehryar Mohri,Jon Schneider,Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen’s inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of \Theta(1/K^2) , which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches to propose a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, proving that by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.
[LG-23] Dueling DDQN-Based Adaptive Multi-Objective Handover Optimization for LEO Satellite Networks
链接: https://arxiv.org/abs/2605.02416
作者: Po-Heng Chou,Chiapin Wang,Chung-Chi Huang,Kuan-Hao Chen
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, 1 table, and submitted to 2026 IEEE Globecom
Abstract:In this paper, we propose a dueling double deep Q-network (DDQN)-based adaptive multi-objective handover framework for LEO satellite networks. The proposed method enables dynamic trade-off learning among throughput, blocking probability, and switching cost under time-varying network conditions. Simulation results demonstrate that the proposed approach consistently outperforms conventional baselines, achieving up to 10.3% throughput improvement and near-zero blocking under typical operating conditions.
[LG-24] Spatial-Temporal Learning-Based Distributed Routing for Dynamic LEO Satellite Networks
链接: https://arxiv.org/abs/2605.02413
作者: Po-Heng Chou,Chiapin Wang,Shou-Yu Chen,Hsiang-Ming Wang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 3 tables, and submitted to 2026 IEEE Globecom
Abstract:In this paper, we propose a spatial-temporal learning-based distributed routing framework for dynamic Low Earth Orbit (LEO) satellite networks, where graph attention networks (GAT) and long short-term memory (LSTM) are integrated within a deep Q-network (DQN)-based architecture to enable distributed and adaptive routing decisions based on local observations. The routing problem is formulated as a partially observable Markov decision process (POMDP) to address partial observability under dynamic topology and time-varying traffic. Simulation results show that the proposed method significantly outperforms conventional and learning-based routing schemes in terms of throughput, packet loss, queue length, and end-to-end delay, while achieving proactive congestion avoidance with up to 23.26% queue reduction. In addition, the proposed approach maintains low computational overhead with negligible carbon emissions, demonstrating its efficiency from a Green AI perspective.
[LG-25] Inducing Permutation Invariant Priors in Bayesian Optimization for Carbon Capture and Storag e Applications
链接: https://arxiv.org/abs/2605.02409
作者: Sofianos Panagiotis Fotias,Vassilis Gaganis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bayesian Optimization is an iterative method, tailored to optimizing expensive black box objective functions. Surrogate models like Gaussian Processes, which are the gold standard in Bayesian Optimization, can be inefficient for inputs with permutation symmetries, as the most common kernels employed are better suited for vector inputs rather than unordered sets of items. Motivated by this issue, we turn to permutation invariant Bayesian Optimization for well placement in Carbon Capture and Storage projects. The high fidelity black box simulator is instructed to operate wells under group control, giving rise to permutation symmetries within injector and producer groups that cannot be exploited with standard GP kernels. In this work, our main contribution is a novel Gaussian Process kernel (GP-Perm) that encodes permutation invariance by comparing sets through a stable divergence between their induced empirical representations, and can be combined with standard kernels for additional vector-valued inputs. As a learned invariant baseline, we also consider a Deep Kernel Learning model (DKL-DS) using the Deep Sets architecture to learn a permutation-invariant embedding. We evaluate the proposed methodology across 8 use cases, comprising seven synthetic benchmarks and one realistic CCS case study (Johansen formation)
[LG-26] Closed-Loop CO2 Storag e Control With History-Based Reinforcement Learning and Latent Model-Based Adaptation
链接: https://arxiv.org/abs/2605.02405
作者: Sofianos Panagiotis Fotias,Vassilis Gaganis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Closed-loop management of geological CO2 storage requires control policies that adapt to uncertain reservoir behavior while relying on observations that are realistically available during operation. This work formulates CO2 injection and brine-production control as a partially observable sequential decision problem and studies deployable deep reinforcement-learning controllers trained with high-fidelity reservoir simulation. We first compare privileged-state, well-only, history-conditioned, masking-curriculum, and asymmetric teacher-student model-free policies in order to quantify the value of temporal well-response information and training-time privileged simulator states. We then evaluate a latent model-based adaptation pipeline that reuses nominal latent dynamics and retunes controllers under known injector failure, leakage-induced dynamics and reward shift, and compartmentalized reservoir connectivity. The results show that history-conditioned policies recover nearly all of the privileged-state performance while using only deployable well-level information, and that latent model-based retuning outperforms direct model-free retuning under the same scenario-specific real-simulator budget in the abnormal operating cases. The proposed framework therefore provides a simulator-budget-aware alternative to repeated online history matching and re-optimization for closed-loop CO2 storage control.
[LG-27] Statistically-Lossless Quantization of Large Language Models
链接: https://arxiv.org/abs/2605.02404
作者: Michael Helcig,Eldar Kurtic,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model quantization has become essential for efficient large language model deployment, yet existing approaches involve clear trade-offs: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and remains achievable at aggressive bitwidths. Second, we formalize the stricter notion of distribution-lossless compression, requiring the quantized model’s next-token distribution to be practically indistinguishable from the original, and propose the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal coupling, as a directly interpretable fidelity metric (for example, EAR = 0.99 indicates 99% agreement). Third, we prove a gamma-squared variance law showing that symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization, making asymmetry necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method with asymmetric quantization and wide bitwidth search, we achieve task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6x relative to FP16 with optimized kernels. Source code is available at this https URL.
[LG-28] Binary Rewards and Reinforcement Learning: Fundamental Challenges
链接: https://arxiv.org/abs/2605.02375
作者: Marc Dymetman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit \beta\to 0 , the filtered model p_:=a(\cdot\mid\mathcalY_1) – the base model conditioned on validity – which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the tilted distribution p_[\beta]\propto a(y),e^v(y)/\beta converges to p_ in forward KL as \beta\to 0 , yet p_* cannot serve as a direct optimization target because \mathrmKL(q,|,p_) is infinite for any full-support policy q . We develop explicit formulas relating the hyperparameter \beta to the more interpretable target validity rate \mu . Under model misspecification – the typical practical regime – the pressure to decrease \beta drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing toward ever fewer as \beta decreases, rather than toward the filtered model. We illustrate this mechanism on a toy autoregressive experiment and discuss how alternative divergences that target p_ directly – as pursued empirically by \citetkruszewski_whatever_2026 – avoid this failure mode by rewarding coverage of p_* 's support rather than concentration on high-validity outputs.
[LG-29] Predicting Post Virality with Temporal Cross-Attention over Trend Signals
链接: https://arxiv.org/abs/2605.02358
作者: Sarvagya Somvanshi,Mohan Xu,Rakhi Chadalavada,Nathan Canera
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
Abstract:Current models for predicting social media virality rely heavily on static textual and structural features, effectively ignoring the highly dynamic nature of trend signals. We study whether real-world attention signals can improve the prediction of social-media virality beyond what post text alone reveals. We introduce \model, an architecture that predicts Reddit post virality by fusing internal platform representations with exogenous temporal signals derived from Wikipedia pageview spikes. We frame virality as a binary classification task that accounts for differences in subreddit scale, labeling posts as viral if they exceed the 90th percentile of per-subreddit engagement and a minimum absolute score threshold. We introduce ViralityNET combines four post-level streams: title embeddings, body embeddings, structural metadata, and learned subreddit embeddings with a cross-attention block that queries a daily sliding-window trends matrix encoding the top-512 Wikipedia spike terms from the preceding seven days. Empirical results suggest that incorporating external attention signals yields consistent gains, outperforming text-only baselines by +0.015 AUC-PR and achieving an overall AUC-ROC of 0.836. Overall, we provide evidence that incorporating external attention signals yields measurable improvements over text-only baselines, highlighting the importance of real-world dynamics in shaping online virality. Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI) Cite as: arXiv:2605.02358 [cs.LG] (or arXiv:2605.02358v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.02358 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-30] ZNO: Stable Rational Neural Operators in the Z-Domain for Discrete-Time Dynamic
链接: https://arxiv.org/abs/2605.02356
作者: Xianli Zhu,Jia Yin
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We introduce the Z-Domain Neural Operator (ZNO), a causal neural operator whose layers are stable low-rank multiple-input multiple-output (MIMO) rational filters parameterized directly in the z -plane. ZNO addresses a limitation of existing operator learning methods, many of which are primarily tailored for continuous-time problems, while a large class of system-identification problems is intrinsically discrete-time. The z -domain form expresses stability as a unit-disk pole constraint and makes learned discrete-time poles directly readable. The model combines low-rank channel mixing, smooth stable pole reparameterization, causal recurrence, and an optional short finite impulse response (FIR) branch in a single z -domain rational recurrent layer. Across controlled discrete system-identification experiments, ZNO’s advantage is most evident when the target dynamics are stable rational systems with lightly damped poles near the unit circle. Under matched parameter budgets, ZNO is not uniformly dominant; however, with validation-selected configurations, the same architecture can achieve the lowest mean error across the controlled tasks. A five-bin difficulty sweep over near-unit-circle / long-memory dynamics shows that ZNO has the lowest mean error across memory regimes, from short (approximately 10 steps) to long (approximately 100-200 steps). On five public nonlinear system-identification benchmarks, ZNO is competitive with neural operator and state-space baselines, achieving the lowest mean error on benchmarks whose dynamics align with stable rational discrete-time filters, while classical or state-space baselines remain preferable on some systems. These results position ZNO as a strong model for stable rational discrete-time dynamics, especially in near-unit-circle and long-memory regimes, but not as a universal replacement for specialized system-identification methods.
[LG-31] A Near-optimal SQ Lower Bound for Smoothed Agnostic Learning of Boolean Halfspaces
链接: https://arxiv.org/abs/2605.02350
作者: Tim Sinen
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the complexity of smoothed agnostic learning of halfspaces on \pm 1^n under the uniform distribution in the model of \citetKM25 where each input coordinate is independently flipped with probability \sigma \in (0, 1/2) . We show that L^1 polynomial regression achieves complexity \tildeO(n^O(\log(1/\varepsilon)/\sigma)) , and prove a nearly matching Statistical Query complexity lower bound of n^\Omega(\log(1+\sigma/\varepsilon ^2)/\sigma) . This complements the recent work of \citetDK26, which established analogous bounds in the continuous setting under Gaussian marginals.
[LG-32] FedPLT: Scalable Resource-Efficient and Heterogeneity-Aware Federated Learning via Partial Layer Training
链接: https://arxiv.org/abs/2605.02337
作者: Ahmad Dabaja,Rachid El-Azouzi
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 40 pages
Abstract:Federated Learning (FL) has gained significant attention in distributed machine learning by enabling collaborative model training across decentralized system while preserving data privacy. Although extensive research has addressed statistical data heterogeneity, FL still faces several challenges, including high communication and computation overheads and severe device heterogeneity, which require further investigation. Prior work has addressed these issues through sub-model training and partial parameter training. However, such methods often suffer from inconsistent parameter distributions across clients, inaccurate global loss estimation, and increased bias and variance. Guided by our empirical analysis, we propose FedPLT (Federated Learning with Partial Layer Training), an innovative and structured partial parameter training approach that exhibits training behavior similar to full model training while assigning client-specific portions of the model according to their communication and computational capabilities. In addition, we evaluate the performance of FedPLT when combined with optimal client sampling under communication constraints. We show that this integration improves FL performance by reducing sampling variance under the same communication budget. Through extensive experiments, we demonstrate that FedPLT achieves performance comparable to, or even surpassing, that of full-model training (i.e., FedAvg), while requiring significantly fewer trainable parameters per client. Moreover, FedPLT outperforms existing methods in highly heterogeneous environments, effectively adapts to client resource constraints, and reduces the number of straggling clients. In particular, FedPLT reduces the number of trainable parameters by 71%-82% while achieving performance on par with full-model training.
[LG-33] Differentiable Kernel Ridge Regression for Deep Learning Pipelines
链接: https://arxiv.org/abs/2605.02313
作者: Jean-Marc Mercier,Gabriele Santin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks dominate modern machine learning, while alternative function approximators remain comparatively underexplored at scale. In this work, we revisit kernel methods as drop-in components for standard deep learning pipelines. We introduce \emphSparse Kernels (SKs), a differentiable, localized, and lazy variant of kernel ridge regression (KRR) that defers training to inference time and reduces to the solution of small local systems. We integrate SKs into PyTorch as modular layers that preserve end-to-end trainability, and we show that they expose three distinct sets of parameters – feature representations, target values, and evaluation points – each of which can be fixed or learned. This decomposition broadens the design space available to practitioners, enabling, in particular, training-free transfer, nonlinear probing, and hybrid kernel-neural models. Across convolutional networks, vision transformers, and reinforcement learning, SK-based modules serve two complementary roles: in some settings, they match the performance of trained neural readouts with substantially less training; in others, they augment existing models and improve their performance when used as additional components. Our results suggest that kernel methods, once made scalable and differentiable, can be readily integrated with deep learning rather than treated as a separate paradigm.
[LG-34] A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
链接: https://arxiv.org/abs/2605.02300
作者: Sanjiv R. Das,Harshad Khadilkar,Sukrit Mittal,Daniel Ostrov,Deep Srivastav,Hungjen Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) problems. Each GBWM problem involves a multiple year scenario over which the investor looks to optimally choose an investment portfolio each year and choose to fulfill all, some, or none of the different financial goals that arise each year. These choices seek to maximize the expected total investor utility obtained from the fulfilled financial goals. By eliminating separate training and optimization for each new investor problem, the MetaRL model in inference mode produces near-optimal dynamic investment portfolio and goal-fulfilling strategies for a new GBWM problem within a few hundredths of a second. This delivers expected utilities that are, on average, 97.8% of the optimal expected utilities (determined via Dynamic Programming). These results are remarkably robust to capital market regime changes, even when training uses only one capital market regime. Further, the MetaRL approach can enable solving problems with larger state spaces where Dynamic Programming becomes computationally infeasible.
[LG-35] Graph Federated Unlearning for Privacy Preservation
链接: https://arxiv.org/abs/2605.02297
作者: Ruotong Ma,Wentao Yu,Qizhou Wang,Jie Yang,Chen Gong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph federated learning (GFL) facilitates decentralized training on distributed graph data while keeping sensitive user information local, aligning with policies such as GDPR and CCPA that grant users the right to freely join or withdraw from learning systems. However, even decentralized, user information can persist after quitting, potentially propagating to central servers and then redistributing to malicious clients. This privacy leakage during user withdrawal, despite its importance, has received seldom attention in GFL. To fill the gap, we explore the potential of machine unlearning (MU) to thoroughly remove user information. However, classical MU methods are known to degrade overall performance, a problem that is exacerbated in GFL due to local message passing and global model collaboration. To this end, we make two adjustments to mitigate this challenge for GFL. First, we ensure unlearning updates that minimally affect overall performance, steering them in directions orthogonal to the gradients from learning other data. Second, we introduce virtual clients, maintained by the central server, to preserve graph topology and global embeddings without recovering information of removed entities. We conduct comprehensive experiments under a representative user-withdrawal scenario and propose a novel membership inference framework to rigorously evaluate and validate the reliability of our privacy preservation. The experimental results demonstrate the effectiveness of our approach, which also surpasses the performance of seven state-of-the-art baseline methods.
[LG-36] Variational Matrix-Learning Fourier Networks for Parametric Multiphysics Surrogates
链接: https://arxiv.org/abs/2605.02280
作者: Xinyu Li,Jianhua Zhang,Liang Chen
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Multiphysics simulation is critical for system-technology co-optimization (STCO) in chiplet-based design, but repeated finite-element solutions of PDE-governed problems are computationally expensive in parametric design exploration. This paper proposes a variational matrix-learning Fourier network (VMLFN) for efficient parametric multiphysics surrogate modeling. VMLFN constructs a log-space sine neural representation with randomly sampled spectral frequencies, frequency-dependent decay regulation, and embedded Dirichlet boundary conditions. With fixed hidden-layer parameters, the output-layer weights are determined by reformulating the governing PDEs into variational weak forms and enforcing the stationarity condition of the resulting energy functional. This converts physics-informed training into a linear matrix-solving problem, requiring only first-order derivatives and avoiding both high-order automatic differentiation and penalty-coefficient tuning. A heuristic frequency-scanning algorithm is further introduced to select a problem-adaptive maximum frequency that covers the dominant spectral range of the target problem. The proposed method is validated on heat conduction, solid mechanics, and Helmholtz wave propagation problems. Results from five benchmark cases demonstrate that VMLFN delivers accurate full-field predictions with substantial speedup over conventional physics-informed neural networks and repeated finite-element simulations.
[LG-37] Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning ICML2026
链接: https://arxiv.org/abs/2605.02263
作者: Yan Jiang,Ruihong Qiu,Zi Huang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 11 figures, ICML 2026
Abstract:Recent diffusion large language models (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a block-based semi-autoregressive generation paradigm. Despite their progress, the fixed-size block generations remain a critical bottleneck for effective and coherent reasoning. 1. From a global perspective, different reasoning tasks would correspond to different optimal decoding block sizes, which makes a ``one-size-fits-all’’ assumption ineffective. 2. Even within a single reasoning task, the rigid block partitioning would break the logical flow and reduce reasoning coherence. Through empirical observations, we reveal that for block-wise entropy, incorrect reasoning exhibits a fluctuating and unsteady trend between blocks, whereas the correctly generated tasks follow a consistent descending trend. Therefore, this paper proposes b1, a novel post-training framework for dLLMs that learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence.b1 integrates seamlessly as a plug-and-play module with existing dLLM’s post-training algorithms. Extensive experiments across various reasoning benchmarks showcase b1’s consistent improvement over existing fixed-size block baselines. Our code has been released at this https URL.
[LG-38] Demographic-Aware Transfer Learning for Sleep Stage Classification in Clinical Polysomnography
链接: https://arxiv.org/abs/2605.02245
作者: S M Asif Hossain,Shruti Kshirsagar
类目: Machine Learning (cs.LG)
*备注: Under review at IEEE SMC 2026
Abstract:Automated sleep stage classification typically employs a single population-agnostic model, disregarding established demographic variations in sleep architecture. Sleep patterns, however, differ substantially across gender, age, and obstructive sleep apnea (OSA) severity, indicating that a onesize-fits all approach may be suboptimal for diverse clinical populations. In this paper, we propose a two stage training strategy based on demographic stratification and transfer learning framework. We first pretrains a convolutional recurrent model on the full population and then fine tunes it independently for demographic subgroups defined by gender, age, and Apnea-Hypopnea Index (AHI) severity according to the AASM clinical standard. Using the DREAMT dataset comprising 100 clinical subjects and 7 PSG channels, we evaluate 37 fine-tuned configurations across single-axis and two-way demographic combinations. Results demonstrate that 35 of the 37 fine-tuned models outperform the baseline, with Cohen’s kappa improvements ranging from 0.9 to 12.9%. These findings indicate that stratified fine tuning tailored to specific patient demographics yields substantially more accurate sleep staging than a single generalized model, offering a practical and clinically grounded paradigm for personalized sleep assessment.
[LG-39] DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
链接: https://arxiv.org/abs/2605.02196
作者: Abdullah Ahmad Khan,Ferdous Sohel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine unlearning aims to remove specified training data to satisfy privacy regulations such as GDPR. However, existing evaluations assume identical precision at unlearning and deployment, overlooking that production LLMs are deployed at low-bit precision. We show that INT4 quantization systematically restores forgotten content even when models pass compliance audits at bfloat16 (BF16), we term this the quantization recovery attack (QRA). We conduct the first systematic study of unlearning robustness under adapter-space INT4 quantization in the NF4+LoRA regime, evaluating seven methods on LLaMA-3-8B-Instruct across TOFU, MUSE-News, and WikiBio-WPU. INT8 is benign; INT4 induces recovery of up to 22x, worsening with dataset difficulty. We identify the FA-RA-Q-INT4 trilemma: no method simultaneously achieves strong forgetting, high utility, and quantization robustness. A dense Pareto sweep reveals a sharp phase transition once robustness is achieved, retaining accuracy collapses regardless of further tuning. To address this, we propose DURABLEUN-SAF (Sharpness-Aware Forgetting), a quantization-aware objective using Straight-Through Estimator gradients through INT4 rounding. DURABLEUN-SAF is the only method to achieve a stable empirical (0.047, BF16, INT8, INT4)- durability certificate: Q-INT4= 0.043 ± 0.002, cert rate= 3/3, versus SalUn’s cert rate= 1/3 at its own published hyperparameters. We call for Q-INT4 to be adopted as a standard evaluation metric alongside FA and RA.
[LG-40] KANs need curvature: penalties for compositional smoothness
链接: https://arxiv.org/abs/2605.02190
作者: James Bagrow
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 14 pages, 6 figures, 1 table
Abstract:Kolmogorov-Arnold networks (KANs) offer a potent combination of accuracy and interpretability, thanks to their compositions of learnable univariate activation functions. However, the activations of well-fitting KANs tend to exhibit pathologically high-curvature oscillations, making them difficult to interpret, and standard regularization penalties do not prevent this. Here we derive a basis-agnostic curvature penalty and show that penalized models can maintain accuracy while achieving substantially smoother activations. Accounting for how function composition shapes curvature, we prove an upper bound on the full model’s curvature relative to the curvature penalty, and use this to motivate richer forms of penalties. Scientific machine learning is increasingly bottlenecked by the trade-off between accuracy and interpretability. Results such as ours that improve interpretability without sacrificing accuracy will further strengthen KANs as a practical tool for both prediction and insight.
[LG-41] Manifold-Constrained Adversarial Training for Long-Tailed Robustness via Geometric Alignment IJCAI2026
链接: https://arxiv.org/abs/2605.02183
作者: Guanmeng Xian,Ning Yang,Philip S. Yu
类目: Machine Learning (cs.LG)
*备注: Accepted by IJCAI 2026
Abstract:Adversarial training is effective on balanced datasets, but its robustness degrades under longtailed class distributions, where tail classes suffer high robust error and unstable decision boundaries. We propose Manifold-Constrained Adversarial Training (MCAT), a unified framework that enforces the semantic validity of adversarial examples by penalizing deviations from class-conditional manifolds in feature space, while promoting balanced geometric separation across classes via an ETF-inspired regularization. We provide theoretical results that link geometric separation to lower bounds on adversarially robust margins, and show that manifold-constrained adversarial risk upperbounds robust risk on high-density semantic regions. Extensive experiments on standard longtailed benchmarks demonstrate consistent improvements in overall, balanced, and tail-class adversarial robustness.
[LG-42] Experience Constrained Hierarchical Federated Reinforcement Learning for Large-scale UAV Teams in Hazardous Environments IJCNN2026
链接: https://arxiv.org/abs/2605.02165
作者: Qinwei Huang,Rui Zuo,Simon Khan,Qinru Qiu
类目: Machine Learning (cs.LG)
*备注: Accepted by the International Joint Conference on Neural Networks (IJCNN 2026), part of WCCI 2026
Abstract:Conventional federated learning assumes that greater learner participation improves training performance, by leveraging abundant, independently generated local data. However, in federated reinforcement learning (FRL) for unmanned aerial vehicle (UAV) teams in hazardous environments where experience generation is severely constrained by safety considerations, energy limitations, and mission duration, this assumption may break. This work introduces Experience-Constrained Hierarchical Federated Reinforcement Learning (EC-HFRL), a framework in which clusters act as federated learning agents, while multiple intra-cluster learners represent parallel learning resources that reuse a shared experience pool. We show that increasing participation does not necessarily improve learning performance. Instead, learning performance is strongly associated with experience reuse strategy and the dominance of key analytically identified gradient transition experiences within a cluster. In particular, minibatch size primarily determines effective replay exposure, while higher intra-cluster participation increases reuse level. Empirical results demonstrate that the performance regimes are strongly associated with the structure of the learning signal, rather than federated aggregation effects, clarifying the limited and secondary role of learner participation in experience-constrained FRL.
[LG-43] H3: A Healthcare Three-Hop Index for Physician Referral Network Prediction
链接: https://arxiv.org/abs/2605.02150
作者: Zhexi Gu,Jiaxin Ying,Xu-Wen Wang,Can Chen
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures, 7 tables
Abstract:Accurate prediction of physician referral links is essential for optimizing care coordination and reducing fragmentation in healthcare delivery. However, existing computational methods, ranging from triadic closure heuristics to graph neural networks, fail to capture the intrinsic properties of physician referral networks, including sparsity, disassortative degree mixing, and hub-dominated topology. Here, we propose H3, a healthcare three-hop index that addresses these limitations by modeling indirect referral pathways through intermediate physicians, with degree-based normalization and a redundancy penalty to mitigate hub-mediated noise. Using Medicare Physician Shared Patient Patterns data, we evaluate H3 under two complementary prediction regimes: within-period prediction, which assesses recovery of contemporaneous referral links under sparse conditions, and cross-period prediction, which tests robustness to temporal shift as referral windows expand. Across both regimes, H3 consistently outperforms classical heuristics and deep learning-based baselines. Unlike black-box neural network approaches, H3 produces fully decomposable predictions traceable to specific intermediary physicians, offering a transparent and deployable solution for referral network completion.
[LG-44] Projection-Free Transformers via Gaussian Kernel Attention
链接: https://arxiv.org/abs/2605.02144
作者: Debarshi Kundu,Archisman Ghosh,Swaroop Ghosh,Vasant Honavar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Self-attention in Transformers is typically implemented as \mathrmsoftmax(QK^\top/\sqrtd)V , where Q=XW_Q , K=XW_K , and V=XW_V are learned linear projections of the input X . We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce \textbfGaussian Kernel Attention (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter \sigma_h , while a single output projection W_O preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate GKA in both vision and language modeling settings. For autoregressive language modeling within the \textttnanochat framework, we implement causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel. At depth 20, a GKA model with 0.42\times the parameters and 0.49\times the total training FLOPs of a standard attention baseline trains stably, exhibits a near-zero train-validation gap, and demonstrates competitive behavior on standard benchmarks, albeit with higher bits-per-byte (BPB) at this compute scale. Overall, GKA provides a minimal, interpretable attention mechanism with an explicit locality scale, offering a dimension in the accuracy-efficiency trade-off for Transformer design.
[LG-45] Personalized Federated Learning for Gradient Alignment
链接: https://arxiv.org/abs/2605.02143
作者: Dongwon Kim,Gyuejeong Lee
类目: Machine Learning (cs.LG)
*备注: 14 pages, 4 figures
Abstract:Personalized federated learning (pFL) aims to adapt models to client specific data distributions, yet it often fails to reliably preserve personalized information. Local training is hindered by high variance gradients induced by limited and heterogeneous client data, while aggregation further distorts client specific optimization directions. To address these challenges, we propose pFLAlign, a gradient alignment framework to maintain client specific information during both local training and aggregation. pFLAlign consists of two complementary mechanisms: one adapts local gradient directions to reduce variance during client side optimization, and the other mitigates aggregation induced distortion by realigning the global model with each client’s personalized direction. Theoretically, we derive pFLAlign from a PAC Bayesian analysis, which reveals how personalized gradient alignment preserves client specific information. Our experiments and ablation studies show that pFLAlign consistently improves personalization performance and training stability, achieving state of the art results.
[LG-46] LUMINA: A Grid Foundation Model for Benchmarking AC Optimal Power Flow Surrogate Learning
链接: https://arxiv.org/abs/2605.02133
作者: Hongwei Jin,Keunju Song,Zeeshan Memon,Yijiang Li,Stefano Fenu,Hongseok Kim,Liang Zhao,Kibaek Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:AC optimal power flow (ACOPF) is foundational yet computationally expensive in power grid operations, driving learning-based surrogates for large-scale grid analysis. These surrogates, however, often fail to generalize across network topologies, a critical gap for deployment on grids not seen during training and for routine operational what-if studies. We introduce LUMINA-Bench, a comprehensive benchmark suite for ACOPF surrogate learning covering multi-topology pretraining, transfer, and adaptation. The benchmark evaluates homogeneous and heterogeneous architectures under single- and multi-topology learning settings using unified metrics that capture both predictive accuracy and physics-informed constraint violations. We additionally compare constraint-aware training objectives, including MSE, augmented Lagrangian, and violation-based Lagrangian losses, to characterize accuracy-robustness trade-offs across settings. Data processing, training, and evaluation frameworks are open-sourced as the LUMINA suite to support reproducibility and accelerate future research on feasibility-aware OPF surrogates.
[LG-47] FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
链接: https://arxiv.org/abs/2605.02125
作者: Yijiang Li,Emon Dey,Zilinghan Li,Krishnan Raghavan,Ravi Madduri,Kibaek Kim
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers that dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation, which (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove the convergence for non-convex objectives at rate \mathcalO(1/\sqrtR) under bounded staleness, and show that the admission controls yield bounded staleness with high probability under queue-prediction error. Real-world cross-facility deployment of FedQueue shows 20.5% improvement over baseline algorithms. Controlled queue simulations demonstrate robust improvement over the baselines; in particular, about 34% reduction in time to reach a target accuracy level under high queue variance and non-IID partitions.
[LG-48] Statistical Consistency and Generalization of Contrastive Representation Learning ICML2026
链接: https://arxiv.org/abs/2605.02116
作者: Yuanfan Li,Xiyuan Wei,Tianbao Yang,Yiming Ying
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026
Abstract:Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is \emphstatistically consistent with optimal ranking. We further establish a \emphcalibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order O(1/m + 1/\sqrtn) and O(1/\sqrtm + 1/\sqrtn) , respectively, where m denotes the number of negative samples and n the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between m and n . Extensive experiments on large-scale vision–language models corroborate our theoretical predictions. Comments: Accepted by ICML 2026 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.02116 [cs.LG] (or arXiv:2605.02116v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.02116 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-49] Geometric and Spectral Alignment for Deep Neural Network II
链接: https://arxiv.org/abs/2605.02111
作者: Ziran Liu,Wei Wang,Jinhao Wang,Pengcheng Wang,Xinyi Sui,Cihan Ruan,Nam Ling,Wei Jiang
类目: Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: 81 pages, 5 figures
Abstract:This paper develops the angular and static-channel component of Geometric and Spectral Alignment for residual Jacobian chains. Starting from Cartan-coordinate rigidity and fitted effective-rank windows, we study how dominant singular subspaces are transported across adjacent layers and how the resulting finite matrices can be displayed in physical channel coordinates. The main results are deterministic, margin-verified results. We bound the error between full interface transport and its dominant-window truncation, add fitted-tail errors so that empirical spectra can be certified against the Gibbs–Cartan tail model, and distinguish source-mode incidence from fully physical input-output channel incidence. Given row groups and active supports, the Physical Alignment Matrix decomposes orthogonally as core plus overlap plus noise. Active-column gaps, pairwise overlap margins, and noise bounds combine into a static certificate radius under which the full transport and the truncated transport induce the same active supports, pairwise incidence graph, SRS sets, hub columns, and core/overlap/noise masks. The finer SC/SA/ST labels of the Invariant Channel Mapping require additional row-energy and profile-correlation margins, stated as explicit perturbation tests. The empirical section reports the matrices and block-energy heatmaps that measure these certificate quantities across CNNs, language models, and vision/diffusion backbones. The figures are interpreted as finite-dimensional measurements; complete membership in the Physical GSA certificate domain requires checking the numerical margin protocol stated in Section 10. Comments: 81 pages, 5 figures Subjects: Machine Learning (cs.LG); Differential Geometry (math.DG) Cite as: arXiv:2605.02111 [cs.LG] (or arXiv:2605.02111v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.02111 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-50] Adversarial Update-Based Federated Unlearning for Poisoned Model Recovery
链接: https://arxiv.org/abs/2605.02110
作者: Wenwei Zhao,Xiaowen Li,Yao Liu,Zhuo Lu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Federated learning (FL) is vulnerable to poisoning attacks, where malicious clients upload manipulated updates to degrade the performance of the global model. Although detection methods can identify and remove malicious clients, the model remains affected. Retraining from scratch is effective but costly, and existing unlearning methods remain unsatisfactory in both effectiveness and efficiency. We propose Federated Adversarial Unlearning (FAUN), a lightweight framework that retains only a short window of malicious clients’ updates and employs adversarial optimization on a proxy dataset to derive updates that eliminate malicious directions. Applying these updates for a few unlearning rounds, followed by benign fine-tuning, enables fast removal of malicious effects and stable recovery. Experiments on three canonical datasets show that FAUN achieves recovery comparable to retraining while requiring far fewer rounds and reduces attack success rates to near zero, confirming FAUN successfully eliminates the contributions of unlearned clients.
[LG-51] Detecting Adversarial Data via Provable Adversarial Noise Amplification
链接: https://arxiv.org/abs/2605.02109
作者: Furkan Mumcu,Yasin Yilmaz
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:The nonuniform and growing impact of adversarial noise across the layers of deep neural networks has been used in the literature, without a formal mathematical justification, to detect adversarial inputs and improve robustness. In this work, we study this phenomenon in detail and present a formal adversarial noise amplification theorem. We specify a set of sufficient conditions under which the adversarial noise amplification is mathematically guaranteed. Based on theoretical observations, we propose a novel training methodology with a custom spectral loss function and a specific architectural design to enhance the amplification signal for detecting adversarial data. Finally, we introduce a new, lightweight detection mechanism that leverages the enhanced amplification signal and operates entirely at inference time. To validate our approach, we demonstrate the detector’s efficacy against both state-of-the-art attacks and a purpose-built adaptive attack, confirming that enhanced amplification can serve as a robust and reliable signal for adversarial defense.
[LG-52] Geometric and Spectral Alignment for Deep Neural Network I
链接: https://arxiv.org/abs/2605.02108
作者: Ziran Liu,Wei Wang,Jinhao Wang,Pengcheng Wang,Xinyi Sui,Cihan Ruan,Nam Ling,Wei Jiang
类目: Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: 41 pages, 1 figure
Abstract:Deep residual architectures are modeled as products of near-identity Jacobians. This paper proves deterministic quotient-geometric estimates for singular spectra of Frobenius-normalized layer factors, emphasizing a normalized top-radial Cartan coordinate and fitted power-law chart. Full-rank factors are mapped from \mathrmGL(d) to the positive cone by A\mapsto A^\top A , then to ordered eigenvalue data. Under Frobenius normalization, exact power-law spectra form a trace-normalized Cartan orbit. This orbit is a Gibbs family on ranks, a Fisher information line, and a Bures–Wasserstein curve with line element d/4 times Fisher information. The main rigidity theorem is a slack-aware margin inequality: interface radial amplitude, non-backtracking slack, and signed residual variation control displacement of the fitted Cartan coordinate. In the exact-chart zero-slack case, a depth- L budget gives exponent drift of order (\log M)/L ; generally, slack and residual increments augment the bound. We separate scalar top-radial from full-Cartan spectral control, which also needs Bures/Hellinger residual variation. We prove approximate-power-law and metric-chart versions, converse lower bounds, Fisher–KL/Bures action estimates, and near-identity expansions for normalized residual chains. Near-identity results verify transport budgets; chart quality remains measurable. Effective rank is a spectral-energy quantile, giving finite-width power-law tail bounds and robust rank-window transition estimates. Empirical static-weight exponent profiles serve as diagnostics; full verification also requires interface budgets, slacks, and residuals for the same operator chain. Comments: 41 pages, 1 figure Subjects: Machine Learning (cs.LG); Differential Geometry (math.DG) Cite as: arXiv:2605.02108 [cs.LG] (or arXiv:2605.02108v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.02108 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ziran Liu [view email] [v1] Mon, 4 May 2026 00:07:24 UTC (377 KB) Full-text links: Access Paper: View a PDF of the paper titled Geometric and Spectral Alignment for Deep Neural Network I, by Ziran Liu and 7 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-05 Change to browse by: cs math math.DG References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[LG-53] Bridging the Gap Between Averag e and Discounted TD Learning
链接: https://arxiv.org/abs/2605.02103
作者: Haoxing Tian,Zaiwei Chen,Ioannis Ch. Paschalidis,Alex Olshevsky
类目: Machine Learning (cs.LG)
*备注:
Abstract:The analysis of Temporal Difference (TD) learning in the average-reward setting faces notable theoretical difficulties because the Bellman operator is not contractive with respect to any norm. This complicates standard analyses of stochastic updates that are effective in discounted settings. Although a considerable body of literature addresses these challenges, existing theoretical approaches come with limitations. We introduce a novel algorithm designed explicitly for policy evaluation in the average-reward setting, utilizing sampling from two Markovian trajectories. Our proposed method overcomes previous limitations by guaranteeing convergence to the unique solution of a properly defined projected Bellman equation. Notably, and in contrast to earlier work, our convergence analysis is uniformly applicable to both linear function approximation and tabular settings and does not involve explicit dimension-dependent terms in its convergence bounds. These results align with what is known to hold in the discounted setting. Furthermore, our algorithm achieves improved dependence on the problem’s condition number, reducing the sample complexity from quartic, as in prior literature, to quadratic scaling, and thus matching the efficiency seen in the discounted setting.
[LG-54] Weight Clipping for Robust Conformal Inference under Unbounded Covariate Shifts
链接: https://arxiv.org/abs/2605.02072
作者: James Wang,Surbhi Goel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conformal prediction (CP) provides powerful, distribution-free prediction sets, but its guarantees rely on the exchangeability of training and test data, which is often violated in practice due to covariate shifts. While weighted conformal prediction (WCP) is designed to handle such shifts, it can suffer from significant undercoverage when the density ratio between the distributions is unbounded and/or must be learned. This is because of both overfitting in learning the density ratio, and high variance in estimating the nonconformity score threshold. To address this, we introduce clipped least-squares importance fitting (CLISF) as a reduced-variance method for density ratio estimation. Specifically, we show that density ratios learned using CLISF, when plugged into WCP, have bounded expected undercoverage. Furthermore, we show that the undercoverage can be corrected by running WCP with a slightly inflated coverage target; crucially, we are able to estimate the required level of inflation from the data. We provide the first theoretical guarantees for weight clipping in conformal inference, achieving dataset-conditional coverage with a sample complexity that does not blow up with the higher moments of the true density ratio – a key limitation of prior work. We verify our results on real-world benchmarks and synthetic data.
[LG-55] DR-SNE: Density-Regularized Stochastic Neighbor Embedding
链接: https://arxiv.org/abs/2605.02060
作者: Maksim Kazanskii
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dimensionality reduction methods such as t-SNE are designed to preserve local neighborhood structure but do not explicitly account for how probability mass is distributed, often leading to distortions of data density. We reformulate dimensionality reduction as the joint alignment of two components: (i) conditional structure, capturing local relationships, and (ii) relative density structure, captured via local density statistics. Based on this perspective, we introduce Density-Regularized SNE (DR-SNE), which augments the stochastic neighbor embedding objective with a density regularization term derived from normalized log-density estimates. Unlike prior approaches such as DensMAP and DenSNE, which rely on local scale consistency, DR-SNE directly aligns normalized density estimates, providing a simple and scale-invariant mechanism for preserving relative density variations. Empirically, DR-SNE improves density preservation while maintaining competitive neighborhood fidelity, and yields gains on density-sensitive tasks such as anomaly detection across multiple datasets. These results suggest that incorporating density information complements geometry-focused objectives in dimensionality reduction.
[LG-56] NeuroViz: Real-time Interactive Visualization of Forward and Backward Passes in Neural Network Training
链接: https://arxiv.org/abs/2605.02044
作者: Reza Rawassizadeh,Tanvi Sharma
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, 6 tables
Abstract:Training neural networks is difficult to interpret, particularly for newcomers. We introduce NeuroViz, an interactive visualization tool that supports real-time exploration of fully connected neural network training. Users can configure network architecture, activation functions, learning rates, and datasets, then observe activations, weight updates, and loss progression. NeuroViz visualizes weight changes in direct correspondence with activation signals in both forward and backward passes, enabling users to distinguish pre- and post-update states within individual epochs and view dynamically updating per-neuron equations. We conduct a comparative user study with 31 participants against six established visualization tools and we achieved the highest usability score (SUS 80.97, in the ‘excellent’ range), with mean rankings of 2.47 for clarity and 2.23 for usefulness (lower is better). Over 70% of participants reported that the visualizations substantially increased their perception of neural network training transparency. The implemented instance is accessible at this https URL.
[LG-57] Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum
链接: https://arxiv.org/abs/2605.02043
作者: Tehila Dahan,Roie Reshef,Sharon Goldstein,Kfir Y. Levy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically attenuate or discard delayed gradients, introducing systematic bias: updates from simpler or faster-to-process samples are overrepresented, while gradients from more complex samples are delayed or suppressed. In contrast, prior approaches to data-dependent delays rely on a Lipschitz assumption that yields suboptimal rates or leave the smooth, convex case unaddressed. We propose a momentum-based asynchronous framework designed to preserve information from delayed gradients while mitigating the effects of staleness. We establish the first optimal convergence rates for data-dependent delays in both convex and non-convex smooth setups, providing a new result for asynchronous optimization under standard assumptions. Additionally, we derive robust learning-rate schedules that simplify hyperparameter tuning in practice.
[LG-58] IJERE: A Novel Threat Intelligence Joint Extraction Model Based on Analyst Expert Knowledge
链接: https://arxiv.org/abs/2605.02041
作者: Inoussa Mouiche,Sherif Saad
类目: Machine Learning (cs.LG)
*备注: 16 pages
Abstract:The extraction of entities and relationships from threat intelligence reports into structured formats, such as cybersecurity knowledge graphs, is essential for automated threat analysis, detection, and mitigation. However, existing joint extraction methods struggle with feature confusion, language ambiguity, noise propagation, and overlapping relations, resulting in low accuracy and poor model performance. This paper presents TIJERE, an innovative joint entity and relation extraction framework that formulates joint extraction as a multisequence labeling representation (MSLR) problem. Specifically, separate sequences are generated for each entity pair. Unlike prior tagging schemes, MSLR integrates expert domain features to enrich positional, contextual, and semantic representations of entities, thereby enhancing feature distinction and classification accuracy. Additionally, TIJERE reduces language ambiguity and enhances domain-specific generalization by leveraging SecureBERT+, a contextual language model fine-tuned on cybersecurity text. This improves both named entity recognition (NER) and relation extraction (RE). This paper also introduces DNRTI-JE, the first publicly available jointly labeled dataset for cybersecurity entity and RE, filling a crucial gap in cyber threat intelligence automation. Empirical evaluations on the curated DNRTI-JE dataset demonstrate that TIJERE achieves state-of-the-art performance, with F1-scores exceeding 0.93 for NER and 0.98 for RE, outperforming existing methods. Together, TIJERE and the standardized benchmarking DNRTI-JE dataset enable high-performance cybersecurity intelligence extraction, with transferable applications in healthcare, finance, and bioinformatics.
[LG-59] Large margin classifier with graph-based adaptive regularization
链接: https://arxiv.org/abs/2605.02027
作者: Vítor M. Hanriot,Turíbio T. Salis,Luiz C.B. Torres,Frederico Coelho,Antonio P. Braga
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted for publication in Pattern Recognition Letters
Abstract:This paper introduces the use of per-class regularization hyperparameters in Gabriel graph-based binary classifiers. We demonstrate how the quality index used for regularization behaves both in the margin region and in the presence of outliers, and how incorporating this regularization flexibility can lead to solutions that effectively eliminate outliers while training the classifier. We also show how it can address class imbalance by generating higher and lower thresholds for the majority and minority classes, respectively. Thus, rather than having a single solution based on fixed thresholds, flexible thresholds expand the solution space and can be optimized through hyperparameter tuning algorithms. Friedman test shows that flexible thresholds are capable of improving Gabriel graph-based classifiers.
[LG-60] owards Systematic Generalization for Power Grid Optimization Problems
链接: https://arxiv.org/abs/2605.02026
作者: Zeeshan Memon,Yijiang Li,Hongwei Jin,Kibaek Kim,Liang Zhao
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures. Preprint, under review
Abstract:AC Optimal Power Flow (ACOPF) and Security-Constrained Unit Commitment (SCUC) are fundamental optimization problems in power system operations. ACOPF serves as the physical backbone of grid simulation and real-time operation, enforcing nonlinear power flow feasibility and network limits, while SCUC represents a core market-level decision process that schedules generation under operational and security constraints. Although these problems share the same underlying transmission network and physical laws, they differ in decision variables and temporal coupling, and prior learning-based approaches address them in isolation, resulting in disjoint models and this http URL propose a learning framework that jointly models ACOPF and SCUC through a shared graph-based backbone that captures grid topology and physical interactions, coupled with task-specific decoders for static and temporal decision-making. Training includes solver supervision with physics-informed objectives to enforce AC feasibility and inter-temporal operational constraints. To evaluate generalization, we assess cross-case transfer on unseen grid topologies for ACOPF and SCUC without retraining, and systematic generalization on the UC-ACOPF problem using unsupervised, physics-based objectives and a power-dispatch consensus mechanism. Experiments across multiple grid scales demonstrate improved performance and transferability relative to existing learning-based baselines, indicating that the model can support learning across heterogeneous power system optimization problems.
[LG-61] Robust and Explainable Divide-and-Conquer Learning for Intrusion Detection
链接: https://arxiv.org/abs/2605.02015
作者: Yan Zhou,Kevin Hamlen,Michael De Lucia,Murat Kantarcioglu,Latifur Khan,Sharad Mehrotra,Ananthram Swami,Bhavani Thuraisingham
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures
Abstract:Machine learning-based intrusion detection requires complex models to capture patterns in high-dimensional, noisy, and class-imbalanced raw network traffic, yet deploying such models remains impractical on resource-constrained devices with limited processing power and memory. In this paper, we present a correlation-aware divide-and-conquer learning technique that decomposes a complex learning problem into smaller, more manageable subproblems. This enables lightweight models as simple as decision trees to be trained on focused subtasks, yielding up to 43.3% higher local accuracy and up to 257 times reduction in model size on real-world network intrusion detection datasets, while also improving adversarial robustness and explainability.
[LG-62] Real-Time Text Transmission via LLM -Based Entropy Coding over Fixed-Rate Channels
链接: https://arxiv.org/abs/2605.01991
作者: Vishnu Teja Kunde,Jean-Francois Chamberland,Krishna R. Narayanan,Jamison Ebert
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Learning, prediction, and compression are intimately connected: a model that accurately predicts the next symbol in a sequence can be coupled with a source coder to compress that sequence near its information-theoretic limit. When tokenized characters arriving at a fixed reading pace are encoded into variable-length codewords and streamed over a fixed-rate channel, a queue forms whose per-token delay depends on the mean and variance of the bit lengths and on the coder’s algorithmic latency. This paper investigates the compression–delay tradeoff that arises when a causal language model serves as the sequential predictor within a predict-then-code architecture for real-time text transmission. Several coding schemes are compared: Shannon (ideal), Huffman, arithmetic coding, rANS at various block sizes, and gzip. The analysis separates algorithmic delay, inherent to the coder, from computational delay, which shrinks as hardware improves. Huffman is the practical choice for over-provisioned channels, with zero algorithmic delay and modest compression overhead. Arithmetic coding achieves near-optimal compression at the cost of decodability delay. Findings are validated across two scales: GPT-2 (124M) and Llama~3.2 (3B), a twenty-five-fold parameter range. This scaling yields an approximately 38% reduction in bits per character, effectively over-provisioning the channel and thereby changing which coder is optimal.
[LG-63] DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
链接: https://arxiv.org/abs/2605.01989
作者: Zechen Ma,Zixi Qu,Jinyan Yi,David Lin,Yashar Ganjali
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Distributed machine learning (ML) training has become a necessity with the prevalence of billion to trillion-parameter-scale models. While prior work has improved training efficiency from the ML perspective at the application layer, it often fails to address transient congestion events at the network layer that introduce severe tail latency and training-time variability, thereby undermining the quality of service (QoS) of distributed ML training systems. Existing network optimizations treat all gradients equally and thus fail to integrate sufficient model-training insights into communication protocol design. In this paper, we present Dynamic Bounded-Loss Protocol (DBLP), a burst-resilient, training-phase-aware, and hardware-agnostic transport protocol that incorporates model-level tolerance properties into gradient communication. By dynamically adjusting gradient loss tolerance across training phases, DBLP reduces overall training time and mitigates tail-latency collapse during transient high-loss events (i.e., microbursts). Compared to the current state-of-the-art solution (baseline), DBLP tolerates significantly higher loss while achieving comparable test accuracy, and reduces end-to-end training time by an average of 24.4% and a maximum of 33.9%. At microburst events, DBLP achieves up to 5.88x single-round communication latency speedups over the baseline, preventing burst-induced tail-latency spikes and maintaining stable training performance. Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI) Cite as: arXiv:2605.01989 [cs.LG] (or arXiv:2605.01989v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01989 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-64] Misclassification Rate and Privacy-Utility Trade-offs in Graph Convolutional Networks via Subsampling Stability
链接: https://arxiv.org/abs/2605.01987
作者: Yexin Zhang,Zhongtian Ma,Qiaosheng Zhang,Zhen Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study differential privacy (DP) in Graph Convolutional Networks (GCNs) through the framework of \textitsubsampling stability. We derive upper bounds on the misclassification rate that depend explicitly on the subsampling probability p_s . Furthermore, we characterize the \textitprivacy–utility trade-off by identifying feasible ranges of p_s ; if p_s is too large, the stability-based privacy condition becomes difficult to satisfy, yielding vacuous guarantees, whereas if it is too small, accuracy deteriorates. Our results provide the first rigorous theoretical framework for understanding subsampling stability in GCNs under DP.
[LG-65] AdamO: A Collapse-Suppressed Optimizer for Offline RL
链接: https://arxiv.org/abs/2605.01968
作者: Nan Qiao,Sheng Yue,Shuning Wang,Ju Ren
类目: Machine Learning (cs.LG)
*备注:
Abstract:Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability. From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one. Further analysis suggests that standard Adam updates can inadvertently distort the parameter geometry, motivating explicit orthogonality constraints to prevent TD error amplification. To this end, we propose AdamO, an Adam-based optimizer with a decoupled orthogonality correction regulated by a strict task-alignment budget. We prove that this design theoretically guarantees worst-case task safety and preserves Adam’s continuous-time dissipative dynamics. Empirically, AdamO is broadly compatible with diverse offline RL baselines, improving stability and returns across a broad suite of benchmarks.
[LG-66] Retrieval with Multiple Query Vectors through Anomalous Pattern Detection
链接: https://arxiv.org/abs/2605.01965
作者: Allassan Tchangmena A Nken,Baimam Boukar Jean Jacques,Miriam Rateike,Celia Cintas,Skyler Speakman
类目: Machine Learning (cs.LG)
*备注:
Abstract:A classical vector retrieval problem typically considers a \emphsingle query embedding vector as input and retrieves the most similar embedding vectors from a vector database. However, complex reasoning and retrieval tasks frequently require \emphmultiple query vectors, rather than a single one. In this work, we propose a retrieval method that considers multiple query vectors simultaneously and retrieves the most relevant vectors from the database using concepts from anomalous pattern detection. Specifically, our approach leverages a set of query vectors Q (with |Q|\geq 1 ), and identifies the subset of vector dimensions within Q that standout (anomalous) from the rest of dimensions. Next, we scan the vector database to retrieve the set of vectors that are also anomalous across the previously identified vector dimensions and return them as our retrieved set of vectors. We validate our approach on two image datasets, a text dataset, and a tabular dataset. Overall, we observe that, across most datasets, larger query sets lead to improved retrieval performance. The improvement is most pronounced when increasing the query sets from 1 to 8, while the gains become smaller beyond that.
[LG-67] Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare
链接: https://arxiv.org/abs/2605.01961
作者: Maheed H. Ahmed,Mahsa Ghasemi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning from human preference data is becoming a useful tool, from fine-tuning large language models to training reinforcement learning agents. However, in most scenarios, the model is trained on the average preference of all human evaluators, which, under large variations of preferences, can be unfair to minority groups. In this work, we consider fairness in dueling bandits, a standard framework for online learning from preference data. We assume that each user has a (potentially distinct) Condorcet winner, which is an arm preferred to every other arm. Using these user-specific Condorcet winners as reference points, we evaluate and score arms according to their performance relative to the corresponding winner. To promote fairness across heterogeneous users, we adopt the well-established Nash Social Welfare objective, which maximizes the product of user utilities, thereby inherently penalizing inequality and preventing the marginalization of any single user. Within this framework, we construct a hard instance to establish a regret lower bound of \Omega(T^2/3\min(K,D)^\frac13) for a time horizon T , K arms, and D users, which, to the best of our knowledge, is the first result quantifying the cost of fairness in dueling bandits with heterogeneous preferences. We then present the Fair-Explore-Then-Commit and Fair- \epsilon -Greedy algorithms with a Condorcet winner identification phase. We further derive their regret upper bounds that match the lower-bound dependence on T up to logarithmic factors.
[LG-68] Pandoras Regret: A Proper Scoring Rule for Evaluating Sequential Search
链接: https://arxiv.org/abs/2605.01936
作者: Gerardo A. Flores,Yash Deshpande,Jannis R. Brea,Ashia C. Wilson
类目: Machine Learning (cs.LG)
*备注:
Abstract:In sequential search, alternatives are tested until the true class is found. Standard proper scoring rules like log loss are local, ignoring the ranking of competitors and misaligning model evaluation with search utility. We show that sequential search induces a pairwise structure that overcomes this. By analyzing the expected cost of optimal search under varying testing costs, we derive Pandora’s Regret: a closed-form, pairwise-additive, and strictly proper scoring rule. Pandora’s Regret both elicits true probabilities and penalizes rank-reversing miscalibrations where distractors outrank the true class. Our construction yields a one-parameter Beta family that balances penalties for rank-swapping versus probability magnitude, while retaining a grounded interpretation as expected search cost. We prove that log loss, accuracy, and macro-F1 rely on implicit decision models misaligned with sequential search. Across 597 MedMNIST models, Pandora-based metrics better predict clinical diagnostic costs than standard alternatives, extending decision-theoretic scoring rule construction to the multiclass setting.
[LG-69] SwiftChannel: Algorithm-Hardware Co-Design for Deep Learning-Based 5G Channel Estimation
链接: https://arxiv.org/abs/2605.01931
作者: Shengzhe Lyu,Yuhan She,Di Duan,Tao Ni,Yu Hin Chan,Chengwen Luo,Ray C. C. Cheung,Weitao Xu
类目: Information Theory (cs.IT); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Accepted for publication in IEEE Transactions on Mobile Computing (TMC). Code: this https URL
Abstract:Channel estimation is crucial in 5G communication networks for optimizing transmission parameters and ensuring reliable, high-speed communication. However, the use of multiple-input and multiple-output (MIMO) and millimeter-wave (mmWave) in 5G networks presents challenges in achieving accurate estimation under strict latency requirements on resource-limited hardware platforms. To address these challenges, we propose SwiftChannel, an algorithm-hardware co-design framework that integrates a hardware-friendly deep learning-based channel estimator with a dedicated accelerator. Our approach employs a convolutional neural network enhanced with a parameter-free attention mechanism, which effectively reconstructs full-resolution spatial-frequency domain channel matrices from low-resolution least squares (LS) estimates. We further develop a multi-stage model compression pipeline combining knowledge distillation, convolution re-parameterization, and quantization-aware training, resulting in substantial model size reduction with negligible accuracy loss. The hardware accelerator, implementing the compressed model and the LS estimator on FPGA platforms using High-level Synthesis (HLS), features a fine-grained pipeline architecture and optimized dataflow strategies. Tested on a Zynq UltraScale+ RFSoC, the accelerator achieves sub-millisecond latency, providing up to 24x speed-up and over 33x improvement in energy efficiency compared to GPU-based solutions. Extensive evaluations demonstrate that the proposed design generalizes not only across various noise levels and user mobilities, but also to a variety of unseen channel profiles, outperforming state-of-the-art baselines. By unifying algorithmic innovation with hardware-aware design, our work presents a future-proof channel estimation solution for 5G MIMO systems.
[LG-70] raining Non-Differentiable Networks via Optimal Transport
链接: https://arxiv.org/abs/2605.01928
作者: An T. Le
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: 52 pages, 20 tables, 9 figures, submitted to Transactions on Machine Learning Research
Abstract:Neural networks increasingly embed non-differentiable components (spiking neurons, quantized layers, discrete routing, blackbox simulators, etc.) where backpropagation is inapplicable and surrogate gradients introduce bias. We present PolyStep, a gradient-free optimizer that updates parameters using only forward passes. Each step evaluates the loss at structured polytope vertices in a compressed subspace, computes softmax-weighted assignments over the resulting cost matrix, and displaces particles toward low-cost vertices via barycentric projection. This update corresponds to the one-sided limit of a regularized optimal-transport problem, inheriting its geometric structure without Sinkhorn iterations. PolyStep trains genuinely non-differentiable models where existing gradient-free methods collapse to near-random accuracy. On hard-LIF spiking networks we reach 93.4% test accuracy, outperforming all gradient-free baselines by over 60~pp and closing to within 4.4~pp of a surrogate-gradient Adam ceiling. Across four additional non-differentiable architectures (int8 quantization, argmax attention, staircase activations, hard MoE routing) we lead every gradient-free competitor. On MAX-SAT scaling from 100 to 1M variables, we sustain above 92% clause satisfaction while evolution strategies drop 8–12~pp. On RL policy search, we match OpenAI-ES on classical control and retain performance under integer and binary quantization that collapses gradient-based methods. We prove convergence to conservative-stationary points at rate O(\log T/\sqrtT) on piecewise-smooth losses, upgraded to Clarke-stationary on the headline architectures and extended to the piecewise-constant regime via a hitting-time bound. These rates match the known zeroth-order query-complexity lower bounds that all forward-only methods inherit. Code is available at this https URL. Comments: 52 pages, 20 tables, 9 figures, submitted to Transactions on Machine Learning Research Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Optimization and Control (math.OC) Cite as: arXiv:2605.01928 [cs.LG] (or arXiv:2605.01928v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01928 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-71] Deep learning-based pavement performance modeling using multiple distress indicators and road work history
链接: https://arxiv.org/abs/2605.01914
作者: Lu Gao,Zhe Han,Yunshen Chen
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:The deterioration of pavement is a complex and dynamic process determined by different factors including material, environment, design, and some other unobserved variables. Accurate predictions of pavement condition can help maximize the use of available resources for pavement management agencies through better coordinated preservation and maintenance activities. This paper uses deep neural networks such as the convolutional neural network (CNN) and the long short-term memory (LSTM) to model the pavement deterioration process. In this paper, pavement condition data and maintenance and rehabilitation history collected by the Texas Department of Transportation over the past 18 years were used. Twenty-one flexible pavement condition indicators, including cracking, rutting, raveling, and roughness, collected from more than 100,000 pavement sections were included in the proposed models. Promising preliminary results were obtained. Case study results show that the proposed CNN model outperforms standard machine learning models in predicting pavement condition values.
[LG-72] How Label Imbalance Shapes Geometry: A General Spectral Analysis of Multi-Label Neural Collapse
链接: https://arxiv.org/abs/2605.01897
作者: Xiaoxuan Ma,Yixuan Yang,Song Li,Xiangyun Hui
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work investigates the phenomenon of Neural Collapse (NC) in multi-label classification, extending its conceptual framework from multi-class learning to general correlated and imbalanced multi-label settings. Although recent studies have identified a ‘‘tag-wise averaging’’ structure for multi-label features, this view relies on implicit assumptions of label balance and combinatorial symmetry. Consequently, it fails to account for the geometrical distortions caused by intrinsic label correlations and data imbalance, which are common in practice. We resolve the multiplicity-one imbalance conjecture raised by Li et al. (2024), showing that higher-multiplicity prototypes obey a class-frequency-weighted synthesis rule rather than uniform averaging. To address this, we propose a rigorous spectral-control framework to analyze the terminal phase of multi-label learning under general imbalanced conditions. We introduce the label covariance spectrum \kappa_m , a scalar controlling the distribution-dependent lower-bound geometry, derived from the second-order moment matrix of the label distribution. Contrary to the averaging perspective, our analysis reveals that the centered label covariance spectrum controls the stability of terminal geometry by quantifying the weakest centered inter-class contrast directions. We prove that the classical Tag-wise Averaging emerges only as a special case under perfect orthogonality. Numerical experiments on synthetic distributions validate our theoretical bounds. This work resolves the scaled-average aspect of the imbalance conjecture and establishes a unifying theoretical framework that extends Neural Collapse to complex, imbalanced multi-label settings.
[LG-73] Robust Conditional Conformal Prediction via Branched Normalizing Flow
链接: https://arxiv.org/abs/2605.01868
作者: Rui Xu,Xingyuan Chen,Wenxing Huang,Minxuan Huang,Weiyan Chen,Sihong Xie,Hui Xiong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conformal prediction (CP) constructs prediction sets with marginal coverage guarantees under the assumption that the calibration and test distributions are identical. However, under distribution shift, existing approaches primarily align marginal conformal score distributions, which is sufficient to preserve marginal coverage but does not control the conditional coverage error at individual test inputs. As a consequence, CP can remain unreliable in regions where the conditional score distributions are mismatched. In this work, we bound the conditional invalidity of CP under distribution shift in terms of the Wasserstein distance between the calibration and test distributions. This result highlights the role of invertible transport in mitigating conditional coverage degradation. Motivated by this insight, we introduce Branched Normalizing Flow (BNF), a two-branch architecture that normalizes a test input to the calibration distribution and transforms the prediction set of the normalized input back to the test distribution while preserving conditional guarantees. Empirically, BNF consistently improves conditional coverage robustness on nine datasets across a wide range of confidence levels.
[LG-74] QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL ICML2026
链接: https://arxiv.org/abs/2605.01862
作者: Xing Lei,Jincheng Wang,Xuetao Zhang,Donglin Wang
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Offline goal-conditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian that violate standard RL assumptions. History-aware sequence models such as Decision Transformer (DT) are a natural fit for long-term dependency modeling, yet pure attention is inefficient and brittle when handling local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local dependencies modeling, the fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, often truncating long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance signal for stitching goal-reaching behaviors from diverse demonstrations. To address these, we propose \textbfQHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that \textbfQHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness for diverse scenarios.
[LG-75] Learning Koopman operators for coupled systems via information on governing equations of subsystems
链接: https://arxiv.org/abs/2605.01835
作者: Tatsuya Naoi,Jun Ohkubo
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures
Abstract:Nonlinear coupled systems are ubiquitous in science and engineering. The analysis and modeling of such systems is challenging due to their high dimensionality and complex interactions among subsystems. In recent years, operator-theoretic methods based on the Koopman operator have attracted attention as a powerful tool for analyzing and modeling nonlinear dynamical systems. Extended dynamic mode decomposition (EDMD) is one of the most popular methods to approximate the Koopman operator. However, EDMD is a purely data-driven method, and it could be unstable and inaccurate for coupled systems under limited data availability. In this paper, we propose a method to learn the Koopman operator for coupled systems using the differential equations governing each subsystem. We also demonstrate its effectiveness through numerical experiments on coupled oscillator systems.
[LG-76] Molecular Representations for Large Language Models
链接: https://arxiv.org/abs/2605.01822
作者: Nicholas T. Runcie,Fergus Imrie,Charlotte M. Deane
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) are increasingly being used to support scientific discovery. In chemistry, tasks such as reaction prediction and structure elucidation require reasoning about the structures of molecules. As such, LLM-based systems for chemistry must interact reliably with molecular structures. Most previous studies of LLMs in chemistry have used SMILES strings or IUPAC names as molecular representations; however, the suitability of these formats has not been systematically assessed. In this work, we introduce MolJSON, a novel molecular representation for LLMs, and systematically compare it with five common chemical formats. We evaluated each representation with GPT-5-nano, GPT-5-mini, GPT-5, and Claude Haiku 4.5 using a set of 78,045 questions spanning translation, shortest path, and constrained generation reasoning tasks. We observed substantial variation across representations in the ability of LLMs to interpret and generate molecular graphs, with MolJSON consistently outperforming existing formats. On translation tasks, GPT-5 achieved 71.0% accuracy when converting IUPAC names to MolJSON, compared with 43.7% when converting the same inputs to SMILES. For constrained generation, GPT-5 reached 95.3% accuracy generating MolJSON, compared with 76.3% for IUPAC and 64.0% for SMILES. As an input format for shortest-path reasoning, GPT-5 successfully answered 98.5% of questions with MolJSON, compared with 92.2% for SMILES and 82.7% for IUPAC, whilst also using fewer reasoning tokens. We observed systematic errors associated with atom count and ring complexity for SMILES strings and IUPAC names, whereas MolJSON was more robust to these failure modes. Our results show that the choice of molecular representation has a material impact on LLM performance, and that explicit molecular graph schemas, such as MolJSON, are a promising direction for LLM-based systems in chemistry.
[LG-77] Skipping the Zeros in Diffusion Models for Sparse Data Generation ICML2026
链接: https://arxiv.org/abs/2605.01817
作者: Phil Sidney Ostheimer,Mayank Nagda,Andriy Balinskyy,Gabriel Vicente Rodrigues,Jean Radig,Carl Herrmann,Stephan Mandt,Marius Kloft,Sophie Fellenz
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2026
Abstract:Diffusion models (DMs) excel on dense continuous data, but are not designed for sparse continuous data. They do not model exact zeros that represent the deliberate absence of a signal. As a result, they erase sparsity patterns and perform unnecessary computation on mostly zero entries. With Sparsity-Exploiting Diffusion (SED), we model only non-zero values, preserving sparsity. SED delivers computational savings while maintaining or improving generation quality by skipping zeros during training and inference. Across physics and biology benchmarks, SED matches or surpasses conventional DMs and domain-specific baselines, while vision experiments provide intuitive insights into the limitations of dense DMs and the benefits of SED.
[LG-78] Beyond ECE: Calibrated Size Ratio Risk Assessment and Confidence-Weighted Metrics
链接: https://arxiv.org/abs/2605.01796
作者: Fernando Martin-Maroto,Nabil Abderrahaman,Gonzalo G. de Polavieja
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Confidence calibration has been dominated by the Expected Calibration Error (ECE), a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs. We show that ECE can remain small even under arbitrarily large overconfidence risk, so we propose Calibrated Size Ratio (CSR) instead, an interpretable metric that equals 1 under perfect calibration, from which we derive the risk probability P_\mathrmrisk that quantifies the statistical evidence for overconfidence. We further argue that overconfidence risk assessment must be complemented by a measure of discriminative value: whether the assigned confidences actively distinguish correct from incorrect predictions. We show that confidence-weighted accuracy \mathrmcwA is the natural such complement, and that confidence-weighting extends to all standard classification metrics. In particular, we prove that the confidence-weighted AUC (cwAUC) captures the information about calibration while the classical AUC cannot. We validate the proposed indicators on several synthetic confidence distributions under multiple controlled calibration profiles and on fifteen real datasets with and without post-hoc calibration. Experiments demonstrate that CSR achieves near-perfect sensitivity and specificity across all tested conditions.
[LG-79] Zero-Shot Safe and Time-Efficient UAV Navigation via Potential-Based Reward Shaping Control Lyapunov and Barrier Functions
链接: https://arxiv.org/abs/2605.01787
作者: Ashik Abrar Naeem,Mohammad Ariful Haque
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Autonomous navigation and obstacle avoidance remain a core challenge of modern Unmanned Aerial Vehicles (UAVs). While traditional control methods struggle with the complexity and variability of the environment, reinforcement learning (RL) enables UAVs to learn adaptive behaviors through interaction with the environment. Existing research with RL prioritizes the mission success at the expense of mission time and safety of UAVs. This study integrates Potential Based Reward Shaping (PBRS) with Control Lyapunov Functions (CLF) and Control Barrier Functions (CBF) to simultaneously optimize mission time and ensure formal safety guarantees. An RL model is trained in a generalized simple environment, then used in complex scenarios incorporating a CLF-CBF-QP filter without further training. Experimental results in simulated environments demonstrate a significant reduction in mission time and outstanding performance in complex environment.
[LG-80] Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms
链接: https://arxiv.org/abs/2605.01778
作者: Tian Xu,Zhilong Zhang,Zexuan Chen,Ruishuo Chen,Yihao Sun,Yang Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adversarial imitation learning (AIL), a prominent approach in imitation learning, has achieved significant practical success powered by neural network approximation. However, existing theoretical analyses of AIL are primarily confined to simplified settings, such as tabular and linear function approximation, and involve complex algorithmic designs that impede practical implementation. This creates a substantial gap between theory and practice. This paper bridges this gap by exploring the theoretical underpinnings of online AIL with general function approximation. We introduce a novel framework called optimization-based AIL (OPT-AIL), which performs online optimization for reward learning coupled with optimism-regularized optimization for policy learning. Within this framework, we develop two concrete methods: model-free OPT-AIL and model-based OPT-AIL. Our theoretical analysis demonstrates that both variants achieve polynomial expert sample complexity and interaction complexity for learning near-expert policies. To the best of our knowledge, they represent the first provably efficient AIL methods under general function approximation. From a practical standpoint, OPT-AIL requires only the approximate optimization of two objectives, thereby facilitating practical implementation. Empirical studies demonstrate that OPT-AIL outperforms previous state-of-the-art deep AIL methods across several challenging tasks.
[LG-81] Anticipation-VLA: Solving Long-Horizon Embodied Tasks via Anticipation-based Subgoal Generation
链接: https://arxiv.org/abs/2605.01772
作者: Zhilong Zhang,Wenyu Luo,Haonan Wang,Yifei Sheng,Yidi Wang,Hanyuan Guo,Haoxiang Ren,Xinghao Du,Yuhan Che,Tongtong Cao,Lei Yuan,Yang Yu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Vision-Language-Action (VLA) models have emerged as a powerful paradigm for embodied intelligence, enabling robots to perform tasks based on natural language instructions and current visual input. However, existing VLA models struggle with long-horizon tasks due to compounding errors. Prior methods decompose tasks into subtasks of fixed granularity, which cannot adapt to the varying complexity of execution states, limiting their robustness in long-horizon tasks. To overcome this, we introduce Anticipation Model, which adaptively and recursively generates future subgoals. This model continuously adapts as the task unfolds, adjusting future subgoals in response to evolving dynamics, facilitating more reliable planning paths. Building on this concept, we propose Anticipation-VLA, a hierarchical VLA model that leverages the anticipation model to generate actionable subgoals that guide VLA policy execution. We implement Anticipation-VLA with finetuning a Unified Multimodal Model (UMM) for high-level subgoal generation and a goal-conditioned VLA policy for low-level action execution. Experiments in both simulated and real-world robotic tasks demonstrate the effectiveness of Anticipation-VLA, highlighting the importance of adaptive and recursive subgoal generation for robust policy execution.
[LG-82] he (Marginal) Value of a Search Ad: An Online Causal Framework for Repeated Second-price Auctions ICML2026
链接: https://arxiv.org/abs/2605.01756
作者: Yuxiao Wen,Zihao Hu,Yanjun Han,Yuan Yao,Zhengyuan Zhou
类目: Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: To appear in ICML 2026
Abstract:Existing auto-bidding algorithms in digital advertising often treat the value of an ad opportunity as the revenue obtained when an ad is shown and/or clicked, and bid accordingly. This can lead to wasteful spending because the true value is the marginal gain from paid exposure: even without winning a sponsored slot, an advertiser may still earn revenue via an organic search result (e.g., on Google or Amazon). Motivated by recent work, we model ad value as a treatment effect–the outcome difference between winning and losing the auction–and study online learning for bidding in second-price (Vickrey) auctions under this causal perspective. We develop algorithms that attain rate-optimal regret under several feedback models. A key ingredient exploits the information revealed by the second-price payment rule, which strictly improves regret relative to analogous learning problems in first-price auctions.
[LG-83] Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions
链接: https://arxiv.org/abs/2605.01752
作者: Youngmin Oh
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study linear dueling bandits in volatile environments characterized by the simultaneous presence of post-serving contexts, delayed feedback, and adversarial corruption. Feedback is subject to unknown stochastic or adversarial delays and a cumulative corruption budget \mathcalC . To address these challenges, we propose \term, which integrates a learned approximator that predicts post-serving contexts from pre-serving information. It further employs an adaptive weighting strategy that clips feature vectors to mitigate the impact of corrupted and delayed observations simultaneously. Under standard regularity conditions and a parametric post-serving mapping, we rigorously establish that our algorithm is delay-regime-agnostic, achieving a regret upper bound of \widetilde\mathcalO(d(\sqrtT + \mathcalC + \mathcalD)) , where d is the total feature dimension and \mathcalD encapsulates the delay complexity. Crucially, our analysis reveals an additive cost structure between corruption and delay, avoiding the multiplicative degradation typical of prior works. We further establish lower bounds that nearly match our upper bounds up to a \sqrtd factor for adversarial delays in the absence of post-serving contexts.
[LG-84] Stable GFlowNets with Probabilistic Guarantees ICML2026
链接: https://arxiv.org/abs/2605.01729
作者: Zengxiang Lei,Ananth Shreekumar,Jonathan Rosenthal,Ruoyu Song,Alvaro A. Cardenas,Daniel J. Fremont,Dongyan Xu,Satish Ukkusuri,Z. Berkay Celik
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to ICML2026
Abstract:Generative Flow Networks (GFlowNets) learn to sample states proportional to an unnormalized reward. Despite their theoretical promise, practical training is often unstable, exhibiting severe loss spikes and mode collapse. To tackle this, we first assess the sensitivity of GFlowNet objectives, demonstrating that a small Total Variation (TV) distance between the learned and target distributions does not preclude unbounded training loss. Motivated by this mismatch, we establish converse guarantees by deriving loss-to-TV bounds that certify global fidelity from bounded trajectory balance losses. Lastly, we propose Stable GFlowNets, an algorithm that leverages our theoretical results to stabilize training, and empirically demonstrate improved training behavior and superior distributional fidelity.
[LG-85] CoAction: Cross-task Correlation-aware Pareto Set Learning
链接: https://arxiv.org/abs/2605.01712
作者: Xinyue Chen,Yingxuan Liang,Yiqin Huang,Chikai Shang,Hai-Lin Liu,Fangqing Gu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICIC 2026 (Oral)
Abstract:Pareto set learning (PSL) is an emerging paradigm in multi-objective optimization that trains neural networks to map preference vectors to Pareto optimal solutions. However, existing PSL methods primarily focus on solving a single multi-objective optimization problem at a time. This limitation not only increases computational costs in multi-objective multitask optimization scenarios by requiring a separate model for each task, but also fails to exploit the inter-task correlations across tasks. To address this, we propose a Cross-tAsk correlation-aware Pareto Set Learning (CoAction) framework, which leverages task-aware transformer to handle multiple tasks simultaneously. Specifically, by assigning task-specific embedding vectors to individual tasks, the model effectively distinguishes between tasks while facilitating knowledge sharing among them. We utilize a Transformer encoder as the backbone architecture to leverage its self-attention mechanism for capturing complex task dependencies. The proposed approach is evaluated on comprehensive multitask test suites covering both benchmark problems and real-world applications, demonstrating effectiveness and competitive performance in Hypervolume, Range, and Sparsity.
[LG-86] oward Resilient 5G Networks: Comparative Analysis of Federated and Centralized Learning for RF Jamming Detection
链接: https://arxiv.org/abs/2605.01705
作者: Samhita Kuili,Mohammadreza Amini,Burak Kantarci
类目: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 9 figures, accepted to 2026 IEEE International Conference on Cyber Security and Resilience (CSR)
Abstract:Jamming attacks are proliferating and pose a significant threat to the security of 5G and beyond networks. These attacks target 5G radio frequency (RF) domain and can disrupt the communication in wireless networks. While conventional machine learning and deep learning approaches demonstrate its potential for jamming detection, they typically require centralized data collection, compromising the privacy of user equipment (UEs). This work proposes a federated learning (FL)-based jamming detection framework that operates on over-the-air In-phase and Quadrature (IQ) samples extracted from Synchronization Signal Blocks (SSBs) in the RF domain. The framework enables collaborative model training across multiple UEs without sharing raw RF signal data. We adopt Federated Averaging (FedAvg) algorithm to train a 1D convolutional neural network (1DCNN) for effective detection of attacks. Numerical results demonstrate that the proposed FL framework achieves 97% accuracy and 97% F1-score, outperforming centralized baselines including MLP, 1DCNN, SVM, and logistic regression, while preserving the data privacy of all participating UEs
[LG-87] Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
链接: https://arxiv.org/abs/2605.01702
作者: Sejun Park,Yeachan Park,Geonho Hwang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm D^\mathttAD . We first show that given a floating-point function \phi (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network f and D^\mathttAD(\phi\circ f) , respectively. We further extend this result: given \phi_1,\dots,\phi_n , D^\mathttAD(\phi_i\circ f) can simultaneously represent arbitrary gradients while f represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., \mathrmReLU , \mathrmELU , \mathrmGeLU , \mathrmSwish , \mathrmSigmoid , and \mathrmtanh .
[LG-88] Stability and Generalization for Decentralized Markov SGD IJCAI2026
链接: https://arxiv.org/abs/2605.01701
作者: Jiahuan Wang,Ziqing Wen,Ping Luo,Dongsheng Li,Tao Sun
类目: Machine Learning (cs.LG)
*备注: To appear in IJCAI 2026
Abstract:Stochastic gradient methods are central to large-scale learning, yet their generalization theory typically relies on independent sampling assumptions. In many practical applications, data are generated by Markov chains and learning is performed in a decentralized manner, which introduces significant analytical challenges. In this work, we investigate the stability and generalization of decentralized stochastic gradient descent (SGD) and stochastic gradient descent ascent (SGDA) under Markov chain sampling. Leveraging a stability-based framework, we characterize how Markovian dependence and decentralized communication jointly influence generalization behavior. Our analysis captures the effects of network topology, Markov chain mixing properties, and primal-dual dynamics. We establish non-asymptotic generalization bounds for both algorithms, extending existing results on Markov stochastic gradient methods to decentralized and minimax settings.
[LG-89] Complex Diffusion Maps with ω-Parameterized Kernels Revealing Inherent Harmonic Representations
链接: https://arxiv.org/abs/2605.01691
作者: Tongzhen Dang,Weiyang Ding,Michael K. Ng
类目: Machine Learning (cs.LG)
*备注: 27 pages main text, 13 pages appendix, 9 figures, 2 tables. Submitted to IEEE TPAMI. Code will be made publicly available upon acceptance
Abstract:In this paper, we propose Complex Diffusion Maps (CDM), a novel diffusion mapping framework that aims to reveal the dominant complex harmonics of high-dimensional data. Inspired by the local Gaussian kernel relevant to the heat equation and the nonlocal Schrödinger kernel relevant to the Schrödinger equation, we propose a unified family of \omega -parameterized complex-valued kernels for the trade-off between local and nonlocal connections. We establish the theoretical foundation based on the operator spectrum theory, where the corresponding diffusion operator, diffusion distance, and complex harmonic maps are well-defined. An optimization-based interpretation of the maps is also developed, aiming to preserve angular structure in the complex diffusion space rather than relying solely on real-valued magnitude. We extensively evaluate CDM on both synthetic and real-world datasets. The complex-valued kernel amplifies differences among easily confusable samples, improving discriminative power over both linear and nonlinear methods based on real-valued kernels. CDM remains robust in high-noise settings, yielding a clearer eigengap that enhances spectral separation. For resting-state fMRI data, CDM captures more strongly correlated and nonlocal spatiotemporal dynamics. Without task-specific tuning, CDM achieves competitive performance on a public EEG sleep dataset, while maintaining high computational efficiency compared with both traditional machine learning and deep neural network approaches, highlighting its generality and practical value.
[LG-90] Benchmarking Single-Pose Docking Consensus Rescoring and Supervised ML on the LIT-PCBA Library: A Critical Evaluation of DiffDock AutoDock-GPU GNINA and DiffDock-NMDN
链接: https://arxiv.org/abs/2605.01681
作者: Youssef Abo-Dahab,Xiaoiang Xiang,Xiaoiang Xiang,Xiaoiang Xiang
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Virtual screening performance depends heavily on the chosen docking and scoring methods. Recent AI-based tools such as DiffDock and NMDN have reported strong benchmark results, but their practical utility on realistic, experimentally-derived datasets remains unclear. Here we perform a large-scale evaluation on the LIT-PCBA library (15 targets, 578,295 ligand-target pairs with experimentally confirmed actives and inactives). We compare AutoDock-GPU and DiffDock for pose generation, followed by rescoring with GNINA and NMDN. We further evaluate rank-based consensus strategies and supervised machine learning models trained on docking features. GNINA rescoring of AutoDock-GPU poses (AutoDock-GNINA) emerged as the strongest single method with a median EF1% of 2.14. DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1. Carefully designed consensus ranking improved robustness but did not surpass the best single scorer. Supervised ML re-ranking delivered the largest gains, achieving a median EF1% of 4.49 (+110% over AutoDock-GNINA). Our results highlight that even the best classical+ML hybrid workflows provide only modest early enrichment on realistic benchmarks. We conclude that no single docking method dominates across targets and that rigorously validated, cost-effective combinations with supervised re-ranking currently offer the most practical value for virtual screening. Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM) Cite as: arXiv:2605.01681 [cs.LG] (or arXiv:2605.01681v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01681 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-91] owards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning ICML2026
链接: https://arxiv.org/abs/2605.01663
作者: Sungyoung Lee,Dohyeong Kim,Eshan Balachandar,Zelal Su Mustafaoglu,Keshav Pingali
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: ICML 2026
Abstract:We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that utilizes only a single flow policy iteration and requires only a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at this https URL.
[LG-92] Geospatial foundation-model embeddings improve population estimation unevenly across space and scale
链接: https://arxiv.org/abs/2605.01650
作者: Wenbin Zhang,Eimear Cleary,Francisco Rowe,Somnath Chaudhuri,Maksym Bondarenko,Shengjie Lai,Andrew J. Tatem
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reliable subnational population estimates are essential for applications, yet remain difficult where censuses are sparse, outdated or spatially coarse. Existing population-mapping workflows rely on hand-built geospatial covariates, such as settlement extent, night-time lights, and environmental conditions, which must be assembled and harmonised across scales and geographies. Geospatial foundation models offer an alternative by learning reusable representations of place from more multifaceted and heterogeneous data sources. Here, we benchmark Population Dynamics Foundation Model (PDFM) embeddings against the harmonised geospatial covariates for subnational population estimation in Brazil, Nigeria and the United States. Under geographically structured validation, PDFM increased predictive fit by a median of 20.1% (IQR: 10.0-33.2%, across country-model comparisons) reduction in unexplained variance, and reduced Kullback-Leibler divergence by 23.2% (9.2-26.2%). However, these gains were uneven. PDFM was most advantageous where the geospatial covariates weakly characterised settlement context, such as larger and less-developed subnational areas. Moreover, PDFM performance was scale-coupled with embeddings providing less flexible transfer across spatial aggregations than geospatial covariates. These findings showed that geospatial foundation-model representations of place can improve population estimation in data poor settings, but their benefits break down predictably under spatial scale mismatch, revealing a fundamental limitation of current geospatial AI.
[LG-93] Adaptive Pluralistic Alignment: A pipeline for dynamic artificial democracy
链接: https://arxiv.org/abs/2605.01642
作者: Rachel Freedman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Prevailing alignment methods target a fixed set of preferences and therefore risk forcing value lock-in as societal norms evolve over time. We introduce Adaptive Pluralistic Alignment (APA), a modular pipeline for updating pluralistically aligned AI systems to track evolving values and avoid value lock-in without repeating costly pretraining or large-scale data collection. APA has three stages: (1) learning compact personalized reward models via low-rank reward basis decomposition, (2) using these models as a jury that collectively selects among candidate outputs through social-choice-theoretic voting, and (3) efficiently adapting the jury over time by fitting new annotator weights over the fixed reward bases as values shift. The resulting system is efficient, explainable, steerable, and modular. We implement a proof-of-concept instantiation using the PRISM multi-user alignment dataset and simulated historical annotators, and provide preliminary analysis showing that jury composition and the choice of voting rule can substantially affect outcomes, particularly when jury preferences are heterogeneous. We provide full code and resulting preference datasets at this https URL.
[LG-94] he Banach-Butterfly Invariant: Influence-Adaptive Walsh Geometry for Ternary Polynomial Threshold Functions
链接: https://arxiv.org/abs/2605.01637
作者: Gorgi Pavlov
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
*备注: 21 pages, 3 figures. Theory paper; LLM-application companion in preparation. Code, certificates, and 616,126 NPN-canonical n=5 representatives in supplementary repository
Abstract:We introduce the Banach-Butterfly Invariant (BBT), an influence-adaptive Banach geometry on the Walsh-Hadamard butterfly factorization. For a Boolean function f:-1,+1^n\to-1,+1\ with coordinate influences \mathrmInf_\ell(f) , BBT assigns exponent p_\ell = 1+\mathrmInf_\ell(f) to butterfly layer \ell , yielding the contraction invariant \mu(f)=\prod_\ell 2^-\mathrmInf_\ell/(1+\mathrmInf_\ell) . We prove a Jensen lower bound \log_2\mu(f) \ge -I(f)/(1+I(f)/n) and that \mu is strictly Schur-convex in the influence vector (modulo permutation), giving scaling classes \mu\sim 2^-n/2 (parity), 2^-\Theta(\sqrtn) (majority), 2^-1/2 (dictators). \log_2\mu is rational but not polynomial in the Fourier coefficients while \mu is algebraic, and \mu separates functions with identical total influence (122 pairs at n=3 ). Using the certified n \le 4 ternary Walsh-threshold universe from a companion synthesis manuscript as a finite testbed, we compute exact MILP minimum-support certificates for all 65,536 Boolean functions at n=4 (mean 6.42, max 9, all-odd by a parity argument) and on 10,000 of the 616,126 NPN-canonical representatives we enumerate at n=5 (matching OEIS A000370). Conditional Spearman \rho(\mu,|\mathrmsupp|) at fixed total influence is +0.571 in the largest stratum at n=4 but reverses to -0.38 at n=5 under both function-uniform and NPN-canonical sampling: \mu is a valid Schur-convex concentration invariant, not a universal monotone predictor of minimum support across n . A companion application paper validates a real-valued WHT activation-energy proxy inspired by this theory on five pretrained LLMs at W2A16, cutting wikitext-2 perplexity by 15-58% versus vanilla auto-round; the transfer from Boolean theory to the real-valued proxy is qualitative, not formal. Comments: 21 pages, 3 figures. Theory paper; LLM-application companion in preparation. Code, certificates, and 616,126 NPN-canonical n=5 representatives in supplementary repository Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Combinatorics (math.CO) ACMclasses: F.2.2; G.2.1; F.1.3 Cite as: arXiv:2605.01637 [cs.LG] (or arXiv:2605.01637v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01637 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-95] Chebyshev-Augmented One-Shot Transfer Learning for PINNs on Nonlinear Differential Equations ICLR2026
链接: https://arxiv.org/abs/2605.01634
作者: Yiqi Rao,Pavlos Protopapas
类目: Machine Learning (cs.LG)
*备注: 18 pages, 4 figures, 9 tables, accepted to ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations
Abstract:Physics-Informed Neural Networks (PINNs) offer a flexible paradigm for solving differential equations by embedding governing laws into the training objective. A persistent limitation is instance specificity: standard PINNs typically require retraining for each new forcing term, boundary/initial condition, or parameter setting. One-shot transfer learning (OTL) addresses this bottleneck for linear operators by freezing a pretrained latent representation and computing optimal output weights in closed form, but for nonlinear problems closed-form adaptation is generally unavailable because the loss is nonconvex in the output layer. In this paper we substantially broaden the class of nonlinearities amenable to one-shot PINN transfer by combining OTL with Chebyshev polynomial surrogates. We approximate general smooth weakly nonlinear terms by truncated Chebyshev expansions over a prescribed solution range, yielding a polynomial nonlinearity that can be handled by a perturbative decomposition into linear subproblems. A multi-head PINN learns a reusable latent space associated with the dominant linear operator; at test time, solutions to new instances are obtained via a sequence of closed-form linear solves in the output layer, without retraining the network body. We provide a unified derivation of the framework for ODEs and PDEs and demonstrate accuracy and fast online adaptation on nonlinear benchmarks, including non-polynomial and singular ODE nonlinearities as well as a reaction-diffusion PDE with saturating kinetics, demonstrating the method’s utility in many-query regimes. Comments: 18 pages, 4 figures, 9 tables, accepted to ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations Subjects: Machine Learning (cs.LG) MSC classes: 68T07 ACMclasses: I.2 Cite as: arXiv:2605.01634 [cs.LG] (or arXiv:2605.01634v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01634 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-96] Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy
链接: https://arxiv.org/abs/2605.01632
作者: Eleanor Quint
类目: Machine Learning (cs.LG)
*备注:
Abstract:Models that are indistinguishable on in-distribution data can behave very differently under distribution shift. We introduce Perturb-and-Correct (PC), a post-hoc method for constructing epistemically diverse predictors from a single pretrained network. PC applies random hidden layer perturbations with a least-squares correction in the subsequent affine layer, producing predictors that agree on calibration data while remaining free to disagree away from it. We analyze this mechanism through the post-correction residual and its first-order sensitivity: the residual is controlled near the calibration distribution by a leverage term, while corrected sensitivity grows as inputs deviate from the calibration geometry. Empirically, PC achieves a strong ID/OOD tradeoff across MuJoCo dynamics prediction and CIFAR-10 OOD detection, matching or outperforming standard post-hoc baselines while requiring only a single pretrained model. Our findings highlight the potential in further exploiting overparameterization as a strength of deep learning models.
[LG-97] Importance-Guided Basis Selection for Low-Rank Decomposition of Large Language Models
链接: https://arxiv.org/abs/2605.01627
作者: Daniel Agyei Asante,Ernie Chang,Yang Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Low-rank decomposition is a compelling approach for compressing large language models, but its effectiveness hinges on selecting which singular-vector bases to retain for a target task. Existing methods such as Basel adapt singular-value coefficients on downstream data and prune bases with small re-learned magnitudes, a heuristic that can be misaligned with task performance because it ignores the local geometry of the loss landscape. We present Basis Selection with Importance (BSI), a principled low-rank compression framework that ranks and prunes bases by directly estimating the expected loss increase incurred when each basis is removed. BSI derives a derivative-based importance score from a second-order Taylor expansion of the task loss with respect to singular values, combining first-order sensitivity and second-order curvature to quantify pruning impact. To make this criterion practical for LLMs, we develop an efficient Hessian-diagonal estimator by adapting the Hutchinson randomized-probing method to loss curvature with symmetric parameter perturbations. We provide a comprehensive theoretical analysis, including loss-increase bounds under basis pruning, explicit propagation of Hessian-diagonal estimation error into these bounds, variance characterization tied to the Hessian spectrum, high-probability sample-complexity guarantees for achieving a target estimation accuracy, and guidance on perturbation intensity. Extensive experiments on mathematical reasoning benchmarks demonstrate that BSI consistently outperforms state-of-the-art low-rank decomposition baselines, with especially strong improvements under deep compression.
[LG-98] PRIME: Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies
链接: https://arxiv.org/abs/2605.01625
作者: Viet Thanh Duy Nguyen,John K. Johnstone,Truong-Son Hy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Proteins are inherently multiscale physical systems whose functional properties emerge from coordinated structural organization across multiple spatial resolutions, ranging from atomic interactions to global fold topology. However, existing protein representation learning methods typically operate at a single structural level or treat different sources of structural information as parallel modalities, without explicitly modeling their hierarchical relationships. We introduce PRIME (Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies), a unified framework that models proteins as a nested family of five physically grounded structural graphs spanning surface, atomic, residue, secondary-structure, and protein levels. Adjacent levels are connected through deterministic, physics-informed assignment operators, enabling bidirectional information exchange via bottom-up aggregation and top-down contextual refinement. Experiments on standard protein representation learning benchmarks demonstrate strong and competitive performance across diverse tasks, with particularly notable gains on the Fold Classification benchmark, where PRIME outperforms the strongest geometric GNN baseline by margins of 13.80 and 18.30 points on the harder Superfamily and Fold splits, and achieves a state-of-the-art accuracy of 84.10% on Reaction Class prediction, surpassing all baseline methods, including ESM. Ablation studies confirm that each structural level contributes complementary and non-redundant information, and adaptive cross-attention analysis reveals that PRIME autonomously identifies the most task-relevant structural resolutions at prediction time. Our source code is publicly available at this https URL
[LG-99] Hybrid Quantum Reinforcement Learning with QAOA for Improved Vehicle Routing Optimization
链接: https://arxiv.org/abs/2605.01574
作者: T. Satyanarayana Murthy,B. Swathi Sowmya,Santhosh Voruganti,Sai Varshini Giridi,Chaitanyya Pratap Agarwal,Vanteddu Akshitha
类目: Machine Learning (cs.LG)
*备注:
Abstract:Vehicle Routing Problem (VRP) is one of the most complex NP-hard combinatorial optimization problem in transportation and logistics that requires a dynamic solution approach. In this paper we present a new hybrid approach that combines the Quantum Approximate Optimization Algorithm (QAOA) into the QRL policy network, instead of the usual variational layers, QAOA mixing and cost Hamiltonian layers. This enhancement enables the agent to exploit problem specific particular quantum correlations when learning policies, and so richer exploration of the routing solution space. The QAOA-augmented QRL framework shows quicker convergence in training and can tackle larger VRP instances that are beyond the reach of Grover’s Adaptive Search (GAS) and Quantum Reinforcement Learning (QRL) approaches. Experiments on standard VRP instances demonstrate better solutions, fewer episodes to converge and good memory usage on near term quantum hardware simulators. These findings demonstrate QAOA- integrated QRL as a viable approach to scalable, high quality quantum-assisted combinatorial optimization.
[LG-100] Evaluating LLM s on Large-Scale Graph Property Estimation via Random Walks ACL2026
链接: https://arxiv.org/abs/2605.01484
作者: Sunil Kumar Maurya,Xin Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ACL 2026 Main Conference
Abstract:With the rapidly improving reasoning abilities of Large Language Models (LLMs), there is also a rising demand to use them in a wide variety of domains. This brings about the need to carefully evaluate the limits of the capabilities of these models with various tests and benchmarks. Graph structures are ubiquitous in real-world data, and are often used to represent and analyze relationship patterns within data. Many benchmarks have already been proposed in the graph literature to test the reasoning ability of LLMs to follow and execute graph algorithms. However, due to the limited context length of LLMs, these benchmarks consist of very small graphs. In real-world data, the size of graphs can be significantly larger, and in many cases, not fully accessible. In this paper, we examine a class of problems that arises with very large graphs having limited accessibility. We propose a large graph benchmark dataset, EstGraph, and introduce four distinct tasks designed to estimate large graph properties. We evaluate the reasoning abilities of LLMs on these tasks using a wide variety of graph datasets. In addition, we provide task-specific prompt constructions based on random walk sampling of large graphs (up to millions of nodes) that effectively convey sufficient information to LLMs within the limits of context length.
[LG-101] Barriers to Counterfactual Credit Attribution for Autoregressive Models ICML2026
链接: https://arxiv.org/abs/2605.01425
作者: Aloni Cohen,Chenhao Zhang
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Generative AI disrupts the practice of giving credit to work that came before. Ideally, a generative model would give credit to any work on which its output depends in a significant way. \emphCounterfactual credit attribution (CCA) is a technical condition formalizing this goal–a relaxation of differential privacy–recently introduced by Livni, Moran, Nissim, and Pabbaraju [2024] who studied it in the PAC learning setting. We initiate the study of CCA generative models. Specifically, we consider autoregressive models giving credit to a deployment-time dataset (e.g., a RAG database). We uncover barriers to two natural approaches to CCA autoregressive models. First, we show that imposing CCA on the underlying next-token predictor does not guarantee that the model is CCA: CCA does not compose autoregressively (unlike DP). Second, we consider a different approach to building CCA models which we call \emphretrofitting. Retrofitting takes a model that does not attribute credit, and adds credit onto it. We prove a lower bound for CCA retrofitting under a weak optimality requirement. Given black-box access to the starting model, retrofitting requires query complexity exponential in the length of the model’s outputs. Comments: ICML 2026 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.01425 [cs.LG] (or arXiv:2605.01425v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01425 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-102] Rethinking Multi-Label Node Classification: Do Tuned Classic GNNs Suffice?
链接: https://arxiv.org/abs/2605.01403
作者: Yuxuan Xiao,Shengzhong Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-label node classification (MLNC) has recently been addressed by increasingly complex label-aware designs that explicitly model node-label interactions and inter-label this http URL, it remains unclear whether the advantages of these methods truly stem from their specialized designs, or simply from insufficiently optimized baselines. In this paper, we revisit MLNC from a strong-baseline perspective and investigate whether carefully tuned classic full-graph GNNs can already serve as strong solutions to this task. We systematically study several representative backbones, including GCN, SSGConv, and GCNII, and optimize them using standard yet effective techniques such as normalization, dropout, and residual connections. Experiments on five representative benchmark datasets show that our tuned baselines outperform representative specialized methods on four datasets and achieve state-of-the-art performance in multiple settings. These results indicate that careful tuning of classic backbones is a highly influential but often overlooked factor in MLNC, and highlight the need for more rigorous strong-baseline evaluation in future research on multi-label graph learning.
[LG-103] Sequential Learning and Catastrophic Forgetting in Differentiable Resistor Networks
链接: https://arxiv.org/abs/2605.01383
作者: Maniru Ibrahim
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computational Physics (physics.comp-ph)
*备注:
Abstract:Differentiable physical networks provide a simple setting in which learning can be studied through the interaction between trainable parameters and physical equilibrium constraints. We investigate sequential learning in differentiable resistor networks governed by Kirchhoff’s laws. Although individual input–output mappings can be learned by gradient-based adjustment of edge conductances, sequential training on conflicting tasks produces catastrophic forgetting. We show that forgetting is controlled by task conflict and by the degree of adaptation to the new task. Uniform anchoring and normalised gradient-weighted anchoring reduce forgetting only by increasing the final loss on the new task, giving a clear forgetting–adaptation trade-off. We also show that forgetting is associated with localised conductance changes on high-current edges, giving a physical interpretation as reconfiguration of dominant transport pathways. Broader random-task ensembles show that the strongest forgetting occurs when the second task reverses the output ordering imposed by the first task. Finally, comparisons across Erdős–Rényi, small-world, scale-free, and random-geometric graph ensembles show that topology changes the forgetting–adaptation balance. These results position differentiable resistor networks as compact, physically interpretable testbeds for studying continual learning in tunable matter.
[LG-104] oward a foundational thermal model for residential buildings
链接: https://arxiv.org/abs/2605.01364
作者: Ting-Yu Dai,Kingsley Nweye,Dev Niyogi,Zoltan Nagy
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:The building energy community lacks a foundational thermal model, i.e., a single pretrained model capable of generalizing across diverse buildings, climates, and control strategies without building-specific calibration. Achieving this vision requires architectural principles that capture universal thermal dynamics rather than memorizing building-specific patterns. We take a step toward this goal by presenting a physics-informed transformer architecture that embeds domain knowledge, e.g., derivative enrichment and Euler-based numerical integration, into a decoder-only framework. We incorporate static building features extracted from simulation models and employ Rotary Position Embedding attention to capture temporal dependencies. Evaluated on the CityLearn dataset spanning 247 residential buildings across three climate zones, our model achieves one-step prediction accuracy (RMSE of 0.30°C in Texas, 0.29°C in Vermont) while outperforming both traditional baselines and fine-tuned Time-Series Foundation Models. We also demonstrate zero-shot transferability: models trained on as few as two buildings generalize to unseen buildings and climate zones without fine-tuning. Despite the limitation of simulated residential buildings, our results establish physics-informed architectural principles as a promising foundation for universal building thermal models.
[LG-105] Decision-Focused Learning via Tangent-Space Projection of Prediction Error
链接: https://arxiv.org/abs/2605.01361
作者: Junhyeong Lee,Sangjin Jin,Yongjae Lee
类目: Machine Learning (cs.LG)
*备注: 20 pages, 4 figures, 8 tables
Abstract:Decision-Focused Learning (DFL) trains predictors to improve downstream decision quality, but computing regret gradients typically requires differentiating through solvers or relying on surrogate losses, which can be computationally expensive or deviate from the true objective. We show that, under standard regularity with locally stable active constraints, the regret gradient admits a closed-form geometric characterization, equivalent to the prediction error projected onto the tangent space of active constraints, scaled by local curvature. This reveals that regret gradients can be obtained by filtering decision-irrelevant components from the MSE gradient, providing a simpler and more direct alternative to existing approaches. Based on this, we propose PEAR (Projected Error As Regret-gradient), which computes regret gradients via a reduced linear system over active constraints, avoiding differentiation through solver iterations or additional optimization solves. Experiments on LP benchmarks and a real-world QP task show that PEAR achieves the best decision quality among all baselines while being the most computationally efficient, with gains that persist under constraint shifts.
[LG-106] PACE: Parameter Change for Unsupervised Environment Design
链接: https://arxiv.org/abs/2605.01358
作者: Fang Yuan,Quanjun Yin,Siqi Shen,Yuxiang Xie,Junqiang Yang,Long Qin,Junjie Zeng,Qinglun Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Unsupervised Environment Design (UED) offers a promising paradigm for improving reinforcement learning generalization by adaptively shaping training environments, but it requires reliable environment evaluation to remain effective. However, existing UED methods evaluate environments using indirect proxy signals such as regret, value-based errors, or Monte Carlo, which suffer from bias, high variance, or substantial computational overhead and fail to reflect agent realized learning progress. To address these limitations, we propose Parameter Change Environment Design (PACE), which evaluates an environment through the policy parameter change induced by training on that environment, directly grounding environment selection in realized learning progress. Specifically, PACE assigns environment value using a first-order approximation of the policy optimization objective, where the improvement induced by an environment is proportional to the squared L2 norm of the corresponding parameter update, enabling low-variance and computation-efficient evaluation without additional rollouts. Experiments on MiniGrid and Craftax show that PACE consistently outperforms established UED baselines, achieving higher IQM and smaller Optimality Gap on OOD evaluations, including an IQM of 96.4% and an Optimality Gap of 17.2% on MiniGrid.
[LG-107] Robust Parameter Learning for Uncertain MDPs
链接: https://arxiv.org/abs/2605.01339
作者: Yannik Schnitzer,Alessandro Abate,David Parker
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning-based approaches to verifying unknown Markov decision processes (MDPs) often employ uncertain MDPs. These models use, for example, confidence intervals to capture transition uncertainty and allow synthesis of policies that are robust to this uncertainty. However, this approach typically quantifies uncertainty independently for individual transition probabilities, ignoring dependencies due to shared latent quantities. We propose to learn such models using parametric MDPs (pMDPs), where transition probabilities are expressions over a set of parameters. We project statistical uncertainty from empirical transition frequencies onto the pMDP’s parameter space, yielding a probably approximately correct (PAC) uncertainty model for the underlying MDP that respects the algebraic dependencies between transitions. The resulting models are algorithmically challenging to solve, so we propose a hierarchy of sound polytopic outer approximations of the induced confidence set. We implement and evaluate our approach, demonstrating substantially tighter uncertainty estimates than classical interval-based uncertain MDP learning techniques.
[LG-108] he Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
链接: https://arxiv.org/abs/2605.01311
作者: Jikai Jin,Vasilis Syrgkanis
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.
[LG-109] GA-VisAgent : A Multi-Agent application for code generation and visualization in interactive learning
链接: https://arxiv.org/abs/2605.01299
作者: Wang Jian,Zhou Jianbo,Xiong Yuhao,Liu Zhenxia,Luo Wen,Yuan LinWang,Yu ZhaoYuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Geometric Algebra (GA) presents challenges to learners due to its highly abstract mathematical structure and complex operational rules, as translating algebraic manipulations into concrete geometric interpretations is a non-intuitive process when developing related code. Currently, some existing GA software packages rely on manually written scripts for code generation and visualization, but their high learning curve hinders widespread adoption. Meanwhile, methods based on Large Language Models (LLMs) often produce logical errors when generating specific GA scripts, such as GAALOPScript, resulting in generally low accuracy. To address these issues, this study proposes GA-VisAgent – a multi-agent interactive learning application for GA code generation and visualization – building upon a Geometric algebra large language model (GAGPT). Integrating task planning mechanisms with ReAct reasoning strategies, GA-VisAgent can decompose complex operations into five standardized subtasks, including core operations like geometric products, rotations, and reflections. It supports natural language and mathematical formulas as input to automatically generate executable code, accompanied by interactive visualizations to aid user comprehension. Experimental results show that GA-VisAgent achieved a 90% code generation success rate across 40 typical Conformal GA tasks, representing a 70% improvement over GPT-4o. This application introduces an extensible new paradigm for teaching GA and developing visualization tools for related mathematical concepts. The online service for this project will be available at this http URL.
[LG-110] Congestion-Aware Dynamic Axonal Delay for Spiking Neural Networks
链接: https://arxiv.org/abs/2605.01291
作者: Dewei Bai,Hongxiang Peng,Yunyun Zeng,Ziyu Zhang,Hong Qu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spiking Neural Networks (SNNs) are widely regarded as an energy-efficient paradigm for modeling and processing temporal and event-driven information. Incorporating delays in SNNs has been proven to be an effective mechanism for improving spike alignment in event-driven tasks. However, existing delay learning approaches predominantly assign static delays to individual synapses, resulting in a large number of delay parameters and limited adaptability to input-dependent activity dynamics. To this end, we propose a Congestion-Aware Dynamic Axonal Delay mechanism, decomposing the delay into a channel-wise static base delay for temporal structuring and a global, activity-conditioned shift that dynamically regulates the state update rate under varying spike intensities. The delay parameters are learned using differentiable linear interpolation and discretized at inference time, preserving the benefits of our dynamic delay while incurring only minimal additional cost. Experiments on speech benchmarks, including the Spiking Heidelberg Dataset, Spiking Speech Commands, and Google Speech Commands, demonstrate that introducing congestion-aware delays into synaptic signal transmission effectively improves accuracy on temporal tasks, notably achieving 93.75% accuracy on SHD, 80.49% accuracy on SSC, and 95.53% on GSC-35, while reducing the parameter count by approximately 50% compared to state-of-the-art delay-based methods with the same architecture.
[LG-111] A Theory of Saddle Escape in Deep Nonlinear Networks
链接: https://arxiv.org/abs/2605.01288
作者: Divit Rawal,Michael R. DeWeese
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注:
Abstract:In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law \tau_\star = \Theta(\varepsilon^-(r-2)) governed by the number r of layers at the bottleneck scale rather than the total depth L . We find that this same r-2 exponent is recovered under He-normal initialization with r bottleneck layers rescaled by \varepsilon , where the symmetry manifold is preserved by the flow but not attracting. We find close agreement between our theory and numerical simulations.
[LG-112] Continuous Temporal Representations of Event-Based Signals via Interference-Based Wave Modeling
链接: https://arxiv.org/abs/2605.01270
作者: Magnus Bengtsson
类目: Machine Learning (cs.LG)
*备注: 18 pages, 3 figures, Submitted to Journal
Abstract:Spatio-temporal signals arising from event-driven biological processes, such as surface electromyography (sEMG), exhibit asynchronous and highly structured activation patterns that are challenging to model using conventional discrete or purely real-valued representations. In this work, we propose a continuous temporal modeling framework based on interference-based wave representations. The approach maps event-like input signals into a complex-valued latent wave field, where temporal structure is encoded through phase modulation and interactions between latent components. By projecting the resulting wave field onto an energy domain, the model induces structured activation patterns that capture both temporal localization and relational dependencies within finite observation windows, without relying on explicit recurrence or causal state propagation. The proposed formulation is particularly suited for event-driven biosignals, where continuous representations enable efficient gradient-based optimization and robust feature extraction. In particular, the method is designed to support learning from sEMG data for downstream control tasks in biomechanical systems, such as prosthetic devices and exoskeletons. Experimental results demonstrate that the proposed interference-based wave model provides improved representation quality compared to purely real-valued representations, while maintaining computational efficiency suitable for practical deployment. Comments: 18 pages, 3 figures, Submitted to Journal Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.01270 [cs.LG] (or arXiv:2605.01270v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01270 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Magnus Bengtsson [view email] [v1] Sat, 2 May 2026 06:04:55 UTC (2,003 KB) Full-text links: Access Paper: View a PDF of the paper titled Continuous Temporal Representations of Event-Based Signals via Interference-Based Wave Modeling, by Magnus BengtssonView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-05 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[LG-113] FeedbackLLM : Metadata driven Multi-Agent ic Language Agnostic Test Case Generator with Evolving prompt and Coverag e Feedback
链接: https://arxiv.org/abs/2605.01264
作者: Kushal Jasti,Tejamani Prashanth Sahu,Rishitha Pentyala,Muvvala Mohit,Vivek Yelleti
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Traditional approaches to test case generation often involve manual effort and incur significant computational overhead. Additionally, these approaches are not scalable, and hence, unsuitable for complex software systems. Recently, Large Language Models (LLMs) have been applied to software testing. However, single-shot prompt engineering-based approaches tend to hallucinate and generate redundant test cases, resulting in fewer branches. To handle the above-mentioned limitations, in this paper, we propose FeedbackLLM, a novel automated language-agnostic test case generation framework based on tightly coupled two-stage approach. In the first stage, FeedbackLLM extracts the input constraints by parsing source code and generates the possible test cases. The quality of the test cases is evaluated in the second stage by the following two specialized LLM feedback agents: (i) Line Feedback Agent: extracts the metadata related to missed line executions and (ii) Branch Feedback Agent: extracts the metadata of the unexecuted branch conditions. The above agents operate in a two-stage process, communicating in tandem, and this procedure is repeated for k-steps. Further, we also introduced a redundancy prevention cache to avoid duplicate API requests and avoid unnecessary execution cycles. The performance of the proposed architecture is evaluated on the standard benchmark programs related to C and Python programs. FeedbackLLM demonstrated more line and branch coverage than baseline tools while scaling linearly in execution time.
[LG-114] New Bounds for Kernel Sums via Fast Spherical Embeddings ICML2026
链接: https://arxiv.org/abs/2605.01263
作者: Tal Wagner
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:We study query time bounds for the fundamental problem of estimating the kernel mean \frac1|X|\sum_x\in X\mathbfk(x,y) of a query y in a finite dataset X\subset\mathbbR^d up to a prescribed additive error \varepsilon . The best known bounds for the Gaussian kernel are O(d/\varepsilon^2) , \widetilde O(d+1/\varepsilon^4) , and \widetilde O(d+\Delta^2/\varepsilon^2) , where \Delta is the diameter of a region containing the points. We prove the new bound \tilde O(d+\varepsilon\Delta^2+1/\varepsilon^3) , which improves over the previous ones in regimes with small error \varepsilon and intermediate diameter \Delta . At the center of our proof is a new fast spherical embedding theorem in the sense introduced by Bartal, Recht and Schulman (2011), which limits the embedded data diameter while preserving local Euclidean distances and avoiding ``distance collapse’’ at larger scales. This fast embedding theorem may be of independent interest. Comments: ICML 2026 Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2605.01263 [cs.DS] (or arXiv:2605.01263v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2605.01263 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-115] Activation Compression in LLM s: Theoretical Analysis and Efficient Algorithm
链接: https://arxiv.org/abs/2605.01255
作者: Wen-Da Wei,Han-Bin Fang,Yang-Di Liu,Jiang-Xin Shi,James Kwok,Yu-Feng Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training large language models (LLMs) is highly memory-intensive, as training must store not only weights and optimizer states but also intermediate activations for backpropagation. While existing memory-efficient methods largely focus on gradients and optimizer states, activation compression is less well established due to the lack of LLM-tailored theory and guarantees. In this work, we develop a theoretical framework showing that activation compression is safe for linear operators when activation compression is unbiased, but problematic for nonlinear ones. We further derive gradient variance bound and establish convergence guarantees for applying activation compression to all linear operators under the standard L -smoothness assumption, showing that it does not change the convergence rate. Guided by the theory, we propose an activation-gradient co-compression method that reuses low-rank activation factors to compress linear-layer gradients without extra computation or additional gradient error. We conduct extensive experiments on Qwen and LLaMA models using a pretraining benchmark and multiple fine-tuning benchmarks to validate our theory and demonstrate competitive performance of our method in both accuracy and compression efficiency. We provide our code in the supplementary material for reproducibility.
[LG-116] S3-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
链接: https://arxiv.org/abs/2605.01248
作者: Harsh Goel,Akhil Udathu,Susmija Jabireddy,Pradnesh Kalkar,Atharva Parulekar
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Reinforcement learning (RL) post-training has enabled newer capabilities in models, such as agentic tool-use for search. However, these models struggle primarily due to limitations with sparse outcome-based rewards and a lack of training data that encapsulates questions of differing hardness, which results in models not performing deeper searches with tools to collect evidence for question-answering. To address these limitations, we introduce S^3-R1 (Synthetic data and stabilized Search R1), a framework that couples a data-centric approach with denser learning signals. We first develop a synthetic generation and curation pipeline that programmatically derives diverse, multi-hop questions from existing documents. This pipeline incorporates a retrieval-based verification step to specifically isolate questions of intermediate difficulty. We then pair this expanded training set with a reward structure that evaluates both intermediate search quality and the correctness of the final answer. This setup directly mitigates the credit assignment problems inherent to sparse rewards. Our evaluations show that S^3-R1 outperforms existing baselines by learning more effective search and synthesis strategies, yielding up to a 10% improvement in robust generalization on out-of-domain datasets.
[LG-117] Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs ICML2026
链接: https://arxiv.org/abs/2605.01242
作者: Ruiquan Huang,Donghao Li,Yingbin Liang,Jing Yang
类目: Machine Learning (cs.LG)
*备注: accepted by ICML2026
Abstract:Reinforcement learning (RL) is a fundamental framework for sequential decision-making, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be efficiently solved. Motivated by this observation, we propose a novel optimistic actor-critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym environments.
[LG-118] CombinationTS: A Modular Framework for Understanding Time-Series Forecasting Models ICML2026
链接: https://arxiv.org/abs/2605.01231
作者: Xiaorui Wang,Fanda Fan,Chenxi Wang,Yuxuan Yang,Rui Tang,Kuoyu Gao,Simiao Pang,Yuanfeng Shang,Zhipeng Liu,Wanling Gao,Lei Wang,Jianfeng Zhan
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026 main track. Code available at this https URL
Abstract:Recent progress in time-series forecasting has led to rapidly increasing architectural complexity, yet many reported State-of-the-Art gains are statistically fragile or misattributed. We argue that progress requires a shift from model selection to modular attribution, identifying which components truly drive performance. We propose CombinationTS, a self-contained probabilistic evaluation framework that decomposes forecasting models into orthogonal modules–Input Transformation, Embedding, Encoder, Decoder, and Output Transformation–and evaluates them under a shared evaluation condition space. By quantifying each component via marginalized performance ( \mu ) and stability ( \sigma ), CombinationTS enables robust attribution beyond fragile point estimates. Through large-scale paired evaluation, we uncover the Identity Paradox: once the data view (Embedding) is well-designed, a parameter-free Identity Encoder often matches or outperforms complex backbones. We further show that explicit structural priors introduced via Input Transformations yield a more favorable performance-stability trade-off than increasing Encoder complexity, establishing a principled baseline for architectural necessity.
[LG-119] Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
链接: https://arxiv.org/abs/2605.01226
作者: Keyan Chen,Qiwei Yuan,Zhitong Xu,Bin Shen,Shandian Zhe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Events in spatiotemporal systems are ubiquitous, yet modeling their complex distributions remains challenging. Existing point process models often rely on strong structural assumptions and are typically limited to autoregressive, event-by-event prediction. As a result, they struggle to support broader inference tasks such as inverse inference, trajectory reconstruction, and recovery of missing event locations. We introduce Arbitrarily Conditioned Hierarchical Flows (ARCH), a hierarchical flow matching framework for spatiotemporal event modeling. ARCH is expressive enough to capture complex event distributions while enabling tractable and accurate computation of conditional intensities, which quantify instantaneous event risk. Built on a history-encoder-generative-decoder architecture, ARCH introduces a hybrid masking strategy for flexible conditioning on arbitrary observed events. This enables a unified treatment of forecasting, inverse inference, and partial trajectory recovery within a single framework. Experiments on synthetic and real-world datasets show that ARCH consistently outperforms existing baselines across both prediction and conditional inference tasks.
[LG-120] Local Hessian Spectral Filtering for Robust Intrinsic Dimension Estimation ICML2026
链接: https://arxiv.org/abs/2605.01221
作者: Genki Osada
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2026
Abstract:While diffusion models enable new approaches for estimating Local Intrinsic Dimension (LID), existing methods fail in high-dimensional spaces where noise from vast normal directions overwhelms the tangent signal. We propose Local Hessian Spectral Dimension (LHSD), which resolves this by applying spectral filtering to the log-density Hessian, explicitly cutting off large eigenvalues associated with normal directions to count zero-curvature tangent directions. Implemented using Stochastic Lanczos Quadrature (SLQ), LHSD avoids full Hessian construction, achieving linear scalability with dimension D . Experiments on synthetic and real data confirm LHSD’s superior robustness and its utility in detecting memorization in large-scale diffusion models.
[LG-121] Focus and Dilution: The Multi-stage Learning Process of Attention ICML2026
链接: https://arxiv.org/abs/2605.01199
作者: Zheng-An Chen,Pengxiao Lin,Zhi-Qin John Xu,Tao Luo
类目: Machine Learning (cs.LG)
*备注: ICML 2026 spotlight
Abstract:Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.
[LG-122] Linear-Readout Floors and Threshold Recovery in Computation in Superposition
链接: https://arxiv.org/abs/2605.01192
作者: Hector Borobia,Elies Seguí-Mas,Guillermina Tormo-Carbó
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 38 pages, preprint, no figures; comments welcome
Abstract:Two recent approaches to computation in superposition reach different recursive capacity regimes: Hänni et al. certify \tildeO(d^3/2) computable features in width d via an approximate-linear recursive template, while Adler and Shavit reach near-quadratic capacity (up to logarithmic factors) using thresholded Boolean recovery. The main contribution of this paper is conceptual: we argue these results are not contradictory because they maintain different interface invariants, and we formalize the distinction. As a tool, we record a rank-trace Welch-type lower bound for biorthogonal linear readouts: for F \gg d , the worst-case off-diagonal cross-talk of any unit-diagonal linear readout is \Omega(d^-1/2) , and the bound is tight on average for unit-norm tight frames. At quadratic feature load F=d^2 , random-support threshold recovery succeeds for sparsities s=O(d/\log d) , while linear readouts still incur \Omega(s/d) average per-coordinate squared error on Bernoulli sparse states. Matching the Welch floor against the published tolerance of the Hänni correction layer explains the d^3/2 scale as a compatibility threshold for that template, not a universal upper bound. Robust nonlinear reset beyond the Hänni template is left open. Comments: 38 pages, preprint, no figures; comments welcome Subjects: Machine Learning (cs.LG); Information Theory (cs.IT) ACMclasses: I.2.6; F.1.1 Cite as: arXiv:2605.01192 [cs.LG] (or arXiv:2605.01192v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01192 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-123] A Theory of Generalization in Deep Learning
链接: https://arxiv.org/abs/2605.01172
作者: Elon Litman,Gabe Guo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We present a non-asymptotic theory of generalization in deep learning where the empirical neural tangent kernel partitions the output space. In directions corresponding to signal, error dissipates rapidly; in the vast orthogonal dimensions corresponding to noise, the kernel’s near-zero eigenvalues trap residual error in a test-invisible reservoir. Within the signal channel, minibatch SGD ensures that coherent population signal accumulates via fast linear drift, while idiosyncratic memorization is suppressed into a slow, diffusive random walk. We prove generalization survives even when the kernel evolves \mathcalO(1) in operator norm, the full feature-learning regime. This theory naturally explains disparate phenomena in deep learning theory, such as benign overfitting, double descent, implicit bias, and grokking. Lastly, we derive an exact population-risk objective from a single training run with no validation data, for any architecture, loss, or optimizer, and prove that it measures precisely the noise in the signal channel. This objective reduces in practice to an SNR preconditioner on top of Adam, adding one state vector at no extra cost; it accelerates grokking by 5 \times , suppresses memorization in PINNs and implicit neural representations, and improves DPO fine-tuning under noisy preferences while staying 3 \times closer to the reference policy.
[LG-124] Metric-Normalized Posterior Leakage (mPL): Attacker-Aligned Privacy for Joint Consumption
链接: https://arxiv.org/abs/2605.01137
作者: Gaoyi Chen,Minghao Li,Weishi Shi,Yan Huang,Yusheng Wei,Sourabh Yadav,Chenxi Qiu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Metric differential privacy (mDP) strengthens local differential privacy (LDP) by scaling noise to semantic distance, but many machine learning (ML) systems are consumed under joint observation, where model-agnostic, per-record guarantees can miss leakage from evidence aggregation. We introduce metric-normalized posterior leakage (mPL), an attacker-aligned, distance-calibrated measure of posterior-odds shift induced by releases, and show that for single or independent releases, uniformly bounding mPL is equivalent to mDP. Under joint observation, however, satisfying mDP may still leave mPL high because learned aggregators compound evidence across correlated items. To make control practical, we formalize probabilistically bounded mPL (PBmPL), which limits how often mPL may exceed a target budget, and we operationalize it via Adaptive mPL (AmPL), a trust-and-verify framework that perturbs, audits with a learned attacker, and adapts parameters (with optional Bayesian remapping) to balance privacy and utility. In a word-embedding case study, neural adversaries violate mPL under joint consumption despite per-record mDP perturbations, whereas AmPL substantially lowers the frequency of such violations with low utility loss, indicating PBmPL as a practical, certifiable protection for joint-consumption settings.
[LG-125] Spectral Graph Sparsification Preserves Representation Geometry in Graph Neural Networks
链接: https://arxiv.org/abs/2605.01136
作者: Sanjukta Krishnagopal
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Spectral Theory (math.SP); Machine Learning (stat.ML)
*备注: 9 pages, 4 figures
Abstract:Spectral graph sparsification is a classical tool for reducing graph complexity while preserving Laplacian quadratic forms. In graph neural networks (GNNs), sparsification is often used to accelerate computation while maintaining predictive performance. In this work, we study a complementary representation-level question: does sparsification preserve the geometry of learned embeddings? For polynomial-filter GNNs, we prove that any \epsilon -spectral sparsifier induces O(\epsilon) perturbations in polynomial graph filters, multilayer hidden representations, and their Gram matrices. These guarantees imply stability of squared pairwise distances, class means, and covariance structure in embedding space. We further establish finite-time training stability: under smoothness and boundedness assumptions, gradient descent on dense and sparsified graphs produces weight trajectories whose separation grows at most proportionally to the sparsification distortion. Empirically, effective-resistance sparsification validates the predicted perturbation chain on synthetic graphs and preserves hidden representation geometry on real datasets. In our experiments, the gram matrix and training dynamics show low divergence even under substantial sparsification, consistent with the predicted stability under spectral sparsification. Hidden Gram preservation strongly predicts neighborhood preservation and class-centroid stability across FashionMNIST, Cora, and Paul15. Together, these results show that spectral sparsification preserves not only graph operators, but also the representation geometry that supports downstream use of GNN embeddings for interpretability. Comments: 9 pages, 4 figures Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Spectral Theory (math.SP); Machine Learning (stat.ML) Cite as: arXiv:2605.01136 [cs.LG] (or arXiv:2605.01136v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01136 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-126] Extreme Weather Bench: A framework and benchmark for evaluation of high-impact weather
链接: https://arxiv.org/abs/2605.01126
作者: Amy McGovern,Taylor Mandelbaum,Daniel Rothenberg,Nicholas Loveday,Corey Potvin,Montgomery Flora,Linus Magnusson,Eric Gilleland,John Allen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Forecasting the wide variety of high-impact weather events experienced globally is a challenge for both Artificial Intelligence (AI) and Numerical Weather Prediction (NWP) models and it is critical that such models be properly verified before deployment. Although AI weather models are rapidly evolving, much of their evaluation is currently done either with a global-scale evaluation or by hand-picking a small number of case studies or a region. A widely-used open-source benchmark suite focusing on high-impact weather will help to drive the science forward for all scales of weather models, as it has for other AI fields. Here we introduce Extreme Weather Bench (EWB), a new community-driven benchmark suite that facilitates model validation and verification on a variety of high-impact hazards that matter to people around the globe. EWB provides a standard set of case studies (spanning across multiple spatial and temporal scales and different parts of the weather spectrum), observational data, impact-based metrics, and open-source code for users to evaluate their models. Verifying that a model works against a standard set of case studies, especially events that are high-impact for the general public, is a key piece of improving the trustworthiness of AI models. EWB will help to drive the science forward for all weather models, enabling true comparisons across models and evaluating models on specific high-impact phenomena through the use of case studies. EWB is a free open-source community-driven system and will continue to evolve to include additional phenomena, test cases and metrics in collaboration with the worldwide weather and forecast verification community.
[LG-127] Machine Learning-Augmented Acceleration of Iterative Ptychographic Reconstruction
链接: https://arxiv.org/abs/2605.01122
作者: Bowen Zheng,Katayun Kamdin,David Shapiro,Alexander Ditter,Dayne Sasaki,Emma Bernard,Roopali Kukreja,Petrus H. Zwart,Slavomír Nemšák,Apurva Mehta,Nicholas Schwarz,Alexander Hexemer,Tanny Chavez
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注:
Abstract:Iterative ptychographic reconstruction algorithms are widely used for coherent diffractive imaging but can exhibit slow convergence under realistic experimental conditions. We propose a machine learning-augmented approach that accelerates iterative ptychographic reconstruction by introducing a learned fast-forward operator applied during reconstruction. Following an initial warm-up using standard iterations, the fast-forward operator advances the reconstruction toward a more converged state, after which conventional iterative updates are resumed. This strategy preserves the physical consistency and flexibility of established ptychographic solvers while reducing the number of iterations required for convergence. The model is trained on diverse ptychographic datasets and evaluated on experimental data acquired in a different year, demonstrating robustness and temporal generalization. Compared with conventional iterative solvers, the machine learning-augmented method achieves comparable reconstruction quality while converging faster in terms of Poisson negative log-likelihood, yielding over a two-fold reduction in wall-clock time. The approach has been integrated into an existing reconstruction pipeline and deployed in production at a synchrotron beamline, demonstrating practicality for real-time experimental operation.
[LG-128] opological Neural Tangent Kernel
链接: https://arxiv.org/abs/2605.01110
作者: Sanjukta Krishnagopal
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Algebraic Topology (math.AT); Machine Learning (stat.ML)
*备注: 9 pages 4 figures
Abstract:Graph neural tangent kernels give a principled infinite-width theory for graph neural networks, but inherit a basic limitation of graph models: they see only pairwise structure. Many relational systems contain higher-order interactions that are more naturally represented by simplicial complexes. We introduce the Topological Neural Tangent Kernel (TopoNTK), an infinite-width kernel for simplicial message passing on edge features. TopoNTK combines lower Hodge interactions, capturing graph-like coupling through shared vertices, with upper Hodge interactions, capturing coupling through filled simplices. This makes the kernel sensitive to topology invisible to graph kernels, allowing complexes with the same graph but different filled simplices to induce different kernels. Beyond expressivity, the Hodge structure gives the kernel an interpretable learning geometry. Edge signals decompose into gradient-like, harmonic, and local circulation components, and the spectrum of the TopoNTK determines how quickly each component is learned. This yields a topological form of spectral bias: components aligned with large-eigenvalue modes are learned quickly, while global harmonic modes, retained through the residual channel, often lie at smaller eigenvalues and are learned more slowly. We prove expressivity, Hodge-alignment, spectral learning, and stability properties, and validate them on synthetic simplicial tasks and DBLP higher-order link prediction. The results show that topology is not merely extra structure; it can provide coordinates that make relational learning more faithful, interpretable, and effective.
[LG-129] Diffusion Operator Geometry of Feedforward Representations
链接: https://arxiv.org/abs/2605.01107
作者: Kanishka Reddy
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注:
Abstract:Neural networks transform data through learned representations whose geometry affects separation, contraction, and generalization. Recent work studies this geometry using discrete curvature on neighborhood graphs, suggesting Ricci-flow-like behavior across layers. We develop a smooth operator-theoretic alternative for feedforward representation snapshots. Each feature cloud induces a Gaussian-kernel diffusion Markov operator, and transport, spectral, label-boundary, and local-scale observables are derived from this single object via Bakry-Emery \Gamma -calculus. In a balanced Gaussian class-conditional snapshot model with shared covariance, the population operator has closed-form class affinities, leakage, and coarse spectra, all controlled by pairwise regularized Mahalanobis separations c_\varepsilon^(a,b) . We also prove that the resulting operator observables vary smoothly under feature perturbations, while hard neighborhood-graph diagnostics can change discontinuously. Synthetic experiments validate the closed-form Gaussian bridge, while learned MNIST experiments show that the same operator observables track training, width, and perturbation stability. Together, these results give a stable operator-geometric framework for analyzing feedforward representation geometry.
[LG-130] Learning to Race in Minutes: Infoprop Dyna on the Mini Wheelbot
链接: https://arxiv.org/abs/2605.01096
作者: Devdutt Subhasish,Henrik Hose,Sebastian Trimpe
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Originally submitted to the German Robotics Conference, 2026
Abstract:Reinforcement Learning (RL) has the potential to enable robots with fast, nonlinear, and unstable dynamics to reach the limits of their performance. However, most recent advances rely on carefully designed physics-based simulators and domain randomization to achieve successful sim-to-real transfer within reasonable wall-clock time. In this work, we bypass the need for such simulators and demonstrate that Infoprop Dyna, a state-of-the-art uncertainty-aware model-based reinforcement learning (MBRL) framework, can enable robots to learn directly from real-world interactions. Using Infoprop Dyna, the Mini Wheelbot, an underactuated unicycle robot, learns to race around a track within 11 minutes of real-world experience.
[LG-131] Learning Discriminators for Resampling in the Ensemble Gaussian Mixture Filter through a Normalizing Flow Approach
链接: https://arxiv.org/abs/2605.01089
作者: Zain Jabbar,Andrey A. Popov
类目: Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)
*备注:
Abstract:The ensemble Gaussian mixture filter (EnGMF) is a powerful, convergent particle filter capable of medium-to-high dimensional non-linear filtering. The EnGMF relies on a resampling step that can generate physically unrealistic posterior samples, that would subsequently produce physically meaningless forecasts. This work introduces the discriminator-informed resampling procedure, that augments the posterior resampling step with a discriminator that accepts or rejects candidate particles based on their physical plausibility. In this work these discriminators are learned through a normalizing flow approach. Numerical experiments on both the Ikeda map and the Lorenz '63 system show that discriminator informed resampling procedure consistently reduces error relative to the standard EnGMF in low-ensemble regimes.
[LG-132] Networked Information Aggregation for Binary Classification ICML2026
链接: https://arxiv.org/abs/2605.01082
作者: MohammadHossein Bateni,Zahra Hadizadeh,MohammadTaghi Hajiaghayi,Mahdi JafariRaviz,Shayan Taherijam
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
*备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:We study networked binary classification on a directed acyclic graph (DAG) where each agent observes only a subset of the feature columns of a shared dataset. Agents act sequentially along the DAG: each receives prediction columns from its parents (if any), augments its local features with these columns, fits a logistic predictor by minimizing binary cross-entropy (BCE), and forwards its prediction column to its outgoing neighbors. We ask whether this sequential distributed training procedure achieves information aggregation, meaning that some agent attains small excess loss compared to the best logistic predictor trained with access to all feature columns. This question was studied for linear regression under squared loss by Kearns, Roth, and Ryu (SODA 2026). Extending their guarantees to classification is nontrivial because their analysis relies on quadratic structure that does not directly transfer to BCE with a logistic link. We analyze the resulting sequential logit-passing protocol and prove: (i) an excess loss upper bound of O(M/\sqrtD) on depth- D paths under the condition that every M contiguous subsequence of M agents collectively observe all features, and (ii) a close lower bound showing instances with excess loss of at least \Omega(k/D) where k is the dimension of the feature space. Together, these results identify network depth as a fundamental bottleneck for information aggregation in networked logistic regression. Comments: Accepted to the 43rd International Conference on Machine Learning (ICML 2026) Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH) Cite as: arXiv:2605.01082 [cs.LG] (or arXiv:2605.01082v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.01082 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-133] Benchmarking local Hebbian learning rules for memory storag e and prototype extraction
链接: https://arxiv.org/abs/2605.01074
作者: Anders Lansner,Andreas Knoblauch,Naresh B Ravichandran,Pawel Herman
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 31 pages, 9 + 2 suppl figures, 5 tables
Abstract:Associative memory or content-addressable memory is an important component function in computer science and information processing, and at the same time a key concept in cognitive and computational brain science. Many different neural network architectures and learning rules have been proposed to model the brain’s associative memory while investigating key component functions like figure-ground segmentation, perceptual reconstruction and rivalry. A less investigated but equally important capability of associative memory is prototype extraction where the training set comprises distorted prototype instances and the task is to recall the correct generating prototype given a new distorted instance. In this paper we benchmark associative memory function of seven different Hebbian learning rules employed in non-modular and modular recurrent networks with winner-take-all dynamics operating on moderately sparse binary patterns. We measure pattern storage and weight information capacity, prototype extraction capabilities, and sensitivity to correlations in data. The original additive Hebb rule comes out with worst capacity, covariance learning proves to be robust but with moderate capacity, and the Bayesian-Hebbian learning rules show highest capacity in almost all different conditions tested.
[LG-134] Deep Variational Inference Symbolic Regression ICLR2026
链接: https://arxiv.org/abs/2605.01067
作者: James Butterworth,Gevik Grigorian,Alejandro DiazDelaO
类目: Machine Learning (cs.LG)
*备注: Code: this https URL
Abstract:Symbolic regression discovers explicit, interpretable equations without assuming a functional form in advance. A Bayesian approach strengthens this through probability distributions over candidate expressions, thus quantifying uncertainty in the presence of noisy and limited data. Deep Symbolic Regression (DSR) uses a neural network to generate symbolic expressions, but it is designed to identify a single best-fitting expression rather than infer a posterior distribution over models. We introduce Deep Variational Inference Symbolic Regression (DVISR), a variational Bayesian extension of DSR. DVISR replaces the original reward with the integrand of the evidence lower bound. It also extends the network architecture to output distributions over constants within expressions, enabling posterior inference over both expression trees and their associated constants. We show that DVISR can recover the true posterior in simple settings, both with and without constant tokens, and we examine how its performance changes as the size of the expression space increases. These results position DVISR as a step toward scalable Bayesian symbolic regression with uncertainty over full symbolic models.
[LG-135] A dimensional R2 regression metric
链接: https://arxiv.org/abs/2605.01066
作者: Jaesung Yoo,Stefan Lemke,Jian Zhong Guo,Kanaka Rajan,Adam Hantman
类目: Machine Learning (cs.LG)
*备注:
Abstract:R2 score is the standard metric for evaluating regression tasks, offering a normalized magnitude-agnostic measure of accuracy that captures variance. However, R2 has three key limitations: it is limited to at most two dimensional inputs, it reduces the score to a single scalar that hides rich patterns of prediction accuracy, and it is sensitive to low-variance noise channels which can yield large, uninterpretable negative values. We introduce the Dimensional R2 score (Dim-R2), a simple extension of R2 that accepts data of arbitrary dimensionality, provides a multidimensional view of accuracy, and reduces sensitivity to noise. We demonstrate its advantages on both synthetic sinusoidal data and three multidimensional regression datasets. Dim-R2 offers an interpretable and flexible metric that highlights patterns in regression accuracy, guiding regression modeling.
[LG-136] SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data
链接: https://arxiv.org/abs/2605.01060
作者: Shashank Kapadia,Deep Narayan Mishra,Sujal Reddy Alugubelli,Ajay Kumar,Swapnil Yadav,Rishi Bhatia
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 15 pages, 10 figures, 11 tables
Abstract:We present SURGE, a streaming GPU encoding system deployed in production to generate embeddings for over 800 million texts across 40,000 logical partitions. Production embedding pipelines face a tension between logical data partitioning and efficient GPU utilization: processing each partition independently incurs P inter-process communication (IPC) calls whose overhead limits throughput for compute-light models. Our contributions are analytical: (i) a cost model (Theorem 1) predicting throughput within 2% across three encoders spanning a 15 \times parameter range; (ii) a memory-safety bound (Lemma 3) enabling a streaming two-threshold policy with peak memory O(B_\min + n_\max) rather than O(N) ; and (iii) a \phi /CV decision framework characterizing when the pattern applies beyond our workload. The naive fix of batching at fixed size requires O(N) peak memory (32.7 GB at 10M texts; infeasible beyond ~60M on 192 GB nodes), produces no output until all encoding completes, and offers no fault tolerance. SURGE achieves the same throughput with O(B_\min + n_\max) bounded memory (2.6 GB), 68 \times faster time-to-first-output, and crash recovery at SuperBatch granularity. On 10M texts with 4 NVIDIA L4 GPUs, SURGE delivers 26,413 texts/s – matching fixed-batch throughput while using 12.6 \times less memory. We validate on bge-base (109M, d =768, error 1.3%) and across log-normal \sigma in 1.0, 1.72, 2.5 (speedup invariant within \pm 3%), and compare against a partition-batched baseline (PB-PBP-LB), against which SURGE retains a 7% throughput edge and 2.5 \times faster TTFO. Complementary engineering – zero-copy Arrow serialization (22-25 \times speedup) and async I/O pipelining (up to 93% benefit) – realizes the design but is not the contribution.
[LG-137] Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning
链接: https://arxiv.org/abs/2605.01046
作者: Zhi-Quan Feng,Ying-Jia Lin,Hung-Yu Kao
类目: Machine Learning (cs.LG)
*备注:
Abstract:LoRA adapts large language models (LLMs) by restricting updates to low-rank subspaces of pre-trained weights. While this substantially reduces training cost, the effectiveness of adaptation critically depends on which subspace is chosen at initialization: a poor initialization that allocates capacity to task-irrelevant directions can severely hinder downstream performance. Existing initialization strategies primarily rely on the intrinsic properties of pre-trained weights, implicitly assuming that weight geometry alone reflects task relevance. However, such criteria overlook how the model interacts with the downstream data distribution. In this work, we formulate LoRA initialization as identifying the degree of impact of directions in parameter space under the target data distribution. We argue that data-aware sensitivity, rather than weight-only magnitude, should govern the choice of adaptation subspaces. Building on this perspective, we propose a Fisher-guided framework that leverages curvature information induced by downstream data to characterize how parameter perturbations influence model predictions. This perspective yields a principled, task-dependent criterion for selecting LoRA directions that better align adaptation with the target objective. Empirical results across diverse tasks and modalities demonstrate that data-aware initialization consistently and significantly improves downstream performance over existing approaches.
[LG-138] Differentiable Multiphysics Co-Optimization via Implicit Neural Representations: A Transient Hamburger-Cooking Benchmark
链接: https://arxiv.org/abs/2605.01040
作者: Navid Zobeiry
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: Preprint. 24 pages, 5 figures
Abstract:The co-optimization of geometry and physical parameters remains challenging in transient multiphysics systems involving moving boundaries, nonlinear material response, phase transitions, and competing objectives. Existing methods often optimize geometry and physical variables separately, rely on simplified steady-state physics, or require offline data generation and reduced design spaces. Here, we present an end-to-end differentiable co-optimization framework that couples an implicit neural representation of geometry with a JAX-compiled Eulerian multiphysics solver. Geometry is represented as a signed distance field using Fourier-feature-encoded spatial coordinates, while boundary conditions, initial conditions, process controls, and material parameters are optimized within the same differentiable loop. Continuous relaxations represent non-smooth physical transitions while preserving compatibility with reverse-mode automatic differentiation and backpropagation through time. We demonstrate the framework using a transient hamburger-cooking benchmark, selected as an interpretable multiphysics problem rather than a culinary optimization exercise. The benchmark combines conductive and convective heat transfer, latent energy effects, moisture and fat transport, shrinkage-induced geometry evolution, evolving contact boundary conditions, flipping-induced boundary-condition changes, and competing quality objectives. Results show that geometry-only optimization modifies shape to relieve thermal bottlenecks, while joint co-optimization distributes the design response across geometry, material state, process variables, and boundary conditions through gradients propagated over the full transient rollout.
[LG-139] Finite-Sample Analysis of Elimination in Active Hypothesis Testing
链接: https://arxiv.org/abs/2605.01039
作者: Ziyuan Lin,Hoang Ngoc Nguyen,Jie Xu,Ivan Ruchkin
类目: Machine Learning (cs.LG)
*备注: Submitted to IEEE Conference on Decision and Control (CDC) 2026. 18 pages, 4 figures
Abstract:A fixed-confidence, finite-sample problem of active hypothesis testing arises in many safety-critical applications. Situated in the context of sequential hypothesis testing, this paper studies the effect of hypothesis elimination on the stopping time. We introduce an elimination-augmented Track-and-Stop algorithm, in which champion-specific active-opponent sets are progressively pruned, and sensing effort is reallocated toward the surviving alternatives. Our analysis derives a non-asymptotic upper bound on the expected stopping time. The gain in finite-sample from elimination appears on the scale of the non-leading term, resulting from tighter tracking and concentration constants on the reduced hypothesis set. Furthermore, we introduce an aggressiveness parameter to modulate the trade-off between faster elimination and weaker confidence guarantee. An experimental study on synthetic Gaussian instances confirms the theoretical predictions.
[LG-140] Continual Learning of Feedback-based Molecular Communication
链接: https://arxiv.org/abs/2605.01020
作者: Siddhant Setia,Junichi Suzuki,Tadashi Nakano
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures. To be published in Proceedings of International Conference on Bio-inspired Information and Communications Technologies 2025
Abstract:This paper proposes and evaluates a new performance estimation method that leverages continual learning (CL) algorithms to carry out sequential simulation experiments for a feedback-based molecular communication protocol. As the protocol is sequentially examined in various experimental settings, the proposed CL-based performance estimators incrementally learn a series of unexperienced estimation tasks without compromising those that have been learned in the past. They are designed to work on a standard neural network architecture by customizing regularization and replay strategies in the loss function. Experimental results demonstrate that the proposed estimators can effectively learn on a continuous stream of simulation results and enhance the baseline neural network by improving estimation accuracy at a variety of computational costs. This paper’s contribution is to establish the implications of CL in the field of molecular communication.
[LG-141] Robust volatility updates for Hierarchical Gaussian Filtering
链接: https://arxiv.org/abs/2605.00966
作者: Christoph Mathys,Nicolas Legrand,Peter Thestrup Waade,Nace Mikus,Lilian Aline Weber
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:
Abstract:Hierarchical Gaussian Filtering (HGF) networks allow for efficient updating of posterior distributions (beliefs) about hidden states of an agent’s environment. HGF parent nodes can target the mean or variance of their children. New information entering at input nodes leads to a cascade of belief updates across the network according to one-step update equations for each node’s mean and precision (inverse variance). However, the original form of the update equations for variance-targeting parents(volatility coupling) can in some regions of parameter space lead to negative posterior precision, a logical impossibility which causes the updating algorithm to terminate with an error. In this report, we introduce a modified quadratic approximation to the variational energy of volatility-coupled nodes that avoids negative posterior precision. The key idea is to interpolate between two quadratic expansions of the variational energy: one at the prior prediction and one at a second mode whose location is obtained in closed form via the Lambert W function. The resulting update equations are robust across the entire parameter space and faithfully track the variational posterior even for large prediction errors.
[LG-142] PPO guided Agent ic Pipeline for Adaptive Prompt Selection and Test Case Generation
链接: https://arxiv.org/abs/2605.00942
作者: Gourisetty Venkata Sai Koushik,Dama Aditya,Mahankali Harish Sai,Peddi Siddarhta,Shadab Ahmad,Vivek Yelleti
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Developing effective test cases capable of thoroughly exercising large-scale software systems is inherently difficult, especially if such systems have voluminous, complex, and deeply nested source codes. In this work, we present a novel approach for generating test cases using a reinforcement learning-driven agentic framework where Proximal Policy Optimization (PPO) is coupled with an LLM engine to guide prompt selection during test generation. Our approach consists of two phases. In Phase I, the ToT-guided optimization agent partitions and minimizes the source code by removing redundancies without changing the functional behavior of the source code. In Phase II, a PPO-based policy network is trained to solve the problem of selecting prompts among eight different prompting techniques, such as Boundary Value Analysis, Random Fuzzing, etc., based on the inputted 11-dimensional state vector representing the source code complexity metrics and live coverage metrics to direct the LLM engine towards exploring unvisited paths in the program. The PPO agent receives rewards based on a combination of increases in line and branch coverages, penalties for unexplored branches, and rewards for reducing source code length. From experiments conducted on twenty benchmark programs, it is evident that the proposed approach, PPO-LLM, outperforms CBMC, kS-LLM, and kS-LLM++ in terms of branch and line coverage in almost all cases, for various loop bound values ranging from BOUND~1 to BOUND~2000. While at BOUND~1, the coverage of branches is 100% using PPO-LLM on the PALS suite, in comparison, it is around 86.8% using kS-LLM++. This confirms that adaptive prompt selection driven by PPO substantially outperforms static prompting strategies on PALS type programs.
[LG-143] Structured Analytic Coherent Point Drift for Non-Rigid Point Set Registration
链接: https://arxiv.org/abs/2605.00934
作者: Wei Feng,Haiyong Zheng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce Analytic-CPD, a structured analytic variant of coherent point drift for non-rigid point set registration. The method retains the CPD posterior correspondence layer, but replaces the point-indexed Gaussian-kernel displacement-field M-step with a finite-dimensional structured analytic mapping estimator. Posterior probabilities from the Gaussian mixture model are condensed through a barycentric identity into weighted soft target points, converting the CPD pairwise soft-correspondence objective into a weighted analytic fitting problem. The deformation is represented by a truncated multivariate Taylor mapping of a vector-valued function, so the number of deformation parameters is controlled by the ambient dimension and the analytic order rather than by an M-by-M kernel system over the moving points. A degree-continuation strategy is further introduced to stabilize large-deformation registration by progressively activating higher-order analytic modes. Experiments on two-dimensional analytic deformations and three-dimensional smooth non-analytic deformations show that Analytic-CPD achieves lower final errors and faster convergence than standard CPD in representative large-deformation settings. The results suggest that CPD-style probabilistic correspondences and structured analytic mappings provide a compact and interpretable alternative to kernel-based non-rigid registration. Code is available at this https URL.
[LG-144] Hierarchical Federated Learning for Networked AI: From Communication Saving to Architecture-Aware Design
链接: https://arxiv.org/abs/2605.00931
作者: Seyed Mohammad Azimi-Abarghouyi,Mehdi Bennis,Leandros Tassiulas
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*备注:
Abstract:Federated learning (FL) is fundamentally a distributed optimization problem executed by communicating agents with local data, local computation, and partial system visibility. Once FL is viewed through that lens, hierarchy is not merely a scalability mechanism. It becomes the natural place to rethink how distributed optimization should be organized over real multi-tier networks. This article argues that hierarchical federated learning (HFL) should move beyond its common framing as a communication-saving protocol and instead be viewed as an architecture-aware design framework for networked AI. The framework is organized around three coupled design axes: architectural parameters, layer-wise optimization decomposition, and layer-wise communication realization. The first axis determines the coordination geometry of learning through hierarchy depth, layer asymmetry, and layered connectivity. The second determines how the global FL objective is decomposed across layers and highlights modular multi-layer optimization as a major opportunity beyond one dominant method everywhere. The third determines how the distributed optimization is physically realized under heterogeneous communication regimes, from interference-limited lower tiers to reliable upper tiers. A central message is that, in HFL, convergence becomes architecture-dependent: it is directly shaped by the chosen hierarchy, the assigned optimization roles, and the communication mechanisms that connect them. We develop this viewpoint using large-scale wireless edge intelligence as a flagship networked AI setting, then provide a comparative perspective on flat FL, two-tier HFL, and deep HFL together with a regime-oriented design map. The resulting perspective positions HFL as a practical methodology for designing future networked AI systems.
[LG-145] A Review of the Receiver Operating Characteristic Curve and a Proof About the Area Beneath It
链接: https://arxiv.org/abs/2605.00926
作者: Steven Redolfi
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:The Receiver Operating Characteristic (ROC) curve of a binary classifier has often been utilized to measure the performance of the classifier. The area beneath this curve is used in particular because of its quoted probabilistic interpretation as being equal to the probability that the classifier will rank a random positive observation above a random negative observation. This paper formalizes this claim, produces a bound on how far away from the truth it is if a hypothesis is not met, and gives a small literature review of the ROC curve.
[LG-146] Adaptive Alarm Threshold Prediction in 4G Mobile Networks: A Percentile-Guided Deep Learning Framework with Interpretable Outputs
链接: https://arxiv.org/abs/2605.00838
作者: Ayon Roy,Sadman Sharif,Shiva Prasad Sarkar
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 21 pages, 8 figures, preprint
Abstract:In mobile telecommunications, alarms act as early warning signals. They are triggered when a cell, the basic unit of radio coverage, shuts down or behaves abnormally. This signals a degradation in service quality, which directly affects the customer experience. To fix the issue, operators rely on preset thresholds to decide when an engineer should be sent out. In practice, these thresholds are set manually and remain fixed regardless of the time of day, traffic levels, or overall network conditions. This often leads to serious faults slipping through during busy hours, while minor issues can cause unnecessary callouts when the network is quiet. This paper presents a machine learning framework that automatically predicts four alarm thresholds, audit window duration, inactive time limit, total fluctuation count, and per hour fluctuation limit, from live network behavior. Since no ground truth labels exist for thresholds, we introduce a percentile guided label derivation strategy and evaluate four models on an anonymized dataset of 10,648 cells across three vendors and nine regions from a real 4G network, comprising a Gradient Boosted Trees baseline, a CNN-BiLSTM with attention, the proposed PCTN, and an iTransformer. PCTN performs the best overall with respect to three of the four targets, outperforming a state-of-the-art iTransformer while using 83 percent fewer parameters. Its mixed output heads and dynamic alpha mechanism produce thresholds that are both accurate and interpretable, allowing operators to inspect and adjust the learned policy without retraining. All comparisons are statistically significant at p 0.001. The framework undergoes daily retraining using new data, which enables the thresholds to constantly adjust to changes in the network.
[LG-147] Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions FAST
链接: https://arxiv.org/abs/2605.00837
作者: Hao Xiao
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures, code at this https URL
Abstract:Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log-domain, enabling robust computation for regularization parameters as small as epsilon = 10^-4 where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves 12x speedup over the widely-used POT library and 5.9x speedup over GPU-accelerated PyTorch baselines, while consuming only 256 MB of GPU memory. We validate our solver on image color transfer, 3D point cloud matching, and convergence analysis, demonstrating that native CUDA kernels with careful numerical treatment provide a practical and efficient foundation for large-scale optimal transport computation.
[LG-148] From Euler to Dormand-Prince: ODE Solvers for Flow Matching Generative Models
链接: https://arxiv.org/abs/2605.00836
作者: Hao Xiao
类目: Machine Learning (cs.LG)
*备注: 14 pages, 10 figures, code at this http URL
Abstract:Sampling from Flow Matching generative models requires solving an ordinary differential equation (ODE) whose computational cost is dominated by neural network forward passes. We derive four classical ODE solvers – Euler, Explicit Midpoint, Classical Runge-Kutta (RK4), and Dormand-Prince 5(4) – from first principles via Taylor expansion, implement them from scratch in PyTorch, and systematically benchmark their efficiency on Conditional Flow Matching tasks ranging from 2D toy distributions to MNIST digits. On the quantitative side, we use sliced Wasserstein distance to construct NFE-quality Pareto frontiers,finding that RK4 at 80 function evaluations achieves sample quality comparable to Euler at 200. Beyond reproducing known convergence rates, we report two empirical observations: (1) the Jacobian eigenvalue spectrum of the learned velocity field stiffens sharply near t=1, explaining why the adaptive Dormand-Prince solver automatically concentrates its step budget at the end of the trajectory; (2) the quality gap between low-order and high-order solvers widens for undertrained and smaller models, indicating that solver choice matters most when the model is imperfect. Code and all experiment scripts are publicly available.
[LG-149] Sparse Regression under Correlation and Weak Signals: A Reproducible Benchmark of Classical and Bayesian Methods
链接: https://arxiv.org/abs/2605.00835
作者: Hao Xiao
类目: Machine Learning (cs.LG)
*备注: 14 pages, 8 figures, 6 tables. Code: this https URL
Abstract:Choosing between classical and Bayesian sparse regression methods involves a real trade-off: penalized estimators like Lasso run in milliseconds but give no uncertainty estimates,while Horseshoe and Spike-and-Slab priors produce full posteriors but need MCMC chains that take minutes per this http URL few studies compare these two families head-to-head under the conditions that actually make sparse regression hard – correlated features, weak signals, and growing dimensionality. We benchmark six methods (OLS, Ridge,Lasso, Elastic Net, Horseshoe, Spike-and-Slab) on synthetic data with three covariance structures (rho up to 0.9), four SNR levels, and p in 20, 50, 100, plus the Diabetes dataset,totalling over 2,600 experiments. The results are clear on some points and nuanced on others. Bayesian methods win on prediction error (MSE 72 vs. 108-267), and the Horseshoe delivers near-nominal 95% coverage (94.8%). But Spike-and-Slab,despite narrower intervals, under-covers at 91.9% – its continuous relaxation likely plays a role. For variable selection, Lasso and Spike-and-Slab tie at F1 ~ 0.47, making Lasso the practical default when posteriors are not needed. Code and data are available at this https URL.
[LG-150] Polynomial-Time Optimal Group Selection via the Double-Commutator Eigenvalue Problem
链接: https://arxiv.org/abs/2605.00834
作者: Mitchell A. Thornton
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Information Theory (cs.IT)
*备注:
Abstract:The algebraic diversity framework replaces temporal averaging over multiple observations with algebraic group action on a single observation for second-order statistical estimation. The central open problem in this framework is \textitgroup selection : given an M -dimensional observation with unknown covariance structure, find the finite group whose spectral decomposition best matches the covariance. Naive enumeration of all subgroups of the symmetric group S_M requires exponential time in M . We prove that this combinatorial problem reduces to a generalized eigenvalue problem derived from the double commutator of the covariance matrix, yielding a polynomial-time algorithm with complexity O(d^2M^2 + d^3) , where d is the dimension of a generator basis. The minimum eigenvector of the double-commutator matrix directly constructs the optimal group generator in closed form, with no iterative optimization. The reduction is exact: the double-commutator minimum eigenvalue is zero if and only if the optimal generator lies in the span of the basis, and its magnitude provides a certifiable optimality gap when it does not. This problem does not appear in the standard catalogs of computational complexity (Garey and Johnson, 1979) and represents a new class linking group theory, matrix analysis, and statistical estimation. We establish connections to independent component analysis (JADE), structured matrix nearness problems, and simultaneous matrix diagonalization, and we show that the double-commutator formulation is the unique approach that is simultaneously polynomial-time, closed-form, and certifiable.
[LG-151] Multi-fidelity surrogates for mechanics of composites: from co-kriging to multi-fidelity neural networks
链接: https://arxiv.org/abs/2605.02871
作者: Haizhou Wen,Elham Kiyani,Gang Li,Srikanth Pilla,George Em Karniadakis,Zhen Li
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: 64 pages, 18 figures. Submitted to Composites Part B: Engineering
Abstract:Composite materials exhibit strongly hierarchical and anisotropic properties governed by coupled mechanisms spanning constituents, plies, laminates, structures, and manufacturing history. This intrinsic complexity makes predictive modeling of composites expensive, because repeated experiments and high-fidelity simulations are needed to cover large design spaces of material, structure, and manufacturing. Multi-fidelity surrogate modeling addresses this challenge by combining abundant, less expensive data with limited high-accuracy data to recover reliable high-fidelity predictions. This review presents a structured overview of multi-fidelity modeling for composite mechanics, covering Gaussian-process or Kriging-based methods, including co-Kriging, coregionalization models, autoregressive formulations, nonlinear autoregressive Gaussian processes, multi-fidelity deep Gaussian processes, and multi-fidelity neural networks. Their distinctions are examined in terms of cross-fidelity correlation, discrepancy representation, uncertainty quantification, and scalability. Selected examples of their applications to composites are introduced according to the roles that multi-fidelity surrogates play in engineering problems, including forward prediction for rapid exploration of material design spaces, inverse optimization for composite parameter identification and design search under limited high-fidelity access, and workflow integration, where heterogeneous data sources, constraints, and validation requirements determine model utility. Open question discussions highlight recurring challenges specific to composites, such as regime-dependent fidelity gaps associated with nonlinear damage and manufacturing history, mismatches between simulations and experiments, and uncertainty propagation across multi-fidelity models.
[LG-152] Universality in Deep Neural Networks: An approach via the Lindeberg exchange principle
链接: https://arxiv.org/abs/2605.02771
作者: Filippo Giovagnini,Sotirios Kotitsas,Marco Romito
类目: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages, 2 figures
Abstract:We consider the infinite-width limit of a fully connected deep neural network with general weights, and we prove quantitative general bounds on the 2 -Wasserstein distance between the network and its infinite-width Gaussian limit, under appropriate regularity assumptions on the activation function. Our main tool is a Lindeberg principle for Deep Neural Networks, which we use to successively replace the weights on each layer by Gaussian random variables.
[LG-153] Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models INTERSPEECH2026
链接: https://arxiv.org/abs/2605.02715
作者: Sandra Arcos-Holzinger,Sarah M. Erfani,James Bailey,Sanjeev Khudanpur
类目: Audio and Speech Processing (eess.AS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submitted to Interspeech 2026
Abstract:Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibility into local geometric changes. We ask: how do perturbations deform local geometry, and do these shifts track downstream automatic speech recognition (ASR) degradation? To address this, we present GRIDS, a framework using Local Intrinsic Dimensionality (LID) across layer-wise representations in WavLM and wav2vec 2.0. We find that LID increases for all low signal-to noise ratio (SNR) perturbations and diverges at high SNR: benign noise converges toward the clean profile, while adversarial inputs retain early-layer LID elevation. We show LID elevation co-occurs with increased WER, and that layer-wise LID features enable anomaly detection (AUROC 0.78-1.00), opening the door to transcript-free monitoring in S3Ms.
[LG-154] Robust and Fast Training via Per-Sample Clipping
链接: https://arxiv.org/abs/2605.02701
作者: Davide Nobile,Philipp Grohs
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation, applying clipping at the mini-batch level can improve training performance while incurring virtually no additional computational cost. This finding is particularly interesting, as it contradicts the common practice of applying clipping only after all accumulation steps have been completed.
[LG-155] Random-Effects Algorithm for Random Objects in Metric Spaces
链接: https://arxiv.org/abs/2605.02693
作者: Marcos Matabuena,Mateo Cámara
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Across many scientific disciplines, multiple observations are collected from the same experimental units, and in modern datasets these observations often arise as non-Euclidean random objects. In such settings, the incorporation of random effects is a critical modeling step for efficient estimation and personalized prediction. Although mixed-effects models are well established for scalar outcomes and, more recently, for functional data in Hilbert spaces, general random-effects frameworks for objects in metric spaces remain underdeveloped. In this paper, we propose a nonlinear Fréchet-based algorithm for random-effects modeling of arbitrary random objects defined on a metric space. Using M-estimation theory, we establish conditions under which the proposed metric-space prediction target is consistently estimated under a working random-effects formulation. We then evaluate the empirical performance of the proposed method using both synthetic data and digital health datasets that require practical tools for analyzing random objects in metric spaces, such as multivariate probability distributions and random graphs. We show that, although our method is developed beyond Hilbert spaces, it can outperform existing Hilbert space-based methods.
[LG-156] ParaRNN: An Interpretable and Parallelizable Recurrent Neural Network for Time-Dependent Data
链接: https://arxiv.org/abs/2605.02692
作者: Yuxi Cai,Lan Li,Feiqing Huang,Guodong Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The proliferation of large-scale and structurally complex data has spurred the integration of machine learning methods into statistical modeling. Recurrent neural networks (RNNs), a foundational class of models for time-dependent data, can be viewed as nonlinear extensions of classical autoregressive moving average models. Despite their flexibility and empirical success in machine learning, RNNs often suffer from limited interpretability and slow training, which hinders their use in statistics. This paper proposes the Parallelized RNN (ParaRNN), a novel model composed of multiple small recurrent units. ParaRNN admits an additive representation that decouples recurrent dynamics into interpretable components, whose behavior can be characterized through recurrence features. This interpretability enables its applications in nonparametric regression for time-dependent data, while the design also allows efficient parallelization. The approximation capacity and non-asymptotic prediction error bounds in a nonparametric regression setting are established for ParaRNN. Empirical results on three sequential modeling tasks further demonstrate that ParaRNN achieves performance comparable to vanilla RNNs while offering improved interpretability and efficiency.
[LG-157] Online Generalised Predictive Coding
链接: https://arxiv.org/abs/2605.02675
作者: Mehran H. Z. Bazargani,Szymon Urbas,Adeel Razi,Thomas Brendan Murphy,Karl Friston
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 45 pages, 17 Figures
Abstract:This paper introduces an extension of generalised filtering for online applications. Generalised filtering refers to data assimilation schemes that jointly infer latent states, learn unknown model parameters, and estimate uncertainty in an integrated framework – e.g., estimate state and observation noise – at the same time (i.e., triple estimation). This framework appears across disciplines under different names, including variational Kalman-Bucy filtering in engineering, generalised predictive coding in neuroscience, and Dynamic Expectation Maximisation (DEM) in time-series analysis. Here, we specialise DEM for ``online’’ data assimilation, through a separation of temporal scales. We describe the variational principles and procedures that allow one to assimilate data in a way that allows for a slow updating of parameters and precisions, which contextualise fast Bayesian belief updating about the dynamic hidden states. Using numerical studies, we demonstrate the validity of online DEM (ODEM) using a non-linear – and potentially chaotic – generative model, to show that the ODEM scheme can track the latent states of the generative process, even when its functional form differs fundamentally from the dynamics of the generative model. Framed from a neuro-mimetic predictive coding perspective, ODEM offers a biologically inspired solution to online inference, learning, and uncertainty estimation in dynamic environments.
[LG-158] RACED: In vivo imaging of extracellular intrinsic diffusivity tortuosity cell size distribution and cell density in human glioma patients
链接: https://arxiv.org/abs/2605.02615
作者: Joshua K. Marchant,Hong-Hsi Lee,Elizabeth R. Gerstner,Susie Y. Huang,Bruce R. Rosen
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 14 pages, 8 figures (main); 2 pages, 4 figures (supplementary). Submitted to Magnetic Resonance in Medicine
Abstract:The lack of analytical models describing diffusion time dependence at intermediate time scales in complex tissue microstructure limits the accurate quantification of extracellular diffusivity and tissue microstructure. We introduce TRACED, a biophysical model that incorporates diffusion time dependence in cell distributions to quantify pathologically-relevant properties in solid tumors. Neural networks were trained on Monte Carlo diffusion simulations using sphere distribution-based geometries to enable the rapid computation of time-dependent diffusion MRI signals in cell populations of variable cell size. Model sensitivity and fit performance were assessed via simulation. Diffusion data from eight mixed-grade glioma patients was fitted using the TRACED model. Data fitting was performed using a novel physics-informed transfer learning pipeline, Sim2PINN. In two patients, cell size measurements were compared directly with image-localized histology. Simulation results indicate improved parameter estimation compared to the simple two-compartment model. TRACED enabled the simultaneous in vivo quantification of intracellular volume fraction, cell size distribution, extracellular intrinsic diffusivity, and tortuosity in glioma patients. Neural network implementations of diffusion time-dependence and tortuosity showed behavior consistent with coarse-graining and effective medium theory, respectively. Future work will explore the clinical utility of TRACED parameters in additional patients.
[LG-159] Black-box optimization of noisy functions with unknown smoothness NEURIPS2015
链接: https://arxiv.org/abs/2605.02462
作者: Jean-Bastien Grill,Michal Valko,Rémi Munos
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published in Neural Information Processing Systems (NeurIPS 2015)
Abstract:We study the problem of black-box optimization of a function f of any dimension, given function evaluations perturbed by noise. The function is assumed to be locally smooth around one of its global optima, but this smoothness is unknown. Our contribution is an adaptive optimization algorithm, POO or parallel optimistic optimization, that is able to deal with this setting. POO performs almost as well as the best known algorithms requiring the knowledge of the smoothness. Furthermore, POO works for a larger class of functions than what was previously considered, especially for functions that are difficult to optimize, in a very precise sense. We provide a finite-time analysis of POO’s performance, which shows that its error after n evaluations is at most a factor of sqrt(ln n) away from the error of the best known optimization algorithms using the knowledge of the smoothness.
[LG-160] Middle-mile logistics through the lens of goal-conditioned reinforcement learning NEURIPS
链接: https://arxiv.org/abs/2605.02461
作者: Onno Eberhard,Thibaut Cuvelier,Michal Valko,Bruno De Backer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at Neural Information Processing Systems (NeurIPS) 2023 Workshop on Goal-Conditioned Reinforcement Learning
Abstract:Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with model-free RL, extracting small feature graphs from the environment state.
[LG-161] Active multiple matrix completion with adaptive confidence sets AISTATS
链接: https://arxiv.org/abs/2605.02458
作者: Andrea Locatelli,Alexandra Carpentier,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published at International Conference on Artificial Intelligence and Statistics (AISTATS) 2019
Abstract:In this work, we formulate a new multi-task active learning setting in which the learner’s goal is to solve multiple matrix completion problems simultaneously. At each round, the learner can choose from which matrix it receives a sample from an entry drawn uniformly at random. Our main practical motivation is market segmentation, where the matrices represent different regions with different preferences of the customers. The challenge in this setting is that each of the matrices can be of a different size and also of a different rank which is unknown. We provide and analyze a new algorithm, MAlocate that is able to adapt to the unknown ranks of the different matrices. We then give a lower-bound showing that our strategy is minimax-optimal and demonstrate its performance with synthetic experiments.
[LG-162] Denoising data using convex relaxations
链接: https://arxiv.org/abs/2605.02327
作者: Charles Fefferman,Aalok Gangopadhyay,Matti Lassas,Jonathan Marty,Hariharan Narayanan
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 38 pages, 6 figures
Abstract:We study the problem of denoising observations (Y_i=X_i+Z_i), where the latent variables (X_i) are sampled from a low-dimensional manifold in (\mathbbR^n) and the noise variables (Z_i) are isotropic Gaussian. We propose a convex-relaxation estimator that first reduces dimension by principal component analysis and then projects the observations onto the convex hull of the projected latent manifold. We construct a statistical oracle that estimates its supporting hyperplanes from empirical Gaussian tail probabilities of the noisy sample. Under a lower-mass condition on the latent distribution, we prove finite-sample guarantees for the oracle and derive error bounds for the resulting denoiser. The analysis combines risk bounds for least-squares projection under convex constraints with entropy bounds for convex hulls. We also verify the assumptions of the framework for a Cryo-Electron Microscopy observation model by establishing suitable covering number and Lipschitz estimates for the associated group action and imaging operators.
[LG-163] Foundations of Riemannian Geometry for Riemannian Optimization: A Monograph with Detailed Derivations
链接: https://arxiv.org/abs/2605.02279
作者: Benyamin Ghojogh
类目: Differential Geometry (math.DG); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 143 pages; expository and implementation-oriented monograph with detailed derivations
Abstract:Riemannian geometry provides the fundamental framework for optimization on nonlinear spaces such as matrix manifolds, which arise in machine learning, signal processing, and robotics. While the underlying theory is classical, existing literature often presents results at a high level of abstraction, omitting the detailed coordinate-level derivations required for implementation and algorithm development. This work provides a self-contained and rigorous treatment of the foundations of Riemannian geometry, with a focus on explicit derivations tailored to Riemannian optimization. We systematically develop the key geometric structures – including tangent and cotangent spaces, tensor calculus, metric tensors, Levi-Civita connections, curvature, and geodesics – emphasizing step-by-step derivations in coordinates and matrix form. Building on these foundations, we derive the Riemannian gradient, Hessian, exponential map, and retraction in a form suitable for numerical computation. We further specialize these constructions to important matrix manifolds, including the Stiefel, Grassmann, and SPD (Symmetric Positive Definite) manifolds, providing explicit formulas widely used in optimization and geometric machine learning. This monograph develops a unified and implementation-oriented treatment of Riemannian geometry for optimization on manifolds. Its main contribution is the systematic organization and detailed derivation of classical geometric constructions in forms directly usable for algorithm design and numerical implementation. By connecting coordinate-level differential geometry with matrix-manifold formulas, the monograph bridges the gap between abstract theory and practical computation, and provides a reference for researchers and practitioners working in Riemannian optimization and related fields. Comments: 143 pages; expository and implementation-oriented monograph with detailed derivations Subjects: Differential Geometry (math.DG); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC) MSC classes: 53B20, 53C20, 53C21, 90C30, 65K10 Cite as: arXiv:2605.02279 [math.DG] (or arXiv:2605.02279v1 [math.DG] for this version) https://doi.org/10.48550/arXiv.2605.02279 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Benyamin Ghojogh [view email] [v1] Mon, 4 May 2026 07:11:11 UTC (6,419 KB)
[LG-164] Measuring Differences between Conditional Distributions using Kernel Embeddings
链接: https://arxiv.org/abs/2605.02260
作者: Peter Moskvichev,Siu Lun Chau,Dino Sejdinovic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Comparing conditional distributions is a fundamental challenge in statistics and machine learning, with applications across a wide range of domains. While proposed methods for measuring discrepancies using kernel embeddings of distributions in a reproducing kernel Hilbert space (RKHS) provide powerful non-parametric techniques, the existing literature remains fragmented and lacks a unified theoretical treatment. This paper addresses this gap by establishing a coherent framework for studying kernel-based methods to measure divergence between conditional distributions through what we refer to as conditional maximum mean discrepancy (CMMD). The CMMD consists of a family of metrics which we call levels, with three special cases each using a different type of RKHS embedding: CMMD _0 (conditional mean operators), CMMD _1 (conditional mean embeddings), and CMMD _2 (joint mean embeddings). We additionally introduce a general level s CMMD, clarifying the required assumptions, and establishing mathematical connections between the levels through the lens of operator-based smoothing. In addition to reviewing previously proposed estimators, we introduce a novel doubly robust estimator for the CMMD that maintains consistency provided at least one of the underlying models is correctly specified. We provide numerical experiments demonstrating that the CMMD effectively captures complex conditional dependencies for statistical testing.
[LG-165] A Parameter-Free First-Order Algorithm for Non-Convex Optimization with tildemkern1mu O(ε-5/3) Global Rate
链接: https://arxiv.org/abs/2605.02127
作者: Sichao Xiong,Sadok Jerad,Coralia Cartis
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We introduce PF-AGD, the first parameter-free, deterministic, accelerated first-order method to achieve O(\epsilon^-5/3\log(1/\epsilon)) oracle complexity bound when minimizing sufficiently smooth, non-convex functions; this is the best-known bound for first-order methods on smooth non-convex objectives. Unlike existing methods possessing this rate that require a priori knowledge of smoothness constants, we use an adaptive backtracking scheme and a gradient-based restart mechanism to estimate local curvature. This yields a practical algorithm that matches best-known theoretical rates. Empirically, PF-AGD outperforms the practical variant of AGD-Until-Guilty (Carmon et al., 2017), as well as other parameter-free variants, and is a viable alternative to nonlinear conjugate gradient methods.
[LG-166] MIRA: A Score for Conditional Distribution Accuracy and Model Comparison
链接: https://arxiv.org/abs/2605.02014
作者: Sammy Sharief,Justine Zeghal,Gabriel Missael Barco,Pablo Lemos,Yashar Hezaveh,Laurence Perreault-Levasseur
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted as a Spotlight Paper at the International Conference on Machine Learning 2026
Abstract:We introduce Mira, a sample-based score for assessing the accuracy of a candidate conditional distribution using only joint samples from the true data-generating process. Relying on the principle that distributions coincide if they assign equal probability mass to all regions, we derive an analytic expression for the Mira statistic, whose average defines the Mira score. This formulation further allows us to compute theoretical reference values and uncertainty estimates when the candidate distribution matches the true one. This framework enables model comparison by quantifying the alignment between the conditional distribution of a candidate model and the true data generating process. Consequently, Mira enables Bayesian model comparison through direct posterior validation, bypassing the challenging evidence computation. We demonstrate its effectiveness across several toy problems and Bayesian inference tasks.
[LG-167] Benchmarking Wireless Representations: High-Dimensional vs. Compressed Embeddings for Efficiency and Robustness
链接: https://arxiv.org/abs/2605.02009
作者: Murilo Batista,Shirin Salehi,Saeed Mashdour,Paul Zheng,Rodrigo C. de Lamare,Anke Schmeink
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Submitted to IEEE GLOBECOM 2026
Abstract:Building on recent advances in representation learning for wireless channels, this work investigates the cost-benefit trade-offs of high-dimensional channel embeddings in practical systems. We benchmark multiple wireless representations: high-dimensional learned embeddings from a wireless foundation model, compact autoencoder-based representations with significantly lower dimensionality, and raw data baselines, evaluating their performance across diverse downstream tasks. We then systematically analyze data efficiency, noise robustness, and computational complexity, explicitly characterizing the resource overhead associated with high-dimensional embeddings. Beyond standard tasks such as line-of-sight/non-line-of-sight (LoS/NLoS) classification and beam selection, we introduce power allocation as a new downstream task. Our results reveal clear trade-offs: while high-dimensional embeddings can perform well in few-shot regimes for certain tasks, they incur substantial latency and parameter overhead. In contrast, compressed latent representations learned by autoencoders demonstrate improved noise robustness and more stable performance across tasks, while significantly reducing computational and transmission costs.
[LG-168] Extrapolation in Statistical Learning with Extreme Value Theory
链接: https://arxiv.org/abs/2605.01909
作者: Sebastian Engelke,Nicola Gnecco,Anne Sabourin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:Extreme value theory provides rigorous theory and statistical tools for extrapolation in machine learning, particularly in settings where traditional methods struggle due to data scarcity in the tails. A broad range of tasks benefit from these advances, including regression and classification beyond the training data, extreme quantile regression, supervised and unsupervised dimension reduction, generative artificial intelligence and anomaly detection. This review synthesizes recent developments in these fields at the intersection of statistical learning and extreme value theory, with a focus on principled methods based on asymptotically motivated representations of the tail of univariate and multivariate distributions. We consider different theoretical frameworks for both asymptotically dependent and independent data and discuss how they translate into efficient statistical methods for extrapolation to extreme regions. By addressing both theoretical and practical aspects, we offer a comprehensive overview of the state-of-the-art in this quickly evolving field, and identify promising directions for future research.
[LG-169] Adaptive Estimation and Inference in Semi-parametric Heterogeneous Clustered Multitask Learning via Neyman Orthogonality ICML2026
链接: https://arxiv.org/abs/2605.01907
作者: Hanxiao Chen,Debarghya Mukherjee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 49 pages, 6 figures. Accepted at ICML 2026
Abstract:We study clustered multitask learning in a semiparametric setting where tasks share a latent cluster structure in their target parameters but exhibit heterogeneous, potentially infinite-dimensional nuisance components. Such heterogeneity poses a major challenge for existing multitask learning methods, which typically rely on aligned feature spaces or homogeneous task structures. To address this challenge, we propose an adaptive fused orthogonal estimator that integrates Neyman-orthogonal losses with data-driven pairwise fusion penalties. Our framework leverages task-specific pilot estimates to calibrate the fusion penalties and combines adaptive aggregation with orthogonalization to mitigate the impact of nuisance-parameter estimation error. Theoretically, we show that the proposed estimator achieves exact recovery of the latent clustering with high probability and attains pooled parametric convergence rates proportional to cluster size. Moreover, we establish asymptotic normality and show that, asymptotically, our estimator matches the performance of an oracle procedure that knows the true clustering in advance. Empirically, we show that the proposed method consistently outperforms strong baselines in various simulation setups. A real-world application to U.S. residential energy consumption demonstrates the effectiveness of our approach in uncovering meaningful regional clustering in electricity price elasticity, showcasing the efficacy of our method.
[LG-170] Stable Blanket with Hidden Variables and Cycles
链接: https://arxiv.org/abs/2605.01856
作者: Hanqing Xiang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 40 pages
Abstract:Stabilized regression aims to identify a set of predictors whose conditional relationship with a response variable remains invariant across different environments. Existing graphical characterizations of the stable blanket are mainly developed for structural causal models (SCMs) without hidden variables or causal cycles. However, latent variables and feedback relationships naturally arise in many applications, and they can change both the Markov blanket and the set of predictors that remain stable under interventions. This paper studies stable blankets in graphical causal models with hidden variables, causal cycles, and both features simultaneously. For models with hidden variables, we use acyclic directed mixed graphs (ADMGs) and m -separation to characterize the Markov blanket and to construct intervention-stable predictor sets. We introduce the notion of an intervened sub-district and use it to describe how interventions may affect districts connected to the response. For models with cycles, we work with directed graphs (DGs) and directed mixed graphs (DMGs) together with \sigma -separation, treating strongly connected components (SCCs) as the basic graphical units. We then combine these ideas to analyze models with both hidden variables and cycles. The main results give graphical characterizations of Markov blankets, stable frontiers, and stable blankets in these generalized settings. In particular, we identify conditions under which the response is conditionally independent of intervention variables given a suitable predictor set, and we describe when such sets are minimal or unique. These results extend the graphical interpretation of stabilized regression beyond acyclic fully observed models.
[LG-171] A Semi-Supervised Kernel Two-Sample Test
链接: https://arxiv.org/abs/2605.01775
作者: Gyumin Lee,Shubhanshu Shekhar,Ilmun Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:We consider the problem of two-sample testing in a semi-supervised setting with abundant unlabeled covariate data. Standard two-sample tests neglect covariate information, which has the potential to significantly boost performance. However, incorporating covariates potentially breaks the exchangeability assumption under the null, which further complicates a calibration procedure. To address these issues, we propose a semi-supervised method that produces a test statistic with asymptotic normality, while effectively integrating additional information from covariates. Our test is straightforward to calibrate due to the asymptotic normality under the null and achieves asymptotic power that is often much higher than existing kernel tests without covariates. Furthermore, we formally show that the proposed method is consistent in power against fixed and local alternatives. Simulations confirm the practical and theoretical strengths of our approach.
[LG-172] Distributional Causal Mediation via Conditional Generative Modeling
链接: https://arxiv.org/abs/2605.01765
作者: Jinlun Zhang,Haoneng Huang,Zishu Zhan,Chunquan Ou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Mediation analysis has traditionally focused on outcome-level summary contrasts, such as mean effects, which may obscure substantial distributional changes induced by complex and nonlinear causal mechanisms. We propose Distributional Causal Mediation Analysis (DCMA), a generative learning framework for identifying and estimating treatment effects on entire outcome distributions transmitted through multiple mediators. DCMA learns conditional generative models for the mediators and the outcome, recovering the relevant conditional distributions from observational data. Leveraging the identification formulas, it reconstructs interventional outcome distributions via Monte Carlo forward simulation by noise resampling, enabling the capture of both classical summary effects and rich distributional contrasts such as energy distance and the Wasserstein distance. Analytical error bounds are derived to decompose how estimation errors in the learned conditional models propagate to the reconstructed interventional outcome distributions. The empirical effectiveness of DCMA is demonstrated through numerical experiments and real-world data applications.
[LG-173] PRCD-MAP: Learning How Much to Trust Imperfect Priors in Causal Discovery
链接: https://arxiv.org/abs/2605.01669
作者: Xihang Shan,Da Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:External priors of unknown reliability create a brittle trade-off in causal discovery: blind trust amplifies errors, blind rejection wastes signal. Real priors are also \emphheterogeneously reliable – physical laws are trustworthy, LLM-suggested edges are speculative – yet existing methods either ignore priors or impose them through globally uniform trust. We propose \textbfPRCD-MAP, a soft prior-consumption layer that assigns \emphper-edge trust to an imperfect prior and uses it to modulate a prior-aware \ell_1 penalty and prior-weighted \ell_2 regularizer in a MAP objective. Trust is calibrated by empirical Bayes on a Laplace-approximated marginal likelihood and propagated along the prior graph by an MLP, so that data-confirmed neighborhoods boost trust and contradictions suppress it. PRCD-MAP enjoys a population-level safety guarantee: it is \varepsilon -safe in expectation over the prior-generation distribution, with \varepsilon = O(d^2/T) – inheriting the oracle convergence rate. When the prior is uninformative, learned trust provably collapses to its floor and the method recovers a no-prior baseline. Empirically, on real CausalTime data PRCD-MAP exploits informative priors when present ( +0.123 AUROC on AQI, +0.043 on Medical over PCMCI+), auto-attenuates on the anonymous-variable Traffic stress test, and retains a lead at d=300 ; against BayesDAG~\citepannadani2023bayesdag – the closest soft-Bayesian baseline – PRCD-MAP wins on every CausalTime dataset under a matched W_0 -only protocol. A four-way ablation isolates each component: EB calibration and MLP trust propagation jointly carry the plurality of the gain, with positive sign on every dataset. Extensions to nonlinear (NAM) and cross-sectional settings show the calibrated-trust principle is setting-agnostic.
[LG-174] Exact Loop Controllers for ReLU Realization of Homogeneous Curve Refinements
链接: https://arxiv.org/abs/2605.01655
作者: Boldsaikhan Bolorkhuu,Tsogtgerel Gantumur
类目: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG)
*备注: 39 pages, 6 figures
Abstract:We study homogeneous refinement operators ((V\gamma)(t)=\sum_j\in\mathbb ZA_j\gamma(Mt-j)), acting on compactly supported continuous piecewise linear curves (\gamma:\mathbb R\to\mathbb R^p), where (M\ge2) and only finitely many matrices (A_j\in\mathbb R^p\times p) are nonzero. We prove that the iterates (V^n\gamma) admit exact ReLU realizations of fixed width and depth (O(n)). The main new ingredient is an exact loop controller for the residual dynamics. Instead of propagating scalar residual surrogates, the construction transports the residual orbit by a forward-exact state on a polygonal loop. Scalar factors and digit selectors are then recovered from this loop state by complementary CPwL readouts. The loop seam is not removed, but its remaining ambiguity is confined to the final readout/selector stage, where it is harmless because the scalar atom is supported away from the seam. This gives a homogeneous (M)-ary vector-valued extension of the scalar binary refinable-function construction with a more geometric controller architecture. We also record crude exponential bounds on the network weights and biases. Affine forcing terms are handled by expanding affine iterates into finite sums of homogeneous iterates, giving exact fixed-width realizations with depth (O(n^2)), and anchored open curves reduce to compactly supported defects with affine anchor mismatch. We also describe homogeneous polygonal generators, including dragon-type examples and a self-intersecting Hilbert-type prototype in arbitrary dimension. The extended version includes stage-dependent forcing, finite-state stacking reductions, and further geometric constructions such as Koch-, Gosper-, Morton-, and connector-based Hilbert-type variants. Comments: 39 pages, 6 figures Subjects: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG) MSC classes: Primary 41A46, Secondary 41A15, 42C40, 68T07 Cite as: arXiv:2605.01655 [math.CA] (or arXiv:2605.01655v1 [math.CA] for this version) https://doi.org/10.48550/arXiv.2605.01655 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-175] Self-Normalized Martingales and Uniform Regret Bounds for Linear Regression
链接: https://arxiv.org/abs/2605.01628
作者: Fan Chen,Jian Qian,Alexander Rakhlin,Nikita Zhivotovskiy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Self-normalized martingale inequalities lie at the heart of confidence ellipsoids for online least squares and, more broadly, many bandit and reinforcement-learning results. Yet existing vector and scalar results typically rely on bounded covariates and an explicit regularization matrix, producing bounds that are \emphnot scale-invariant: although the self-normalized quantity is scale-invariant by definition, its standard upper bounds are not. We characterize when scale-invariant upper bounds on self-normalized martingales are possible. Without further assumptions, we prove that nontrivial scale-invariant bounds exist only in dimension d=1 ; moreover, in d=1 we obtain O(\log T) scale-invariant self-normalized bounds without any assumptions on the covariates. In contrast, for d1 we show that no nontrivial scale-invariant bound can hold in full generality. We then connect this dichotomy to \emphdoubly-uniform regret in online linear regression (i.e., regret bounds that are simultaneously independent of the covariate scale and the comparator norm) and use it to resolve the open question of Gaillard, Gerchinovitz, Huard, and Stoltz, \emph``Uniform regret bounds over \mathbbR^d for the sequential linear regression problem with the square loss’’ (ALT 2019): in d=1 we give an explicit algorithm with O(\log T) doubly-uniform regret, whereas for d1 sublinear doubly-uniform regret is impossible. Finally, under a natural \emphsmoothness condition (bounded Radon–Nikodym derivatives of the conditional covariate laws with respect to a fixed base measure), we recover sublinear regret for d1 without bounded covariates and derive a self-normalized concentration inequality free of the usual regularization penalties, yielding arguably a first natural scale-invariant bound for adaptive, non-i.i.d. vector martingales. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2605.01628 [stat.ML] (or arXiv:2605.01628v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2605.01628 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-176] Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference
链接: https://arxiv.org/abs/2605.01579
作者: Hoang Dang,Luan Pham,Minh Nguyen
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 36 pages, 2 figures
Abstract:Empirical causal claims depend on many analyst decisions, from selecting covariates to choosing estimators. Existing robustness tools summarize how results vary across these choices, but, to the best of our knowledge, do not answer: \textbfHow many analyst decisions must change to reach a specification, which is a set of choices, whose confidence interval (CI) contains zero? We introduce \emphMinimum Specification Perturbation (MSP), the smallest number of changes. MSP is small under the null, grows with effect strength and captures distance-to-falsification information that dispersion-based summaries cannot report; when making decisions under weak effects, an MSP-based rule yields lower false-positive rates than dispersion-based rules. We show that Fragility Index and MSP measure orthogonal vulnerabilities: fragility to influential observations need not imply fragility to specification choices. On the LaLonde benchmark, MSP = 1 implies that one decision change makes the CI contain zero. We further provide exact permutation calibration under randomization and characterize computation, showing tractable cases under additive structure and NP-hardness in general.
[LG-177] Hall-Like Transversal Stress and Sandpile Criticality on Real Production Networks
链接: https://arxiv.org/abs/2605.01561
作者: Diego Vallarino
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注:
Abstract:This paper develops a Hall-Sandpile model of economic instability that combines a Hall-like transversal stress mechanism with sandpile threshold dynamics on a real production-network substrate. In analogy with the physical Hall effect, where exposed flows under an external field generate stress in a transversal direction, we model economic shocks as fields that act on flow-intensive, low-redundancy, low-capacity nodes and produce systemic stress through a multiplicative conversion function. The accumulated stress drives a discrete toppling rule and an avalanche dynamics whose effective activation threshold declines with transversal exposure. The model is calibrated on annual World Input–Output Database (WIOD) production networks for 2000–2014 and simulated on the 2014 substrate (2,283 country–sector nodes) under three alternative propagation normalisations to avoid mechanical near-criticality from row-stochastic operators. Controlled Monte Carlo experiments over external field intensity and redundancy stress generate four ordered regimes: stable absorption, latent fragility, critical transition, and avalanche regime. Mean avalanche size and the probabilities of finite-size systemic events \Pr(S!\geq!5) , \Pr(S!\geq!10) and \Pr(S!\geq!20) rise jointly with field intensity and redundancy stress. Tail diagnostics show regime-dependent thickening of the avalanche distribution, but the estimated tail indices remain too high to interpret as evidence of universal power-law criticality. The contribution is therefore a finite-size, real-network description of how transversal stress activates structural fragility, not a claim of self-organised criticality in the global economy.
[LG-178] Stabilizing Private LASSO under Heterogeneous Covariates via Anisotropic Objective Perturbation
链接: https://arxiv.org/abs/2605.01492
作者: Haruka Tanzawa,Ayaka Sakata
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures
Abstract:We study high-dimensional LASSO under differential privacy via objective perturbation with heterogeneous covariate scales. In practical scenarios, covariates often exhibit diverse scales; however, standard preprocessing is problematic under privacy constraints, as it consumes additional privacy budget. This heterogeneity induces effective anisotropy in the objective perturbation via the inverse Gram matrix of covariates, which can degrade the stability and accuracy of algorithms. To address this, we propose a Gram-based anisotropic objective perturbation, a ``pre-distortion" strategy that counteracts the distortion from the covariate structure to restore isotropy in the estimation process. Using an Approximate Message Passing (AMP) framework and state evolution analysis, we demonstrate that our proposed perturbation significantly stabilizes convergence and improves both statistical efficiency and privacy performance compared to standard uniform noise injection. Our results provide theoretical insights into designing stable and efficient private estimators without relying on data-dependent preprocessing.
[LG-179] Stable Localized Conformal Prediction via Transduction
链接: https://arxiv.org/abs/2605.01452
作者: Yinjie Min,Liuhua Peng,Changliang Zou
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:Existing evaluations of conformal prediction, such as prediction efficiency and test-conditional coverage, are defined in expectation over the calibration data. In practice, when only one calibration set of limited size is available, prediction sets often exhibit high variability in size, especially for methods with localization. We formalize this concern as set stability, defined as the variance of the conditional expectation of the set size given the calibration data. To improve stability without requiring additional target-task labels, we propose Stable Conformal Prediction (StCP), a transfer learning approach that utilizes labeled source-task data and unlabeled target data. Theoretically, we characterize the marginal coverage and stability of StCP; empirically, it delivers more stable prediction sets than standard conformal prediction methods, especially for those with localization, when calibration data are limited.
[LG-180] From Characterization To Construction: Generative Quantum Circuit Synthesis from Gate Set Tomography Data
链接: https://arxiv.org/abs/2605.01367
作者: King Yiu Yu,Aritra Sarkar,Erbing Hua,Maximilian Rimbach-Russ,Ryoichi Ishihara,Sebastian Feld
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 19 pages, 3 figures
Abstract:High-fidelity circuit execution on noisy intermediate-scale quantum devices is bottlenecked by compilation pipelines that disregard complex, correlated noise. To address this, this methodology article proposes a quantum machine learning control (QMLC) framework for generative quantum circuit synthesis from gate-set tomography (GST) data that bypasses the traditional two-step pipeline of characterizing native quantum gates via GST followed by unitary decomposition algorithms. Instead, a generative concept space is directly learnt from GST data, enabling conditional synthesis of quantum circuits on a desired output distribution. Our approach tokenizes GST germ circuits and embeds them into a structured latent space using a curriculum-learning-motivated strategy, starting with short circuits and progressively incorporating longer ones with diverse output statistics. The embedded sequences are processed by a set-vision transformer with permutation-invariant pooling, producing k-seed vectors that represent the learned concept space of the quantum device. Aggregating data across multiple circuits makes this latent representation inherently context-aware, capturing the shared physical noise environment (e.g., crosstalk, drift) that isolated gate metrics miss. We propose an unconditional diffusion model to sample from the concept space. During inference, a user provides a target measurement distribution, and the model generates a corresponding circuit. To ensure fidelity and robustness, the output is denoised using a diffusion model that operates on the target conditional covariance matrix. This end-to-end framework is a step towards context-aware, hardware-native circuit synthesis directly from raw GST data, which offers a new paradigm for integrating quantum control and compilation. The QMLC framework is particularly suited for near-term quantum devices with complex calibration procedures.
[LG-181] Data-Driven Geometry-Aware Optimal-Transport Calibration of Flavor Tagger
链接: https://arxiv.org/abs/2605.01363
作者: Yeonjoon Kim,Un-ki Yang
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Methodology (stat.ME)
*备注: 32 Pages, 12 Figures
Abstract:Flavor-tagging calibrations are often provided either as scale factors measured at a finite set of working points or as binned corrections to a chosen one-dimensional discriminant. However, this approach falls short of providing continuous, event-level calibration across the full multicomponent outputs of modern taggers. This limitation leads to information loss in analyses that demand high-performance flavor tagging, restricting analyses to a limited set of predefined variables. In this work, we propose a geometry-aware framework that formulates flavor-tagger calibration as an optimal transport problem on the probability simplex. The transport maps are parameterized and trained in the isometric log-ratio coordinate system. Because the quadratic Euclidean cost of Brenier transport in this coordinate system is equivalent to the Aitchison distance on the simplex, the learned map induces a minimal deformation under the Aitchison geometry. Furthermore, we extract flavor-conditional target distributions directly from control-region data using an expectation-maximization (EM) technique that simultaneously fits multiple control regions, models each flavor component with a normalizing flow, and estimates the regional mixture fractions. The extracted targets are subsequently used to learn flavor-factorized transport maps. Because the joint estimation of mixture fractions and flexible component densities admits weakly constrained directions, we further introduce a linearized feedback-operator analysis that propagates the fitted composition covariance into the extracted component densities, separating data-constrained modes from those dominated by the composition prior. The simulation-based closure study demonstrates improved closure in dedicated control regions and in independent validation mixtures. Comments: 32 Pages, 12 Figures Subjects: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Methodology (stat.ME) Cite as: arXiv:2605.01363 [hep-ex] (or arXiv:2605.01363v1 [hep-ex] for this version) https://doi.org/10.48550/arXiv.2605.01363 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-182] Mean Testing under Truncation beyond Gaussian
链接: https://arxiv.org/abs/2605.01335
作者: Yuhao Wang,Roberto Imbuzeiro Oliveira,Themis Gouleakis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We characterize the fundamental limits of high-dimensional mean testing under arbitrary truncation, where samples are drawn from the conditional distribution P(\cdot \mid S) for an unknown truncation set S that may hide up to an \varepsilon -fraction of the probability mass. For distributions with p -th directional moments of magnitude at most \nu_P,p , truncation induces a bias of order O(\nu_P,p\varepsilon^1-1/p) . This bias creates a sharp information-theoretic detectability floor: when the signal \alpha falls below this threshold, the null and alternative hypotheses are indistinguishable even with infinite data. Above this floor, we prove that a simple second-order test achieving near-optimal sample complexity n = O!\left(\frac|\Sigma_P|(\alpha-4\nu_P,p\varepsilon^1-1/p)^2\sqrtd\right) . We further identify a structural escape from this finite-moment bias barrier. Under a directional median regularity assumption, truncation bias improves to linear order O(\varepsilon) . This reveals an intermediate regime in which estimation requires \Theta(d) samples for uniform recovery, while testing recovers the classical \Theta(\sqrt d) rate once truncation bias is eliminated. Together, our results provide a unified framework for mean testing under truncation, connecting finite-moment, sub-Gaussian, and median-regular structural regimes.
[LG-183] Barren Plateaus as Destructive Interference: A Diagnostic Framework and Implications for Structured Ansatzes
链接: https://arxiv.org/abs/2605.01319
作者: Pilsung Kang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Barren plateaus (BPs) are usually described by the exponential suppression of gradient variance, but the mechanism by which gradient signal disappears remains unclear. We show that this phenomenon can be understood as destructive interference among termwise gradient contributions. To make this perspective operational, we introduce a diagnostic framework based on the cancellation ratio R_k , the effective term count N_\mathrmeff,k , and the interference-quality measure B_\mathrmeff,k=R_k\sqrtN_\mathrmeff,k . Under a random-sign model, B_\mathrmeff,k remains near a stable baseline, defining a random-sign cancellation regime. For the transverse-field Ising model (TFIM), we find that the hardware-efficient ansatz (HEA) remains close to this regime across system sizes and depths, whereas the Hamiltonian variational ansatz (HVA) systematically escapes it. In particular, HVA exhibits larger B_\mathrmeff,k not merely because N_\mathrmeff,k is larger, but because R_k also remains systematically larger despite the broader term participation. This pattern indicates improved sign organization rather than simple term suppression. We further establish an exact identity that connects the proposed interference diagnostics directly to the standard variance-based theory of BPs. These results position destructive interference as a mechanistic interpretation of BP-like behavior in the regimes studied here, but they do not imply that BPs and destructive interference are universally interchangeable across all architectures and settings.
[LG-184] Machine Learning Enhanced Laser Spectroscopy for Multi-Species Gas Detection in Complex and Harsh Environments
链接: https://arxiv.org/abs/2605.01306
作者: Mohamed Sy
类目: Optics (physics.optics); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: PhD thesis
Abstract:Laser absorption spectroscopy (LAS) is a well-established technique for non-intrusive measurement of gas species in combustion and atmospheric environments, but conventional methods struggle with multi-species mixtures under dynamic or interference-laden conditions. Overlapping spectral features, noise, and incomplete reference data limit reliability when unknown or weakly absorbing species are present. This dissertation develops diagnostics combining LAS with machine learning (ML) to address these limitations. Deep denoising autoencoders (DDAEs) are applied to shock-tube measurements during high-speed hydrocarbon pyrolysis, improving signal fidelity and detection limits for trace species. A structured unsupervised framework, HT-SIMNet, then mitigates interference from unknown species without full calibration data, using spectral augmentation and a Noise2Noise-inspired scheme to isolate species in reactive systems. Where reference spectra are unavailable, UnblindMix, an autoencoder-based blind source separation method, reconstructs concentrations and spectral signatures directly from mixture data, validated on mixtures of up to eight components. To recover weakly absorbing species masked by broader absorbers, a feature-engineering method based on first derivatives and convolutions selectively highlights minor species. Finally, VOC-certifire combines randomized smoothing with Voigt-based spectral perturbation to provide certifiable classification of volatile organic compounds under varying conditions. All techniques are experimentally validated and benchmarked. The integration of spectroscopic hardware with ML offers a path toward real-time, interference-resilient, reference-free gas detection for combustion science, environmental monitoring, and industrial safety.
[LG-185] Reconstructing conformal field theoretical compositions with Transformers
链接: https://arxiv.org/abs/2605.01072
作者: Haotian Cao,Garrett Merz,Kyle Cranmer,Gary Shiu
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注:
Abstract:We study the use of transformers to reconstruct the compositions of tensor products of two-dimensional rational conformal field theories (RCFTs) based on their low-energy spectra. The task is challenging due to its combinatorial nature. The constituent theories are characterized by their central charges and affine Lie algebra labels. We achieve 98% accuracy in recovering the constituents of tensor products theories constructed from Wess-Zumino-Witten models. We further demonstrate that our method generalizes to CFTs with larger central charge and unseen classes of RCFTs by adding a small number of out-of-domain examples. Our results show that transformers are effective at this task and point towards a new tool for bulk reconstruction in AdS/CFT.
[LG-186] Pi-Change: A Prior-Informed Multiple Change Point Detection Algorithm
链接: https://arxiv.org/abs/2605.01003
作者: Jonathon Jacobs,Shanshan Chen
类目: Methodology (stat.ME); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Statistical change point (CP) detection methods typically rely on likelihood-based inference and ignore contextual information about plausible CP locations beyond the observed sequence. Although informative priors provide a natural way to incorporate such information, general and computationally efficient methods for doing so are lacking, especially for multiple CP detection. To address this gap, we propose a prior-informed CP detection algorithm (Pi-Change) that incorporates prior information on CP locations through a time-varying penalty term. We prove that the proposed penalty can be embedded in the Pruned Exact Linear Time framework while preserving the dynamic programming recursion and pruning rule required for efficient multiple CP detection. Across simulation studies and three time-series applications, Pi-Change discourages spurious CPs unsupported by prior information, remains robust to prior misspecification, and improves detection accuracy. More broadly, Pi-Change extends multiple CP detection beyond purely data-driven fitting by incorporating partial prior knowledge in a computationally efficient and interpretable way. It is particularly useful when CPs arise from heterogeneous mechanisms or are associated with known external events, helping quantify the delay between an event and the resulting structural change.
[LG-187] Equation-Free Digital Twins for Nonlinear Structural Dynamics
链接: https://arxiv.org/abs/2605.00950
作者: Mohammad Mahdi Abaei,Ahmad BahooToroody,Arttu Polojärvi,Heikki Remes,Ulf Tyge Tygesen,Mikko Suominen,Michael Beer
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Monitoring high-dimensional engineering structures in extreme environments is limited by non-stationary excitation, nonlinear structural kinematics, and stochastic forcing. Traditional model-based and black-box data-driven methods often struggle to resolve these dynamics in real time, particularly under sensor failure or partial observability. This paper introduces a rank-optimized digital twin framework based on Koopman operator theory, Hankel-matrix embeddings, and dynamic mode decomposition. By lifting operational data into a linear invariant subspace, the method enables autonomous, input-blind reconstruction of structural states without requiring a priori mass or stiffness matrices. The framework is validated on an NREL 5MW spar-buoy floating offshore wind turbine, representing a challenging coupled aero-hydro-servo-elastic system. Results show that the rank-optimized Koopman-Hankel manifold separates structural resonances from deterministic 3P rotor harmonics under colored noise, where standard subspace identification can be unreliable. A rolling-horizon virtual sensing strategy achieves high-fidelity reconstruction at critical structural hotspots, with coefficient of determination greater than 0.95 at 1 Hz data assimilation and accuracy exceeding 0.99 at higher sampling rates. By estimating a physical Lyapunov time of approximately 1.0 s, the study defines the predictability horizon associated with the system information barrier. The proposed framework provides a computationally efficient and resilient digital twin approach for real-time identification and virtual sensing of complex structural dynamics. Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY) Cite as: arXiv:2605.00950 [eess.SP] (or arXiv:2605.00950v1 [eess.SP] for this version) https://doi.org/10.48550/arXiv.2605.00950 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-188] An ALE-Consistent Graph Neural Operator-Transformer Framework for Fluid-Structure Interaction
链接: https://arxiv.org/abs/2605.00937
作者: Shihang Zhao,Martín Saravia,Haokui Jiang,Zhiyang Xue,Shunxiang Cao
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 29 pages, 20 figures
Abstract:We propose an arbitrary Lagrangian-Eulerian (ALE)-consistent machine learning framework for long-term fluid-structure interaction (FSI) prediction on deforming unstructured meshes. Specifically, the fluid dynamics are modeled by a surrogate that combines a graph neural operator (GNO) with a vision Transformer (ViT) for spatiotemporal prediction, while a lightweight long short-term memory (LSTM) network predicts structural kinematics at the interface. The two surrogates are coupled through a standard partitioned procedure. Most importantly, kinematic compatibility at the moving interface is enforced via an ALE-consistent boundary-correction step that updates the fluid-side interface velocity with the predicted structural velocity at each coupling update, thereby improving near-interface accuracy and long-term rollout stability. To mitigate autoregressive error accumulation, a two-stage training strategy is adopted, consisting of single-step supervised pretraining followed by long-term autoregressive fine-tuning. The proposed framework is validated on the benchmark problem of a flexible beam vibration in the wake of a cylinder. Results demonstrate accurate phase-consistent predictions over long rollouts and robust generalization under inlet-profile variations in both interpolation and extrapolation settings. Systematic ablation studies further assess the respective contributions of the ViT module, ALE-consistent boundary correction, and long-term training to predictive accuracy and rollout robustness.
[LG-189] A Deep Learning Model for Battery State Prediction towards Intelligent Energy Management
链接: https://arxiv.org/abs/2605.00898
作者: Athanasios Koukosiasa,Vasileios Tzanidakis,Sotiris Athanasiou,Kostas Kolomvatsos
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 11 pages, 11 figures, Journal
Abstract:Accurate forecasting of battery health indicators, including remaining capacity and lifetime, is of paramount importance for ensuring the reliability, safety, and operational efficiency of applications such as electric vehicles and large scale energy storage infrastructures. The result of the forecasting can be adopted to build an advanced monitoring mechanism for continuous checking batteries’ health status to assist in the efficient real-time management of numerous applications. This research investigates the development and implementation of a Deep Learning (DL) model for the prediction of the future state and performance of industrial electrochemical energy storage systems. To address this challenge, we propose a dedicated computational framework that integrates advanced neural network architectures with large-scale training datasets, enabling precise modeling of batteries degradation dynamics and operational trends. The proposed approach provides a decision support mechanism for the optimal management of batteries facilitating both predictive maintenance and the efficient allocation of energy resources. Our findings highlight the potential of DL-based predictive modeling to significantly contribute to the advancement of sustainable and intelligent energy management systems.
[LG-190] Autonomous Reliability Qualification of Ga_2O_3-based Hydrogen and Temperature Sensors via Safe Active Learning
链接: https://arxiv.org/abs/2605.00868
作者: Davi Febba,William A. Callahan,Anna Sacchi,Andriy Zakutayev
类目: Applied Physics (physics.app-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:We present a Safe Active Learning (SAL) framework for autonomous reliability characterization of rectifying Ga _2 O _3 -based devices under coupled thermal and hydrogen stress. SAL treats rectification as a device-physics-motivated safety observable and models its evolution over elapsed time, temperature, and H _2 concentration using a Gaussian-process surrogate. To handle condition-dependent and uncertain experiment durations, the method combines an adaptive completion-time window, time-window lower-confidence-bound safety checks, a trust region anchored to previously verified safe conditions, and a two-phase strategy that transitions from conservative safe exploration to progressively relaxed rectification targets as the device degrades. We first evaluate SAL in simulation, where it safely expands the explored region while learning the evolving rectification surface. We then demonstrate SAL experimentally on an automated high-temperature probe-station platform using a Pt/Cr _2 O _3 :Mg/ \beta -Ga _2 O _3 device. In the reported campaign, phase 1 incurred only one unsafe measurement associated with spurious current-voltage sweeps, while phase 2 intentionally probed lower-rectification regimes. Finally, we use the curated SAL dataset for offline long-horizon forecasting of device response at a target voltage using a structured Gaussian-process model with a condition-dependent Kohlrausch–Williams–Watts mean and a residual covariance kernel. The model captures long-time, saturating degradation trends in an auxiliary validation dataset, illustrating how safety-aware autonomous experimentation enables both conservative characterization and subsequent degradation modeling. Although demonstrated for a rectifying Ga _2 O _3 device, SAL is applicable to other systems where a measurable in situ safety observable can be defined.
[LG-191] An Adaptive Spatiotemporal Clustering Framework for 3D Ocean Subsurface Temperature Reconstruction
链接: https://arxiv.org/abs/2605.00860
作者: Ming Shan Loo,Wengen Li,Xudong Jiang,Hailiang Cheng,Zhifei Zhang,Jihong Guan,Yichao Zhang
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:The reconstruction of ocean subsurface temperature (OST) using satellite remote sensing data holds significant scientific value for advancing the understanding of ocean dynamics and climate variability. However, the scarcity of subsurface observations, combined with the high degree of nonlinearity and spatiotemporal heterogeneity in subsurface processes, poses substantial challenges to the accuracy and generalization capability of traditional reconstruction methods. To address these limitations, this study proposes an adaptive framework that could capture both vertical structural dependencies and temporal variation patterns of OST via spatio-temporal clustering. By incorporating this framework with various deep learning models, e.g., dual-path convolutional neural networks (DP-CNN), Attention U-Net, and Vision Transformer (ViT), the OST field can be accurately reconstructed at a global scale only using surface observations, i.e., sea surface temperature (SST), sea surface salinity (SSS), sea surface height (SSH), and sea surface wind (SSW). Experimental results demonstrate that multiple deep learning methods using the proposed framework largely outperform their original counterparts, yielding improvements in RMSE ranging from 12.4% to 27.2%. This study provides a reliable solution for subsurface temperature reconstruction, offering important implications for meteorological modeling and climate change assessment.
[LG-192] A Hybrid Windkessel-Neural Approach for Improved Noninvasive Blood Pressure Monitoring
链接: https://arxiv.org/abs/2605.00858
作者: Vaibhav Gollapalli,Aniruth Ananthanarayanan
类目: ignal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Owing to the recent advancements in wearable devices for health care, the importance of BP estimation without cuffs increases. Cuff technologies are inappropriate for continuous BP measurement due to their inconvenient usage, invasive character, necessity of calibration, large size, and inability to perform long-term monitoring. Normally, the algorithm used for cuffless BP prediction employs machine learning models that operate according to the data-driven approach. However, although they show high numerical accuracy, ML models do not provide any interpretability, resulting in poor physiological validity and clinical applicability. We propose a combination of Windkessel and ML models that incorporates the physical aspects into the latter one. It is performed by reformulating Windkessel into a form that will allow employing ML models. The result is a system of ODEs which can be used in the neural network. Thus, the inclusion of physical constraints improves the data-driven approach by making models consistent with physics, understandable, and robust. For illustration, we apply the described technique using a publicly available MIMIC-II database that we obtain from the UCI Machine Learning Repository.
[LG-193] An Efficient Spatial Branch-and-Bound Algorithm for Global Optimization of Gaussian Process Posterior Mean Functions
链接: https://arxiv.org/abs/2605.00855
作者: Wei-Ting Tang,Akshay Kudva,Calvin Tsay,Joel A. Paulson
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the deterministic global optimization of trained Gaussian process posterior mean functions over hyperrectangular domains. Although the posterior mean function has a compact closed-form representation, its global optimization is challenging because it remains nonlinear and nonconvex. Existing exact deterministic approaches become increasingly difficult to scale as the number of training data points grows, leading to approximation-based methods that improve tractability by optimizing a modified (inexact) objective. In this work, we propose PALM-Mean, a piecewise-analytic lower-bounding framework embedded in reduced-space spatial branch-and-bound. At each node, kernel terms that are locally important are replaced by a sign-aware piecewise-linear relaxation in an appropriate scalar distance variable, while the remaining terms are bounded analytically in closed form. We show this hybrid approach yields a valid lower bound for the posterior mean, while limiting the size of the branch-and-bound subproblems. We establish validity of the node lower bounds and \varepsilon -global convergence of the resulting algorithm. Computational results on synthetic benchmarks and real-world application problems show that PALM-Mean improves scalability relative to representative general-purpose deterministic global solvers, particularly as the number of training data points increases.
[LG-194] Deep Learning for Multi-Antenna Modulation Recognition of Radio Signals
链接: https://arxiv.org/abs/2605.00849
作者: Tao Chen,Shilian Zheng,Jiepeng Chen,Zhangbin Pei,Qi Xuan,Xiaoniu Yang
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Multi-antenna receiving systems have become a prevalent technical solution in communication systems. Meanwhile, deep learning has achieved significant progress in automatic modulation recognition tasks in single-antenna systems. However, the application of deep learning in multi-antenna modulation recognition (MAMR) tasks is still limited. In this paper, we propose an MAMR method namely MAMR-IQ to fully explore the diversity gain of a multi-antenna receiving system, which concatenates the raw received in-phase and quadrature (IQ) signals of multiple antennas and feeds them into a convolutional neural network. Simulation results show that the proposed MAMR-IQ method outperforms two existing deep learning-based MAMR methods which are based on direct voting (DV) and weight average (WA) in terms of both recognition accuracy and computational complexity. To address the problem of limited training data in few-shot scenarios, we further propose a data augmentation method that involves exchanging IQ sequences received by any two antennas to generate augmented samples. Simulation results show that with the proposed augmentation method, the recognition accuracy can be further improved.
附件下载


