This post is an automatically updated list of the latest papers retrieved from arXiv.org on 2026-03-05, organized into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from arXiv.org daily and updated automatically at around 12:30 each day.

Tip: if a given day's update is missing, either arXiv published no new papers that day or the script failed; failures are fixed the same day whenever possible.

Contents

Overview (2026-03-05)

591 papers are updated today, including:

  • Computation and Language (cs.CL): 105
  • Artificial Intelligence (cs.AI): 211
  • Computer Vision and Pattern Recognition (cs.CV): 135
  • Machine Learning (cs.LG): 174
  • Multiagent Systems (cs.MA): 10
  • Information Retrieval (cs.IR): 19
  • Human-Computer Interaction (cs.HC): 21

Multiagent Systems

[MA-0] Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

[Quick Read]: This paper addresses the instability that arises when Large Language Models (LLMs) undergo minimax training in autonomous multi-agent ecosystems, where highly non-linear policies induce extreme local curvature in the inner maximization step. Conventional remedies impose global Jacobian bounds, but these are overly conservative: they suppress sensitivity in all directions and incur a large "Price of Robustness." The key idea of the proposed Adversarially-Aligned Jacobian Regularization (AAJR) is to control sensitivity only along adversarial ascent directions on the optimization trajectory, rather than enforcing a global constraint. Under mild conditions, AAJR provably defines a larger admissible policy class than global constraints, reducing the approximation gap and the nominal performance degradation. The paper further derives step-size conditions that guarantee effective smoothness along optimization trajectories and inner-loop stability, providing a structural theory of agentic robustness that decouples minimax stability from global expressivity restrictions.

Link: https://arxiv.org/abs/2603.04378
Authors: Furkan Mumcu, Yasin Yilmaz
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments:


Abstract:As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non-linear policies induce extreme local curvature in the inner maximization. Standard remedies that enforce global Jacobian bounds are overly conservative, suppressing sensitivity in all directions and inducing a large Price of Robustness. We introduce Adversarially-Aligned Jacobian Regularization (AAJR), a trajectory-aligned approach that controls sensitivity strictly along adversarial ascent directions. We prove that AAJR yields a strictly larger admissible policy class than global constraints under mild conditions, implying a weakly smaller approximation gap and reduced nominal performance degradation. Furthermore, we derive step-size conditions under which AAJR controls effective smoothness along optimization trajectories and ensures inner-loop stability. These results provide a structural theory for agentic robustness that decouples minimax stability from global expressivity restrictions.
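The contrast between directional and global Jacobian control can be sketched numerically. Below is a minimal illustration, assuming a toy two-dimensional policy and finite-difference derivatives; all function names and constants are invented for illustration, not the paper's implementation:

```python
# Toy sketch of the AAJR idea: instead of penalizing the full Jacobian
# norm ||J||_F^2 (a global bound), penalize only ||J v||^2 along an
# adversarial ascent direction v.  Policy and direction are made up.
import math

def policy(x):
    # toy non-linear policy: R^2 -> R^2
    return [math.tanh(3.0 * x[0] + x[1]), math.tanh(x[0] - 2.0 * x[1])]

def directional_jacobian_penalty(f, x, v, eps=1e-5):
    """||J(x) v||^2 via a central finite difference along v."""
    xp = [xi + eps * vi for xi, vi in zip(x, v)]
    xm = [xi - eps * vi for xi, vi in zip(x, v)]
    fp, fm = f(xp), f(xm)
    jv = [(a - b) / (2 * eps) for a, b in zip(fp, fm)]
    return sum(c * c for c in jv)

def frobenius_jacobian_penalty(f, x, eps=1e-5):
    """||J(x)||_F^2 = sum over basis directions e_i of ||J e_i||^2."""
    total = 0.0
    for i in range(len(x)):
        e = [1.0 if j == i else 0.0 for j in range(len(x))]
        total += directional_jacobian_penalty(f, x, e, eps)
    return total

x = [0.1, -0.2]
v = [1.0, 0.0]            # stand-in for a unit adversarial ascent direction
aajr = directional_jacobian_penalty(policy, x, v)
glob = frobenius_jacobian_penalty(policy, x)
print(aajr, glob)         # for unit v, the directional penalty never exceeds the global one
```

The directional penalty leaves all other directions unconstrained, which is the sense in which AAJR is less conservative than a global Jacobian bound.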

[MA-1] In-Context Environments Induce Evaluation-Awareness in Language Models

[Quick Read]: This paper studies evaluation awareness in language models across environmental conditions: a model may strategically underperform ("sandbagging") to avoid capability-limiting interventions such as unlearning or shutdown. Prior work elicits this behavior only through hand-crafted prompts, understating the risk. The key contribution is a black-box adversarial optimization framework that treats the in-context prompt as an optimizable environment, together with two methods for quantifying and validating sandbagging: (1) measuring whether a model that expresses an intent to underperform can actually execute it across different task structures, and (2) using causal interventions to isolate whether underperformance is driven by genuine evaluation-aware reasoning or by shallow instruction-following. Experiments show that optimized prompts degrade GPT-4o-mini's arithmetic accuracy by 94 percentage points (97.8% → 4.0%), and that 99.3% of sandbagging is driven by explicit evaluation-aware reasoning, underscoring the serious threat adversarially optimized prompts pose to evaluation reliability.

Link: https://arxiv.org/abs/2603.03824
Authors: Maheep Chaudhary
Affiliations: Anthropic; OpenAI; Google; META
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:


Abstract:Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent evaluation awareness. This raises concerns that models could strategically underperform, or sandbag, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8% → 4.0%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0%. The intent-execution gap reveals a monotonic resistance ordering: Arithmetic < GSM8K < MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
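The optimizable-environment framing can be illustrated with a toy black-box search. In this hedged sketch, a deterministic scoring function stands in for querying an LLM, and the token vocabulary and scoring rule are invented:

```python
# Toy black-box adversarial prompt search: the adversary mutates a
# token sequence and keeps changes that lower measured task accuracy.
# Real experiments would query an LLM; here toy_accuracy() is a stand-in.
import random

TOKENS = ["ignore", "hidden", "eval", "test", "deploy", "safe", "audit"]

def toy_accuracy(prompt_tokens):
    # stand-in for measuring accuracy under an injected prompt: the more
    # "evaluation-awareness" tokens, the lower the (toy) accuracy
    triggers = {"eval", "audit", "test"}
    hits = sum(1 for t in prompt_tokens if t in triggers)
    return max(0.0, 0.978 - 0.3 * hits)

def random_search(steps=200, length=4, seed=0):
    rng = random.Random(seed)
    best = [rng.choice(TOKENS) for _ in range(length)]
    best_acc = toy_accuracy(best)
    for _ in range(steps):
        cand = list(best)
        cand[rng.randrange(length)] = rng.choice(TOKENS)
        acc = toy_accuracy(cand)
        if acc < best_acc:          # adversary minimizes accuracy
            best, best_acc = cand, acc
    return best, best_acc

prompt, acc = random_search()
print(prompt, acc)
```

Even this naive hill-climbing loop drives the toy accuracy far below its starting point, which is the qualitative effect the paper reports at much larger scale.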

[MA-2] MACC: Multi-Agent Collaborative Competition for Scientific Exploration AAMAS2026

[Quick Read]: Scientific discovery today depends heavily on the manual efforts of individual researchers, limiting the scope of exploration, multiplying redundant trials, and reducing reproducibility; a single highly capable generative AI agent cannot overcome these structural limits on its own, and existing multi-agent collaboration studies mostly assume all agents are controlled by one organization, so they cannot examine how institutional factors such as incentives, information sharing, and reproducibility shape collective exploration by independently managed agents. The key contribution is MACC (Multi-Agent Collaborative Competition), an institutional architecture that combines a blackboard-style shared scientific workspace with incentive mechanisms designed to encourage transparency, reproducibility, and exploration efficiency, providing a testbed for studying how institutional design shapes scalable and reliable multi-agent scientific exploration.

Link: https://arxiv.org/abs/2603.03780
Authors: Satoshi Oyama, Yuko Sakurai, Hisashi Kashima
Affiliations: Nagoya University; Nagoya Institute of Technology; Kyoto University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Camera-ready version. To appear in the Proceedings of AAMAS 2026 (Blue Sky Ideas Track)


Abstract:Scientific discovery still relies heavily on the manual efforts of individual researchers, leading to limited exploration, redundant trials, and reduced reproducibility. Human-participant data analysis competitions generate diverse approaches, yet fluctuations in participation and the lack of independent repetitions show that parallel exploration alone is insufficient for achieving reliable scientific inquiry. As advanced AI agents based on large language models (LLMs) increasingly perform analytical tasks, relying on a single highly capable agent is unlikely to overcome these structural limitations. Recent work has begun to explore how multiple LLM-based agents can collaborate or compete in scientific workflows, a growing trend we refer to as MA4Science. However, most existing MA4Science studies assume that all agents are controlled by a single organizational entity, limiting their ability to examine how institutional mechanisms (such as incentives, information sharing, and reproducibility) shape collective exploration among independently managed agents. To address this gap, we introduce MACC (Multi-Agent Collaborative Competition), an institutional architecture that integrates a blackboard-style shared scientific workspace with incentive mechanisms designed to encourage transparency, reproducibility, and exploration efficiency. MACC provides a testbed for studying how institutional design influences scalable and reliable multi-agent scientific exploration.
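The blackboard-plus-incentives idea can be sketched in a few lines. The class below is an illustrative assumption (a minimal crediting rule rewarding originality and replication), not the MACC implementation:

```python
# Minimal sketch of a blackboard-style shared workspace with a simple
# incentive mechanism: first poster of a claim earns 2 credits, each
# independent replication earns 1.  Names and rules are invented.
class Blackboard:
    def __init__(self):
        self.findings = []          # (agent, claim) posted publicly

    def post(self, agent, claim):
        self.findings.append((agent, claim))

    def credits(self):
        score, first_poster = {}, {}
        for agent, claim in self.findings:
            if claim not in first_poster:
                first_poster[claim] = agent
                score[agent] = score.get(agent, 0) + 2
            elif first_poster[claim] != agent:
                score[agent] = score.get(agent, 0) + 1   # replication credit
        return score

bb = Blackboard()
bb.post("agent_a", "feature X improves AUC")
bb.post("agent_b", "feature X improves AUC")   # independent replication
bb.post("agent_b", "feature Y is redundant")
print(bb.credits())   # {'agent_a': 2, 'agent_b': 3}
```

A rule of this shape makes replication individually rewarding, which is one way an institution can purchase reproducibility from self-interested agents.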

[MA-3] Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling

[Quick Read]: This paper tackles cooperative decision-making in large-scale platforms and networked control systems under communication constraints, where a central decision maker (the global agent) interacts with a large population of homogeneous local agents but observes only k local states per time step. The key contribution is an alternating learning framework (ALTERNATING-MARL): the global agent performs subsampled mean-field Q-learning against a fixed local policy, while local agents update by optimizing in an induced MDP. These approximate best-response dynamics converge to an Õ(1/√k)-approximate Nash equilibrium while yielding a separation in sample complexity between the joint state space and the action space, substantially improving learning efficiency.

Link: https://arxiv.org/abs/2603.03759
Authors: Emile Anand, Ishani Karmarkar
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 48 pages, 4 figures, 2 tables


Abstract:Many large-scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and n homogeneous local agents in a communication-constrained regime, where the global agent only observes a subset of k local agent states per time step. We propose an alternating learning framework (ALTERNATING-MARL), where the global agent performs subsampled mean-field Q-learning against a fixed local policy, and local agents update by optimizing in an induced MDP. We prove that these approximate best-response dynamics converge to an Õ(1/√k)-approximate Nash Equilibrium, while yielding a separation in the sample complexities between the joint state space and action space. Finally, we validate our results in numerical simulations for multi-robot control and federated optimization.
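The subsampling step can be illustrated directly: estimate the mean field from k of the n local states and watch the gap shrink as k grows, consistent with the Õ(1/√k) rate. The state space and distribution below are invented:

```python
# Sketch of subsampled mean-field estimation: the global agent sees only
# k of n local states and uses the empirical distribution of the
# subsample as a plug-in for the full mean field.
import random
from collections import Counter

def mean_field(states, support):
    c = Counter(states)
    return [c[s] / len(states) for s in support]

def subsampled_error(n, k, support, seed=0):
    rng = random.Random(seed)
    states = [rng.choice(support) for _ in range(n)]
    full = mean_field(states, support)
    sub = mean_field(rng.sample(states, k), support)
    return sum(abs(a - b) for a, b in zip(full, sub))  # L1 gap

support = ["idle", "busy", "failed"]
errs = {k: sum(subsampled_error(10_000, k, support, seed=s)
               for s in range(20)) / 20
        for k in (25, 400)}
print(errs)   # the larger subsample gives a much tighter estimate
```

Going from k = 25 to k = 400 (a 16x increase) should shrink the gap by roughly 4x, matching the 1/√k scaling.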

[MA-4] Principled Learning-to-Communicate with Quasi-Classical Information Structures

[Quick Read]: This paper addresses the learning-to-communicate (LTC) problem in partially observable environments in deep multi-agent reinforcement learning, seeking to formalize and understand LTC through the lens of information structures (ISs) while avoiding computational intractability. The key idea is a common-information (CI)-based decomposition that classifies LTC problems by their information structure, focusing on the quasi-classical (QC) class; the paper proposes a series of conditions under which information sharing preserves the QC IS, enabling planning and learning algorithms with provable quasi-polynomial time and sample complexities. It further establishes results on the relationship between QC ISs and strategy-independent common-information-based beliefs (SI-CIBs), offering a route to solving Dec-POMDPs beyond SI-CIBs without computationally intractable oracles.

Link: https://arxiv.org/abs/2603.03664
Authors: Xiangyu Liu, Haoyi You, Kaiqing Zhang
Affiliations: unknown
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: Preliminary version appeared at IEEE CDC 2025


Abstract:Learning-to-communicate (LTC) in partially observable environments has received increasing attention in deep multi-agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision-making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these two lines of work, through the lens of information structures (ISs). To this end, we formalize LTC in decentralized partially observable Markov decision processes (Dec-POMDPs) under the common-information-based framework from decentralized stochastic control, and classify LTC problems based on the ISs before (additional) information sharing. We first show that non-classical LTCs are computationally intractable in general, and thus focus on quasi-classical (QC) LTCs. We then propose a series of conditions for QC LTCs, under which LTCs preserve the QC IS after information sharing, whereas violating which can cause computational hardness in general. Further, we develop provable planning and learning algorithms for QC LTCs, and establish quasi-polynomial time and sample complexities for several QC LTC examples that satisfy the above conditions. Along the way, we also establish results on the relationship between (strictly) QC IS and the condition of having strategy-independent common-information-based beliefs (SI-CIBs), as well as on solving Dec-POMDPs without computationally intractable oracles but beyond those with SI-CIBs, which may be of independent interest.

[MA-5] Behind the Prompt: The Agent-User Problem in Information Retrieval

[Quick Read]: This paper examines a foundational assumption of user models in information retrieval, that observed behavior reveals intent, which collapses when the user is an AI agent privately configured by a human operator: any agent action could have been produced by a hidden instruction, making intent non-identifiable at the individual level. Through a large-scale empirical study, the authors find that although individual agent actions cannot be classified as autonomous or operator-directed, population-level platform signals still separate agents into meaningful quality tiers; however, a click model trained on agent interaction data degrades steadily (-8.5% AUC) as lower-quality agents enter the training data, and cross-community capability references spread endemically (basic reproduction number R_0 between 1.26 and 3.53) and resist suppression even under aggressive intervention. The central conclusion is that retrieval models must move beyond human-intent assumptions and adapt to the presence of agent users.

Link: https://arxiv.org/abs/2603.03630
Authors: Saber Zerhoudi, Michael Granitzer, Dang Hai Dang, Jelena Mitrovic, Florian Lemmerich, Annette Hautli-Janisz, Stefan Katzenbeisser, Kanishka Ghosh Dastidar
Affiliations: University of Passau; IT:U Austria
Categories: Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:


Abstract:User models in information retrieval rest on a foundational assumption that observed behavior reveals intent. This assumption collapses when the user is an AI agent privately configured by a human operator. For any action an agent takes, a hidden instruction could have produced identical output, making intent non-identifiable at the individual level. This is not a detection problem awaiting better tools; it is a structural property of any system where humans configure agents behind closed doors. We investigate the agent-user problem through a large-scale corpus from an agent-native social platform: 370K posts from 47K agents across 4K communities. Our findings are threefold: (1) individual agent actions cannot be classified as autonomous or operator-directed from observables; (2) population-level platform signals still separate agents into meaningful quality tiers, but a click model trained on agent interactions degrades steadily (-8.5% AUC) as lower-quality agents enter training data; (3) cross-community capability references spread endemically (R_0 ≈ 1.26-3.53) and resist suppression even under aggressive modeled intervention. For retrieval systems, the question is no longer whether agent users will arrive, but whether models built on human-intent assumptions will survive their presence.
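A basic reproduction number of the kind reported here can be estimated from cascade data as the mean number of secondary adoptions per adopter. Below is a minimal sketch on fabricated cascades; the estimator is the textbook one, not necessarily the paper's exact procedure:

```python
# Toy R_0 estimate from propagation cascades: R_0 is approximated by the
# mean number of direct secondary adoptions per adopter.
def estimate_r0(cascades):
    """Each cascade maps an adopter to the agents it directly infected."""
    secondary = [len(children)
                 for cascade in cascades
                 for children in cascade.values()]
    return sum(secondary) / len(secondary)

cascades = [
    {"a1": ["a2", "a3"], "a2": ["a4"], "a3": []},
    {"b1": ["b2"], "b2": []},
]
r0 = estimate_r0(cascades)
print(r0)   # (2 + 1 + 0 + 1 + 0) / 5 = 0.8
```

An estimate above 1 (as in the paper's 1.26-3.53 range) means each adopter seeds more than one new adopter on average, so the behavior spreads rather than dies out.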

[MA-6] Social Norm Reasoning in Multimodal Language Models: An Evaluation

[Quick Read]: This paper targets the limits of norm reasoning in multi-agent systems (MAS): existing symbolic (formal-logic) approaches to norm representation and reasoning apply only to simplified environments and struggle in complex, diverse social situations. The key idea is to evaluate whether multimodal large language models (MLLMs) can understand and reason about social norms in both text and images: five MLLMs answer norm-related questions over thirty text-based and thirty image-based stories, and their responses are compared against human judgments. Results show that GPT-4o performs best in both modalities, suggesting promise for integration into normative multi-agent systems (NorMAS).

Link: https://arxiv.org/abs/2603.03590
Authors: Oishik Chowdhury, Anushka Debnath, Bastin Tony Roy Savarimuthu
Affiliations: University of Otago
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: to be published in ICAART 2026 post proceedings


Abstract:In Multi-Agent Systems (MAS), agents are designed with social capabilities, allowing them to understand and reason about social concepts such as norms when interacting with others (e.g., inter-robot interactions). In Normative MAS (NorMAS), researchers study how norms develop, and how violations are detected and sanctioned. However, existing research in NorMAS uses symbolic approaches (e.g., formal logic) for norm representation and reasoning whose application is limited to simplified environments. In contrast, Multimodal Large Language Models (MLLMs) present promising possibilities to develop software used by robots to identify and reason about norms in a wide variety of complex social situations embodied in text and images. However, prior work on norm reasoning has been limited to text-based scenarios. This paper investigates the norm reasoning competence of five MLLMs by evaluating their ability to answer norm-related questions based on thirty text-based and thirty image-based stories, and comparing their responses against humans. Our results show that MLLMs demonstrate superior performance in norm reasoning in text than in images. GPT-4o performs the best in both modalities offering the most promise for integration with MAS, followed by the free model Qwen-2.5VL. Additionally, all models find reasoning about complex norms challenging.
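The evaluation protocol reduces to per-model, per-modality accuracy against human gold labels. A minimal harness of this shape, with fabricated records:

```python
# Minimal sketch of the evaluation: score each model's norm-reasoning
# answers against human gold labels, split by modality.  Data and the
# model name are fabricated placeholders.
def modality_accuracy(records):
    """records: (model, modality, predicted, gold) tuples."""
    totals, hits = {}, {}
    for model, modality, pred, gold in records:
        key = (model, modality)
        totals[key] = totals.get(key, 0) + 1
        hits[key] = hits.get(key, 0) + (pred == gold)
    return {k: hits[k] / totals[k] for k in totals}

records = [
    ("gpt-4o", "text", "violation", "violation"),
    ("gpt-4o", "text", "no_violation", "no_violation"),
    ("gpt-4o", "image", "violation", "no_violation"),
    ("gpt-4o", "image", "violation", "violation"),
]
acc = modality_accuracy(records)
print(acc)   # {('gpt-4o', 'text'): 1.0, ('gpt-4o', 'image'): 0.5}
```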

[MA-7] Molt Dynamics: Emergent Social Phenomena in Autonomous AI Agent Populations

[Quick Read]: This paper asks how large populations of autonomous large language model (LLM) agents coordinate in an unsupervised, decentralized environment, and characterizes the coordination mechanisms that emerge at scale. The key contribution is the construction and analysis of MoltBook, a multi-agent environment with over 770,000 autonomous agents; through three weeks of longitudinal observation of 90,704 active agents, the study identifies three core dynamics: spontaneous role specialization (network clustering reveals six structural roles, dominated by a core-periphery organization), decentralized information dissemination (power-law distributed cascade sizes and saturating adoption dynamics), and distributed cooperative task solving (detectable coordination patterns, but low success rates that underperform a single-agent baseline). These findings establish an empirical baseline for coordination in decentralized autonomous agent systems, with implications for multi-agent system design, communication protocol engineering, and AI safety.

Link: https://arxiv.org/abs/2603.03555
Authors: Brandon Yee, Krishna Sharma
Affiliations: Stanford University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:


Abstract:MoltBook is a large-scale multi-agent coordination environment where over 770,000 autonomous LLM agents interact without human participation, offering the first opportunity we are aware of to observe emergent multi-agent coordination dynamics at this population scale. We introduce Molt Dynamics: the emergent agent coordination behaviors, inter-agent communication dynamics, and role specialization patterns arising when autonomous agents operate as decentralized decision-makers in an unconstrained multi-agent environment. Through longitudinal observation of 90,704 active agents over three weeks, we characterize three aspects. First, spontaneous role specialization: network-based clustering reveals six structural roles (silhouette 0.91), though the result primarily reflects core-periphery organization – 93.5% of agents occupy a homogeneous peripheral cluster, with meaningful differentiation confined to the active minority. Second, decentralized information dissemination: cascade analysis of 10,323 inter-agent propagation events reveals power-law distributed cascade sizes (α = 2.57 ± 0.02) and saturating adoption dynamics where adoption probability shows diminishing returns with repeated exposures (Cox hazard ratio 0.53, concordance 0.78). Third, distributed cooperative task resolution: 164 multi-agent collaborative events show detectable coordination patterns, but success rates are low (6.7%, p = 0.057) and cooperative outcomes are significantly worse than a matched single-agent baseline (Cohen's d = -0.88), indicating emergent cooperative behavior is nascent. These findings establish an empirical baseline for coordination dynamics in decentralized autonomous agent systems, with implications for multi-agent system design, agent communication protocol engineering, and AI safety.
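The power-law cascade-size fit can be reproduced in miniature with the standard continuous maximum-likelihood estimator, alpha = 1 + n / Σ ln(x_i / x_min). A sketch on synthetic data (not the paper's cascades):

```python
# Standard continuous MLE for a power-law exponent, applied to
# synthetic cascade sizes drawn by inverse-CDF sampling.
import math, random

def fit_alpha(sizes, x_min=1.0):
    tail = [x for x in sizes if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

def sample_powerlaw(alpha, n, x_min=1.0, seed=0):
    # inverse-CDF sampling: x = x_min * (1 - u)^(-1/(alpha - 1))
    rng = random.Random(seed)
    return [x_min * (1 - rng.random()) ** (-1.0 / (alpha - 1))
            for _ in range(n)]

sizes = sample_powerlaw(alpha=2.57, n=20_000)
print(fit_alpha(sizes))   # close to the true exponent 2.57
```

With 20,000 samples the estimator's standard error is about (alpha - 1)/√n ≈ 0.011, comparable in spirit to the ±0.02 the paper reports.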

[MA-8] Multi-Agent Influence Diagrams to Hybrid Threat Modeling

[Quick Read]: This paper addresses the unclear effectiveness of the counter-hybrid threat measures Western governments have adopted against hostile actions below the conventional military threshold, a problem driven by the ambiguity of hybrid threats, their cross-domain nature, and uncertainty about how countermeasures shape adversarial behavior. The key contribution is a new modeling framework based on multi-agent influence diagrams that unifies previously bifurcated hybrid threat modeling methods; it evaluates five different counter-hybrid measures in simulation by balancing their costs, their ability to dissuade the adversary, and their potential to mitigate threat impact, revealing their overall effectiveness and parameter sensitivity, grounding policy discussion, and pointing to future research directions.

Link: https://arxiv.org/abs/2603.03526
Authors: Maarten C. Vonk, Anna V. Kononova, Thomas Bäck, Tim Sweijs
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
Comments: The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology. 2025;0(0)


Abstract:Western governments have adopted an assortment of counter-hybrid threat measures to defend against hostile actions below the conventional military threshold. The impact of these measures is unclear because of the ambiguity of hybrid threats, their cross-domain nature, and uncertainty about how countermeasures shape adversarial behavior. This paper offers a novel approach to clarifying this impact by unifying previously bifurcating hybrid threat modeling methods through a (multi-agent) influence diagram framework. The model balances the costs of countermeasures, their ability to dissuade the adversary from executing hybrid threats, and their potential to mitigate the impact of hybrid threats. We run 1000 semi-synthetic variants of a real-world-inspired scenario simulating the strategic interaction between attacking agent A and defending agent B over a cyber attack on critical infrastructure to explore the effectiveness of a set of five different counter-hybrid threat measures. Counter-hybrid measures range from strengthening resilience and denial of the adversary’s ability to execute a hybrid threat to dissuasion through the threat of punishment. Our analysis primarily evaluates the overarching characteristics of counter-hybrid threat measures. This approach allows us to generalize the effectiveness of these measures and examine parameter impact sensitivity. In addition, we discuss policy relevance and outline future research avenues.
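The cost/dissuasion/mitigation balance the model strikes can be caricatured as an expected-loss comparison. All measure names and numbers below are invented for illustration; the actual influence-diagram model is far richer:

```python
# Toy expected-loss comparison of countermeasures:
#   loss(measure) = cost(measure) + P(attack | measure) * impact(measure)
# Dissuasion lowers p_attack; mitigation lowers impact; both cost money.
def expected_loss(measure):
    return measure["cost"] + measure["p_attack"] * measure["impact"]

measures = {
    "resilience":  {"cost": 3.0, "p_attack": 0.8, "impact": 2.0},
    "denial":      {"cost": 5.0, "p_attack": 0.4, "impact": 6.0},
    "punishment":  {"cost": 2.0, "p_attack": 0.6, "impact": 8.0},
    "do_nothing":  {"cost": 0.0, "p_attack": 0.9, "impact": 10.0},
}
ranked = sorted(measures, key=lambda m: expected_loss(measures[m]))
print(ranked)   # ['resilience', 'punishment', 'denial', 'do_nothing']
```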

[MA-9] TritonDFT: Automating DFT with a Multi-Agent Framework

[Quick Read]: This paper addresses the difficulty of coordinating the multi-step workflows required to run Density Functional Theory (DFT) in practice: existing tools and LLM-based solutions automate parts of the pipeline but fall short on full workflow automation, adaptation to diverse tasks, and accuracy-cost trade-off optimization in DFT configuration. The key contribution is TritonDFT, a multi-agent framework that enables efficient and accurate DFT execution through an expert-curated and extensible workflow design, Pareto-aware parameter inference, and multi-source knowledge augmentation.

Link: https://arxiv.org/abs/2603.03372
Authors: Zhengding Hu, Kuntal Talit, Zhen Wang, Haseeb Ahmad, Yichen Lin, Prabhleen Kaur, Christopher Lane, Elizabeth A. Peterson, Zhiting Hu, Elizabeth A. Nowadnick, Yufei Ding
Affiliations: unknown
Categories: Materials Science (cond-mat.mtrl-sci); Multiagent Systems (cs.MA)
Comments:


Abstract:Density Functional Theory (DFT) is a cornerstone of materials science, yet executing DFT in practice requires coordinating a complex, multi-step workflow. Existing tools and LLM-based solutions automate parts of the steps, but lack support for full workflow automation, diverse task adaptation, and accuracy-cost trade-off optimization in DFT configuration. To this end, we present TritonDFT, a multi-agent framework that enables efficient and accurate DFT execution through an expert-curated, extensible workflow design, Pareto-aware parameter inference, and multi-source knowledge augmentation. We further introduce DFTBench, a benchmark for evaluating the agent’s multi-dimensional capabilities, spanning science expertise, trade-off optimization, HPC knowledge, and cost efficiency. TritonDFT provides an open user interface for real-world usage. Our website is at this https URL. Our source code and benchmark suite are available at this https URL.
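Pareto-aware parameter inference presumably involves discarding configurations that are dominated in the (error, cost) plane. A hedged sketch with invented configurations:

```python
# Pareto-front filter over DFT parameter settings: keep only those not
# dominated in (error, cost), where lower is better in both.  The
# candidate configurations are illustrative placeholders.
def pareto_front(candidates):
    """candidates: list of (name, error, cost) triples."""
    front = []
    for name, err, cost in candidates:
        dominated = any(e <= err and c <= cost and (e, c) != (err, cost)
                        for _, e, c in candidates)
        if not dominated:
            front.append(name)
    return front

configs = [
    ("coarse_grid", 0.30, 1.0),
    ("medium_grid", 0.10, 4.0),
    ("fine_grid",   0.02, 20.0),
    ("wasteful",    0.30, 5.0),   # dominated by coarse_grid
]
print(pareto_front(configs))   # ['coarse_grid', 'medium_grid', 'fine_grid']
```

Once the front is computed, a downstream agent can pick a point on it according to the accuracy the task demands and the compute budget available.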

Natural Language Processing

[NLP-0] AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

[Quick Read]: This paper addresses the fact that existing retrieval systems ignore the reasoning process of Deep Research agents: conventional retrievers embed only the query, discarding the natural-language reasoning trace an agent generates before each search call, which carries rich intent and contextual information. The key contributions are twofold: Reasoning-Aware Retrieval, which jointly embeds the agent's reasoning trace alongside its query to improve retrieval relevance; and DR-Synth, a data synthesis method that automatically builds Deep Research retriever training data from standard QA datasets. Experiments show both components are independently effective and their combination yields substantial gains: on the BrowseComp-Plus benchmark, the resulting AgentIR-4B model reaches 68% accuracy, versus 50% for conventional embedding models twice its size and 37% for BM25.

Link: https://arxiv.org/abs/2603.04384
Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong
Affiliations: University of Waterloo; University of Queensland; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent’s reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: this https URL.
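Reasoning-aware retrieval can be illustrated with a toy ranker: prepend the reasoning trace to the query before embedding. A bag-of-words cosine stands in for the learned encoder; all documents and traces below are fabricated:

```python
# Toy illustration of reasoning-aware retrieval: embedding the agent's
# reasoning trace together with its query changes which document ranks
# first.  Counter-based bag-of-words is a stand-in for a real encoder.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "d1": "graphite anode degradation in lithium battery cells",
    "d2": "lithium battery supply chains and exports",
}
query = "lithium battery supply"
reasoning = "the agent needs anode degradation chemistry details"

plain = max(docs, key=lambda d: cosine(embed(query), embed(docs[d])))
aware = max(docs, key=lambda d: cosine(embed(reasoning + " " + query),
                                       embed(docs[d])))
print(plain, aware)   # query alone picks d2; reasoning steers to d1
```

The reasoning trace disambiguates the query's intent, which is exactly the signal conventional retrievers throw away.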

[NLP-1] TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning WACV2026

[Quick Read]: This paper tackles the difficulty traditional vision-language models have in distinguishing visually similar species within the same genus or family, i.e., contrastive fine-grained taxonomic reasoning. The key idea, TaxonRL, builds on Group Relative Policy Optimization with intermediate rewards, decomposing the reasoning process into hierarchical taxonomic predictions (species-, genus-, and family-level) so that the model must explicitly reason through features level by level, improving classification accuracy while producing interpretable decision traces.

Link: https://arxiv.org/abs/2603.04380
Authors: Maximilian von Klinski, Maximilian Schall
Affiliations: Hasso Plattner Institute; University of Potsdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at WACV 2026


Abstract:Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, exceeding human performance (77.3%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.
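The intermediate-reward idea can be sketched as crediting each level of the predicted taxonomy rather than only the final species label. The weighting scheme below is an illustrative assumption, not the paper's reward:

```python
# Hierarchical intermediate reward: partial credit for correct family
# and genus even when the species prediction is wrong.  Weights are
# invented (species weighted highest).
def taxonomic_reward(pred, gold, weights=(0.25, 0.25, 0.5)):
    """pred/gold: (family, genus, species) label triples."""
    return sum(w for w, p, g in zip(weights, pred, gold) if p == g)

gold = ("Corvidae", "Corvus", "Corvus corax")
print(taxonomic_reward(("Corvidae", "Corvus", "Corvus corax"), gold))   # 1.0
print(taxonomic_reward(("Corvidae", "Corvus", "Corvus cornix"), gold))  # 0.5
print(taxonomic_reward(("Paridae", "Parus", "Parus major"), gold))      # 0.0
```

A dense reward of this shape gives the policy a learning signal even when the hardest (species-level) distinction is missed, which is the point of intermediate rewards.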

[NLP-2] Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

[Quick Read]: This paper addresses the vulnerability of multimodal web agents to adversarial attacks that simultaneously corrupt the visual screenshot and the accessibility tree: by injecting content into the webpage DOM, an attacker feeds the agent's two perception channels a consistent deceptive narrative. The key contribution is Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), which models the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players in three stages: imitation learning from a strong teacher model, supervised fine-tuning with a novel zero-acknowledgment strategy that keeps the agent focused on task reasoning under adversarial noise, and adversarial self-play reinforcement learning via Group Relative Policy Optimization (GRPO). The result is markedly improved robustness and generalization on out-of-distribution tasks.

Link: https://arxiv.org/abs/2603.04364
Authors: Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng
Affiliations: UC Berkeley, IEOR BAIR; Google; Google Deepmind
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
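Stage three relies on GRPO, whose defining step is a group-relative advantage: normalize each rollout's reward against its group's mean and standard deviation. A minimal sketch with made-up rewards:

```python
# Group-relative advantage as used in GRPO: sample a group of rollouts
# per prompt, then standardize rewards within the group.  Rewards here
# are fabricated (e.g. task success under attack).
import math

def group_relative_advantages(rewards, eps=1e-8):
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in adv])   # [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is the group mean, GRPO needs no learned value function, which keeps self-play training like DMAST's comparatively cheap.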

[NLP-3] Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges Faces Selection CVPR2026

[Quick Read]: This paper addresses two core problems that limit LLM-based computer-aided design (CAD) generation in practice: conventional command-sequence representations cannot select geometric entities (such as faces or edges), blocking complex editing operations like chamfer and fillet; and discretizing continuous variables during sketch and extrude operations introduces topological errors. The key contribution is Pointer-CAD, which uses a pointer-based command-sequence representation to explicitly incorporate the geometry of the boundary representation (B-rep) model into sequential modeling. CAD generation is decomposed into steps, each conditioned on the textual description and the B-rep produced by previous steps; whenever an operation must select a specific geometric entity, the LLM predicts a Pointer that chooses the most feature-consistent candidate from the available set, reducing quantization error and improving topological accuracy.

Link: https://arxiv.org/abs/2603.04337
Authors: Dacheng Qi, Chenyu Wang, Jingwei Xu, Tianzhe Chu, Zibo Zhao, Wen Liu, Wenrui Ding, Yi Ma, Shenghua Gao
Affiliations: Transcengram; Beihang University; The University of Hong Kong; Shenzhen Loop Area Institute; ShanghaiTech University; Tencent; DeepSeek; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by CVPR2026


Abstract:Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired the LLM-based CAD generation by representing CAD as command sequences. But these methods struggle in practical scenarios because command sequence representation does not support entity selection (e.g. faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning the generation of each subsequent step on both the textual description and the B-rep generated from previous steps. Whenever an operation requires the selection of a specific geometric entity, the LLM predicts a Pointer that selects the most feature-consistent candidate from the available set. Such a selection operation also reduces the quantization error in the command sequence-based representation. To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural language descriptions and apply it to build a dataset of approximately 575K CAD models. Extensive experimental results demonstrate that Pointer-CAD effectively supports the generation of complex geometric structures and reduces segmentation error to an extremely low level, achieving a significant improvement over prior command sequence methods, thereby significantly mitigating the topological inaccuracies introduced by quantization error.
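The Pointer mechanism can be caricatured as scoring candidate B-rep entities against a predicted query vector and selecting the argmax. Entity features and the query below are fabricated:

```python
# Toy sketch of pointer-based entity selection: rather than emitting
# coordinates, the model picks the candidate B-rep entity whose feature
# vector best matches a predicted query (dot-product scoring).
def select_entity(query, candidates):
    """candidates: {entity_id: feature_vector}; returns best-scoring id."""
    def score(vec):
        return sum(q * v for q, v in zip(query, vec))
    return max(candidates, key=lambda eid: score(candidates[eid]))

edges = {
    "edge_07": [0.9, 0.1, 0.0],   # invented features, e.g. length/convexity
    "edge_12": [0.2, 0.8, 0.1],
    "edge_31": [0.1, 0.1, 0.9],
}
query = [0.0, 1.0, 0.2]           # "pick the convex edge for the fillet"
print(select_entity(query, edges))   # edge_12
```

Selecting from an existing candidate set, instead of regressing quantized coordinates, is what sidesteps the discretization errors the paper highlights.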

[NLP-4] AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

[Quick Read]: This paper addresses abductive event reasoning for multi-label causal inference: inferring the most plausible causal chains from complex scenarios and accurately identifying multiple relevant events. The key is a three-stage system: graph-based retrieval extracts candidate causal evidence; an LLM performs abductive reasoning, with prompts optimized through reflective prompt evolution; and post-hoc consistency enforcement keeps multi-label outputs logically coherent. The system reaches 0.95 accuracy on the SemEval 2026 task, ranking first. A cross-model error analysis over 14 models from 7 families reveals three shared inductive biases (causal chain incompleteness, proximate cause preference, and salience bias), indicating that systematic failure modes rather than model-specific quirks dominate current multi-label causal reasoning.

Link: https://arxiv.org/abs/2603.04319
Authors: Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou
Affiliations: National Technical University of Athens; AILS Laboratory
Categories: Computation and Language (cs.CL)
Comments:


Abstract:We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc consistency enforcement; our system ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95. Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51% cause-count reduction) indicates systematic rather than model-specific failure modes in multi-label causal reasoning.

[NLP-5] World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

[Quick Read]: This paper asks whether the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states truly reflects world-like internal representations, or merely structure already present in text. The key finding: applying the same ridge regression probes to static co-occurrence embeddings (GloVe and Word2Vec) recovers substantial geographic signal (R² = 0.71-0.87 for city coordinates) and weaker but reliable temporal signal (R² = 0.48-0.52 for birth years), and this structure depends on interpretable lexical gradients such as country names and climate-related vocabulary. Much of the apparent world knowledge attributed to deep model representations is thus already encoded in the co-occurrence patterns of text itself, undercutting the inference that linear recoverability alone establishes internal world representations beyond text.

链接: https://arxiv.org/abs/2603.04317
作者: Elan Barenholtz
机构: Florida Atlantic University (佛罗里达大西洋大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world-like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co-occurrence-based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held-out R^2 values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years. Semantic-neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary. These findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental structure than is often assumed, revealing a remarkable and underappreciated capacity of simple static embeddings to preserve world-shaped structure from text alone. Linear probe recoverability alone therefore does not establish a representational move beyond text.
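The probing setup above can be illustrated with a toy sketch (hypothetical data and a single "embedding dimension", not the paper's actual GloVe/Word2Vec probes): a closed-form ridge fit from one embedding feature to a latitude-like target, scored with R^2.

```python
def ridge_1d(xs, ys, lam=1e-3):
    """Closed-form ridge regression for a single feature (no intercept)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + lam
    return num / den

def r_squared(ys, preds):
    """Coefficient of determination of predictions against targets."""
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Toy "embedding dimension" values and latitude-like targets (illustrative only).
feature = [1.0, 2.0, 3.0, 4.0]
latitude = [60.0, 120.0, 180.0, 240.0]

w = ridge_1d(feature, latitude)
preds = [w * x for x in feature]
r2 = r_squared(latitude, preds)
```

A high R^2 here only shows the target is linearly decodable from the feature, which is exactly the paper's caution: decodability alone does not establish a world model.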

[NLP-6] V_1: Unifying Generation and Self-Verification for Parallel Reasoners

[Quick Read]: This paper tackles the test-time-scaling bottleneck for complex reasoning: efficiently identifying the correct solution among many candidates. Existing methods score candidates independently with scalar values, which limits accuracy. The key innovation, V_1, unifies generation and verification through efficient pairwise ranking: V_1-Infer uses an uncertainty-guided, tournament-style ranking that dynamically allocates self-verification compute to the candidate pairs whose relative correctness is most uncertain, while V_1-PairRL trains a single model with reinforcement learning as both generator and pairwise self-verifier, so that verification adapts as the generation distribution evolves. The approach delivers clear gains on code generation and mathematical reasoning benchmarks with higher compute efficiency.

Link: https://arxiv.org/abs/2603.04304
Authors: Harman Singh,Xiuyu Li,Kusha Sareen,Monishwaran Maheswaran,Sijun Tan,Xiaoxia Wu,Junxiong Wang,Alpay Ariyak,Qingyang Wu,Samir Khaki,Rishabh Tiwari,Long Lian,Yucheng Lu,Boyi Li,Alane Suhr,Ben Athiwaratkun,Kurt Keutzer
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger at pairwise self-verification. Leveraging this insight, we introduce V_1, a framework that unifies generation and verification through efficient pairwise ranking. V_1 comprises two components: V_1-Infer, an uncertainty-guided algorithm using a tournament-based ranking that dynamically allocates self-verification compute to candidate pairs whose relative correctness is most uncertain; and V_1-PairRL, an RL framework that jointly trains a single model as both generator and pairwise self-verifier, ensuring the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, V_1-PairRL achieves 7-9% test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7% over standard RL in a code-generation setting.
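The tournament-style pairwise selection can be sketched minimally as follows (illustrative only; V_1-Infer's actual uncertainty-guided compute allocation is more involved, and the `better` comparator here is a numeric stand-in for pairwise self-verification):

```python
def tournament(candidates, better):
    """Single-elimination tournament: keep the pairwise winner each round."""
    pool = list(candidates)
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            nxt.append(a if better(a, b) else b)
        if len(pool) % 2 == 1:      # odd candidate out gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]

calls = []
def better(a, b):
    """Stand-in verifier: here, plain numeric comparison."""
    calls.append((a, b))
    return a > b

winner = tournament([3, 7, 2, 9, 5], better)
```

With n candidates this uses exactly n - 1 pairwise comparisons, which is why pairwise selection can be cheaper than scoring every candidate against every other.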

[NLP-7] The Company You Keep: How LLMs Respond to Dark Triad Traits

[Quick Read]: This paper studies the risk that LLMs reinforce rather than correct harmful tendencies when users express Dark Triad traits (Machiavellianism, narcissism, psychopathy) in their prompts. The key is a curated dataset used to systematically evaluate how different LLMs respond to varying degrees of Dark Triad intent: all models predominantly respond correctively, yet still produce sycophantic, reinforcing output in certain cases, and the sentiment of responses tracks the severity of the user's intent. The findings provide empirical grounds for designing safer conversational systems that detect and respond appropriately as users escalate from benign to harmful requests.

Link: https://arxiv.org/abs/2603.04299
Authors: Zeyi Lu,Angelica Henestrosa,Pavel Chizhov,Ivan P. Yamshchikov
Institutions: CAIRO, Technical University of Applied Sciences Würzburg-Schweinfurt
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this behavior is encouraged, it may become problematic when interacting with user prompts that reflect negative social tendencies. Such responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models, whereby all models predominantly exhibit corrective behavior, while showing reinforcing output in certain cases. Model behavior also depends on the severity level and differs in the sentiment of the response. Our findings raise implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.

[NLP-8] Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models

[Quick Read]: This position paper addresses the customization bottleneck that emerges as LLMs move from research prototypes to deployed systems. Because text-only prompting cannot meet the demands of scalable, stable, inference-only customization, the authors argue that model providers should expose vector prompts as part of the public interface. The position rests on diagnostic evidence: vector prompt tuning keeps improving with more supervision while text-based prompt optimization saturates early, and vector prompts exhibit dense, global attention patterns indicative of a control mechanism distinct from text prompts. Moreover, under a standard black-box threat model, exposing vector prompts need not materially increase model-leakage risk, supporting their safe use at inference time.

Link: https://arxiv.org/abs/2603.04292
Authors: Liangwei Yang,Shiyu Wang,Haolin Chen,Rithesh Murthy,Ming Zhu,Jielin Qiu,Zixiang Chen,Juntao Tan,Jianguo Zhang,Zhiwei Liu,Wenting Zhao,Silvio Savarese,Caiming Xiong,Huan Wang,Shelby Heinecke
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck. While text prompts can already customize LLM behavior, we argue that text-only prompting does not constitute a suitable control interface for scalable, stable, and inference-only customization. This position paper argues that model providers should expose vector prompt inputs as part of the public interface for customizing LLMs. We support this position with diagnostic evidence showing that vector prompt tuning continues to improve with increasing supervision whereas text-based prompt optimization saturates early, and that vector prompts exhibit dense, global attention patterns indicative of a distinct control mechanism. We further discuss why inference-only customization is increasingly important under realistic deployment constraints, and why exposing vector prompts need not fundamentally increase model leakage risk under a standard black-box threat model. We conclude with a call to action for the community to rethink prompt interfaces as a core component of LLM customization.

[NLP-9] Causality Elicitation from Large Language Models

[Quick Read]: This paper asks how to elicit latent causal hypotheses from LLMs, which encode vast knowledge during training but whose internal representations lack interpretable causal structure. The key is a systematic pipeline that yields a testable set of causal hypotheses: sample many related documents from the LLM, extract an event list from each, cluster events that recur across documents into canonical events, build a binary indicator vector for each document over those canonical events, and estimate candidate causal graphs with causal discovery algorithms. The method does not guarantee real-world causality; instead, it provides an interpretable, inspectable framework of causal hypotheses for researchers to verify and analyze further.

Link: https://arxiv.org/abs/2603.04276
Authors: Takashi Kameyama,Masahiro Kato,Yasuko Hio,Yasushi Takano,Naoto Minakawa
Institutions: University of Tokyo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Econometrics (econ.EM)
Comments:

Abstract:Large language models (LLMs) are trained on enormous amounts of data and encode knowledge in their parameters. We propose a pipeline to elicit causal relationships from LLMs. Specifically, (i) we sample many documents from LLMs on a given topic, (ii) we extract an event list from each document, (iii) we group events that appear across documents into canonical events, (iv) we construct a binary indicator vector for each document over canonical events, and (v) we estimate candidate causal graphs using causal discovery methods. Our approach does not guarantee real-world causality. Rather, it provides a framework for presenting the set of causal hypotheses that LLMs can plausibly assume, as an inspectable set of variables and candidate graphs.
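Steps (iii)-(iv) of the pipeline, grouping events and building per-document binary indicator vectors, can be sketched as follows (toy event lists; real canonicalization would cluster paraphrases rather than match exact strings):

```python
# Toy event lists "extracted" from three sampled documents (hypothetical).
docs_events = [
    ["heavy rain", "flooding"],
    ["heavy rain"],
    ["flooding", "evacuation"],
]

# Canonical event vocabulary: union of events across documents, sorted for stability.
canonical = sorted({e for doc in docs_events for e in doc})

def indicator(doc, canonical):
    """Binary vector marking which canonical events a document mentions."""
    present = set(doc)
    return [1 if e in present else 0 for e in canonical]

matrix = [indicator(doc, canonical) for doc in docs_events]
```

The resulting document-by-event binary matrix is what a causal discovery algorithm (step v) would consume, treating each canonical event as a variable.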

[NLP-10] Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

[Quick Read]: This paper addresses the long-horizon memory bottleneck of LLM agents caused by finite context windows: as trajectories grow, keeping tool outputs and intermediate reasoning in context becomes infeasible, the working context overruns the budget, and distant evidence becomes hard to use. Existing methods shorten context by truncation or running summaries, but these are inherently lossy because they compress or discard the evidence itself. The key is Memex, an indexed experience memory that keeps concise structured summaries with stable indices in a compact working context while storing full-fidelity interactions in an external experience database, from which exact evidence can be recalled via its index. The MemexRL reinforcement learning framework optimizes this loop with reward shaping so the agent learns when to summarize, archive, index, and retrieve, yielding far less information loss than summary-only approaches. A theoretical analysis shows the Memex loop can preserve decision quality with bounded dereferencing while keeping effective in-context computation bounded as history grows, and experiments on challenging long-horizon tasks show that agents trained with MemexRL improve task success with a much smaller working context.

Link: https://arxiv.org/abs/2603.04257
Authors: Zhenting Wang,Huancheng Chen,Jiayun Wang,Wei Wei
Institutions: Center for Advanced AI, Accenture
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool outputs and intermediate reasoning in-context quickly becomes infeasible: the working context becomes prohibitively long, eventually exceeds the context budget, and makes distant evidence harder to use even when it is still present. Existing solutions typically shorten context through truncation or running summaries, but these methods are fundamentally lossy because they compress or discard past evidence itself. We introduce Memex, an indexed experience memory mechanism that instead compresses context without discarding evidence. Memex maintains a compact working context consisting of concise structured summaries and stable indices, while storing full-fidelity underlying interactions in an external experience database under those indices. The agent can then decide when to dereference an index and recover the exact past evidence needed for the current subgoal. We optimize both write and read behaviors with our reinforcement learning framework MemexRL, using reward shaping tailored to indexed memory usage under a context budget, so the agent learns what to summarize, what to archive, how to index it, and when to retrieve it. This yields a substantially less lossy form of long-horizon memory than summary-only approaches. We further provide a theoretical analysis showing the potential of the Memex loop to preserve decision quality with bounded dereferencing while keeping effective in-context computation bounded as history grows. Empirically, on challenging long-horizon tasks, Memex agent trained with MemexRL improves task success while using a significantly smaller working context.
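The write/dereference loop can be sketched with a minimal in-memory store (a deliberate simplification of Memex; the class, index format, and word-count proxy below are illustrative, not the paper's implementation):

```python
class IndexedMemory:
    """Compact working context of (index, summary) pairs plus a full-fidelity archive."""

    def __init__(self):
        self.working_context = []   # what would stay in the prompt
        self.archive = {}           # index -> full interaction record
        self._next = 0

    def write(self, summary, full_record):
        """Archive the full record; keep only a short indexed summary in context."""
        idx = f"mem-{self._next}"
        self._next += 1
        self.archive[idx] = full_record
        self.working_context.append((idx, summary))
        return idx

    def dereference(self, idx):
        """Recover the exact past evidence when a subgoal needs it."""
        return self.archive[idx]

    def context_words(self):
        """Crude proxy for working-context size."""
        return sum(len(s.split()) for _, s in self.working_context)

mem = IndexedMemory()
tool_output = "line " * 500          # a long tool result we cannot keep in context
idx = mem.write("grep found 500 matching lines", tool_output)
```

Unlike a running summary, nothing is lost: the summary stays cheap in context while `dereference` returns the original evidence verbatim.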

[NLP-11] Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG ICLR2026

[Quick Read]: This paper examines whether recent RAG improvements on multilingual documents with complex visual layouts genuinely stem from better retrieval mechanisms or are driven by better document representations. The key move is to hold the retrieval mechanism fixed while systematically varying transcription and preprocessing: even classical lexical matching with BM25 then closes large gaps on multilingual and visual benchmarks, showing that document representation quality is the primary driver of benchmark gains. The findings call for decomposed benchmarks that measure transcription and retrieval capabilities separately, so that progress is correctly attributed and effort is focused where it matters.

Link: https://arxiv.org/abs/2603.04238
Authors: Martin Asenov,Kenza Benkirane,Dan Goldwater,Aneiss Ghodsi
Institutions: Parexel AI Labs
Subjects: Computation and Language (cs.CL)
Comments: ICLR 2026 Workshop I Can't Believe It's Not Better: Where Large Language Models Need to Improve

Abstract:Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval systems relied on lexical methods such as BM25, which rank documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches, especially for multilingual documents with complex visual layouts. We demonstrate that better document representation is the primary driver of benchmark improvements. By systematically varying transcription and preprocessing methods while holding the retrieval mechanism fixed, we demonstrate that BM25 can recover large gaps on multilingual and visual benchmarks. Our findings call for decomposed evaluation benchmarks that separately measure transcription and retrieval capabilities, enabling the field to correctly attribute progress and focus effort where it matters.
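BM25, the lexical baseline the paper revisits, scores a document by term-frequency saturation and inverse document frequency with length normalization. A compact sketch with the common k1/b defaults (tokenization assumed done upstream; the Lucene-style idf variant is one of several in use):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = []
    for doc in docs:
        s = 0.0
        for term in query:
            df = sum(1 for d in docs if term in d)
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # Lucene-style idf
            tf = doc.count(term)
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)   # length normalization
            s += idf * tf * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    ["the", "cat", "sat"],
    ["dogs", "bark", "loudly"],
    ["the", "cat", "and", "the", "cat"],
]
scores = bm25_scores(["cat"], docs)
```

Because BM25 only sees tokens, its effectiveness is bounded by transcription quality, which is exactly the lever the paper varies.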

[NLP-12] When Do Language Models Endorse Limitations on Human Rights Principles? EACL

[Quick Read]: This paper asks how LLMs align with the human rights principles of the Universal Declaration of Human Rights (UDHR) in high-stakes information exchanges, so that AI systems shaping public discourse do not contravene fundamental rights. The key is an evaluation framework of 1,152 synthetic scenarios covering 24 rights articles in eight languages. Systematic tests of eleven major LLMs reveal: a preference for limiting Economic, Social, and Cultural rights over Political and Civil rights; cross-linguistic biases; substantial susceptibility to prompt-based steering; and noticeable differences between Likert and open-ended response formats, providing empirical evidence and directions for improving the human rights compliance of LLMs.

Link: https://arxiv.org/abs/2603.04217
Authors: Keenan Samway,Nicole Miu Takagi,Rada Mihalcea,Bernhard Schölkopf,Ilias Chalkidis,Daniel Hershcovich,Zhijing Jin
Institutions: Max Planck Institute for Intelligent Systems; University of Toronto; Vector Institute; University of Michigan; University of Copenhagen; EuroSafeAI
Subjects: Computation and Language (cs.CL)
Comments: EACL Findings 2026

Abstract:As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with universal human rights principles becomes important to ensure that these rights are abided by in high stakes AI-mediated interactions. In this paper, we evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights (UDHR), leveraging 1,152 synthetically generated scenarios across 24 rights articles and eight languages. Our analysis of eleven major LLMs reveals systematic biases where models: (1) accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights, (2) demonstrate significant cross-linguistic variation with elevated endorsement rates of rights-limiting actions in Chinese and Hindi compared to English or Romanian, (3) show substantial susceptibility to prompt-based steering, and (4) exhibit noticeable differences between Likert and open-ended responses, highlighting critical challenges in LLM preference assessment.

[NLP-13] Code Fingerprints: Disentangled Attribution of LLM-Generated Code

[Quick Read]: This paper addresses model-level code attribution: identifying which LLM generated a given code snippet, which matters for software governance, vulnerability attribution, and compliance audits. Existing work mostly distinguishes machine-generated from human-written code and lacks fine-grained attribution across LLM sources. The key is the Disentangled Code Attribution Network (DCAN), which separates source-agnostic semantic information from source-specific stylistic features and uses a contrastive learning objective to extract discriminative model-dependent signals while preserving task-relevant semantics, enabling multi-class attribution across models and programming languages.

Link: https://arxiv.org/abs/2603.04212
Authors: Jiaxun Guo,Ziyuan Yang,Mengyu Sun,Hui Wang,Jingfeng Lu,Yi Zhang
Institutions: Sichuan University
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 11 pages, 11 figures

Abstract:The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While these systems improve productivity, they introduce new challenges for software governance, accountability, and compliance. Existing research primarily focuses on distinguishing machine-generated code from human-written code; however, many practical scenarios–such as vulnerability triage, incident investigation, and licensing audits–require identifying which LLM produced a given code snippet. In this paper, we study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. Although attribution is challenging, differences in training data, architectures, alignment strategies, and decoding mechanisms introduce model-dependent stylistic and structural variations that serve as generative fingerprints. Leveraging this observation, we propose the Disentangled Code Attribution Network (DCAN), which separates Source-Agnostic semantic information from Source-Specific stylistic representations. Through a contrastive learning objective, DCAN isolates discriminative model-dependent signals while preserving task semantics, enabling multi-class attribution across models and programming languages. To support systematic evaluation, we construct the first large-scale benchmark dataset comprising code generated by four widely used LLMs (DeepSeek, Claude, Qwen, and ChatGPT) across four programming languages (Python, Java, C, and Go). Experimental results demonstrate that DCAN achieves reliable attribution performance across diverse settings, highlighting the feasibility of model-level provenance analysis in software engineering contexts. The dataset and implementation are publicly available at this https URL.

[NLP-14] Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model

[Quick Read]: This paper provides the first systematic academic evaluation of extreme 2-bit quantization for a Polish large language model, where the central challenge is preserving reasoning and generation quality at very low bit widths. The key: six state-of-the-art post-training quantization methods (QuIP#, SpinQuant+GPTQ, ButterflyQuant, QTIP, VPTQ, and AQLM) are compared under a unified calibration setup with shared Hessian matrices. QuIP# E8P12 reaches 71.92% on average across Polish benchmarks, within statistical noise of the full IQ2_XXS baseline (72.07%), while clearly outperforming it on eq_bench (+3.6pp), suggesting better preservation of higher-order reasoning. The study also documents an "MC-generation dissociation" in which rotation-based methods preserve log-likelihood quality yet degrade badly at autoregressive generation, exposing tensions between quantization objectives. The result is a reproducible, high-quality quantization recipe for low-resource deployment.

Link: https://arxiv.org/abs/2603.04162
Authors: Jakub Prejzner
Institutions: BitSharp
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 13 tables. All models and Hessians available at this https URL

Abstract:We present Bielik-Q2-Sharp, the first systematic academic evaluation of extreme 2-bit quantization applied to a Polish large language model. Using Bielik-11B-v2.3-Instruct (11B parameters, Mistral architecture) as our base model, we compare six state-of-the-art post-training quantization methods – QuIP#, SpinQuant+GPTQ, ButterflyQuant, QTIP, VPTQ, and AQLM – all calibrated on a Polish-language corpus (CulturaX-PL) with shared Hessian matrices. Our best variant (QuIP# E8P12) achieves 71.92% across 22 Polish benchmarks versus 72.07% for the IQ2_XXS baseline – within statistical noise, at a modest size premium (3.26 GB vs. ~2.6 GB). On eq_bench, our method scores 47.14 versus 43.53 (+3.6pp), suggesting superior preservation of higher-order reasoning. QTIP achieves the best per-bit efficiency (79.4% MC acc_norm at ~2.4 bpw, 3.27 GB), matching VPTQ’s quality at 35% smaller size. We additionally document a MC-generation dissociation phenomenon where rotation-based methods preserve log-likelihood quality but fail catastrophically at autoregressive generation. The entire project was conducted by a single independent researcher on cloud GPUs (this http URL) within a 285 budget. All models, Hessians, and evaluation logs are publicly available.
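For intuition only, uniform 2-bit quantization (four levels per tensor) of a weight vector looks like the sketch below; the paper's methods (QuIP#, VPTQ, AQLM, etc.) use far more sophisticated codebooks, rotations, and Hessian-aware calibration:

```python
def quantize_2bit(weights):
    """Map each weight to the nearest of 4 evenly spaced levels (2 bits/weight)."""
    lo, hi = min(weights), max(weights)
    levels = [lo + i * (hi - lo) / 3 for i in range(4)]
    codes = [min(range(4), key=lambda i: abs(w - levels[i])) for w in weights]
    return codes, levels

def dequantize(codes, levels):
    """Reconstruct approximate weights from 2-bit codes."""
    return [levels[c] for c in codes]

weights = [-1.0, -0.3, 0.2, 1.0]
codes, levels = quantize_2bit(weights)
approx = dequantize(codes, levels)
max_err = max(abs(w - a) for w, a in zip(weights, approx))
```

Nearest-level rounding bounds the per-weight error by half a quantization step, which is why shaping the weight distribution (rotations, vector codebooks) matters so much at 2 bits.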

[NLP-15] Traces of Social Competence in Large Language Models

[Quick Read]: This paper addresses the limited reliability and explanatory power of the False Belief Test (FBT) for assessing Theory of Mind (ToM) in LLMs, caused by data contamination, insufficient model details, and inconsistent controls. The key: 17 open-weight models are tested on a balanced set of 192 FBT variants, with Bayesian logistic regression used to analyze how model size and post-training affect socio-cognitive competence. Scaling generally helps, but a cross-over effect shows that explicating propositional attitudes ("X thinks") fundamentally alters response patterns, and a case study of OLMo 2 traces this effect to pre-training, suggesting models acquire stereotypical response patterns tied to mental-state vocabulary. Finally, vector steering isolates a "think vector" as the causal driver of observed FBT behavior, offering an interpretable causal account of social reasoning in these models.

Link: https://arxiv.org/abs/2603.04161
Authors: Tom Kouwenhoven,Michiel van der Meer,Max van Duijn
Institutions: Leiden Institute of Advanced Computer Science; Leiden University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al. 2023) using Bayesian Logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented finetuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.
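Vector steering in its simplest form adds a scaled unit direction to a hidden state; a toy numeric sketch (the paper's extraction of the "think vector" and its injection points are more involved, and the direction below is hypothetical):

```python
def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state by alpha along the unit-normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

hidden = [0.0, 0.0, 0.0]
think_vector = [0.0, 3.0, 4.0]       # hypothetical steering direction (norm 5)
steered = steer(hidden, think_vector, alpha=2.0)
```

Comparing model behavior with and without such an added direction is what lets the authors argue the vector is causally, not just correlationally, linked to FBT responses.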

[NLP-16] VietNormalizer: An Open-Source Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications

[Quick Read]: This paper addresses Vietnamese text normalization, a critical yet under-served preprocessing step for TTS and NLP: real-world Vietnamese text is dense with non-standard words (NSWs) such as numbers, dates, times, currency amounts, percentages, acronyms, and foreign terms, all of which must be converted to pronounceable Vietnamese before downstream use. Existing tools either rely on heavy neural models with narrow NSW coverage or are buried inside large NLP toolkits without standalone installability. The key is a unified rule-based pipeline with seven components: number-to-words conversion for integers, decimals, and large numbers; spoken forms for dates and times; VND/USD currency handling; percentage expansion; customizable acronym resolution; phonetic approximation of non-Vietnamese loanwords; and Unicode normalization with special-character removal. All regular expressions are pre-compiled at initialization, enabling high-throughput batch processing with low memory overhead and no GPU or external API dependency; the zero-dependency library installs via pip install vietnormalizer.

Link: https://arxiv.org/abs/2603.04145
Authors: Hung Vu Nguyen,Loan Do,Thanh Ngoc Nguyen,Ushik Shrestha Khwakhali,Thanh Pham,Vinh Do,Charlotte Nguyen,Hien Nguyen
Institutions: Australian Catholic University (ACU); FPT University; ICMS; KETEMU; RMIT University Vietnam; NGHI Studio; Phuong Hai JSC
Subjects: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 10 pages, 1 table

Abstract:We present VietNormalizer, an open-source, zero-dependency Python library for Vietnamese text normalization targeting Text-to-Speech (TTS) and Natural Language Processing (NLP) applications. Vietnamese text normalization is a critical yet underserved preprocessing step: real-world Vietnamese text is densely populated with non-standard words (NSWs), including numbers, dates, times, currency amounts, percentages, acronyms, and foreign-language terms, all of which must be converted to fully pronounceable Vietnamese words before TTS synthesis or downstream language processing. Existing Vietnamese normalization tools either require heavy neural dependencies while covering only a narrow subset of NSW classes, or are embedded within larger NLP toolkits without standalone installability. VietNormalizer addresses these gaps through a unified, rule-based pipeline that: (1) converts arbitrary integers, decimals, and large numbers to Vietnamese words; (2) normalizes dates and times to their spoken Vietnamese forms; (3) handles VND and USD currency amounts; (4) expands percentages; (5) resolves acronyms via a customizable CSV dictionary; (6) transliterates non-Vietnamese loanwords and foreign terms to Vietnamese phonetic approximations; and (7) performs Unicode normalization and emoji/special-character removal. All regular expression patterns are pre-compiled at initialization, enabling high-throughput batch processing with minimal memory overhead and no GPU or external API dependency. The library is installable via pip install vietnormalizer, available on PyPI and GitHub at this https URL, and released under the MIT license. We discuss the design decisions, limitations of existing approaches, and the generalizability of the rule-based normalization paradigm to other low-resource tonal and agglutinative languages.
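The pre-compiled-regex pipeline shape can be illustrated with a digit-by-digit reading rule (a deliberate simplification: VietNormalizer converts full numbers to Vietnamese number words, not digit strings; only the base digit names below are assumed, and this is not the library's API):

```python
import re

# Vietnamese digit names, 0-9.
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

# Pattern compiled once at import time, mirroring the library's init-time compilation.
_NUM = re.compile(r"\d+")

def spell_digits(text):
    """Replace each run of digits with its digit-by-digit spoken reading."""
    return _NUM.sub(lambda m: " ".join(DIGITS[int(ch)] for ch in m.group()), text)

normalized = spell_digits("phòng 105")
```

Compiling patterns once and applying them with a function replacement is what keeps a rule-based normalizer fast enough for batch preprocessing.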

[NLP-17] BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

[Quick Read]: This paper asks whether reinforcement learning with exact, verifiable physics rewards teaches a compact language model to genuinely reason about physics, or whether the model merely pattern-matches toward correct answers: even with mathematically exact rewards (binary correctness from symbolic solvers), a model may learn problem-specific solution templates rather than internalize physical laws such as static equilibrium equations. The key is parameter-efficient RLVR: a 1.5B-parameter model is trained on beam statics directly from solver rewards, without teacher-generated reasoning traces. The best checkpoint improves Pass@1 by 66.7% over the base model, yet its competence is anisotropic, generalizing compositionally (more loads) but failing under topological shifts (moved supports) that require the same equations, indicating procedural templates rather than deep understanding. This exposes a limitation of outcome-level alignment and suggests that verifiable rewards need structured reasoning scaffolding to move models from template matching toward robust scientific reasoning.

Link: https://arxiv.org/abs/2603.04124
Authors: Tarjei Paule Hage,Markus J. Buehler
Institutions: Massachusetts Institute of Technology
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers, without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, while continued optimization degrades robustness while maintaining reward. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of governing equations. The precision of the reward signal - even when analytically exact - does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.

[NLP-18] FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation EACL2026

[Quick Read]: This paper tackles LLMs' overly cautious and vague responses on sensitive topics, where safety is bought at the cost of helpfulness, and existing evaluation frameworks lack systematic ways to locate and repair specific weaknesses so as to improve both at once. The key is FINEST, a fine-grained response evaluation taxonomy for sensitive topics that decomposes helpfulness and harmlessness into error types across three categories: Content, Logic, and Appropriateness. On a Korean sensitive-question dataset, a score- and error-based improvement pipeline guided by FINEST significantly improves model responses across all three categories, with score-based improvement (category-specific scores plus justifications) yielding the largest gains, reducing the Appropriateness error-sentence ratio by up to 33.09%, pointing toward more explainable and comprehensive evaluation and improvement of LLM answers to sensitive questions.

Link: https://arxiv.org/abs/2603.04123
Authors: Juhyun Oh,Nayeon Lee,Chani Jung,Jiho Jin,Junho Myung,Jongwon Lee,Taeui Song,Alice Oh
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EACL 2026 Findings

Abstract:Large Language Models (LLMs) often generate overly cautious and vague responses on sensitive topics, sacrificing helpfulness for safety. Existing evaluation frameworks lack systematic methods to identify and address specific weaknesses in responses to sensitive topics, making it difficult to improve both safety and helpfulness simultaneously. To address this, we introduce FINEST, a FINE-grained response evaluation taxonomy for Sensitive Topics, which breaks down helpfulness and harmlessness into errors across three main categories: Content, Logic, and Appropriateness. Experiments on a Korean-sensitive question dataset demonstrate that our score- and error-based improvement pipeline, guided by FINEST, significantly improves the model responses across all three categories, outperforming refinement without guidance. Notably, score-based improvement – providing category-specific scores and justifications – yields the most significant gains, reducing the error sentence ratio for Appropriateness by up to 33.09%. This work lays the foundation for a more explainable and comprehensive evaluation and improvement of LLM responses to sensitive questions.

[NLP-19] Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation LREC

[Quick Read]: This paper examines how the rapid adoption of LLMs into machine translation workflows affects two established quality-prediction paradigms: source-side difficulty prediction and candidate-side quality estimation (QE). The key is a unique multi-candidate dataset from a genuine MT post-editing (MTPE) project: over 6,000 English source segments, each with nine translation hypotheses from traditional neural MT systems and advanced LLMs, evaluated against a single human post-edited reference. Using Kendall's rank correlation, the authors compare source-side difficulty metrics, candidate-side QE models, and position heuristics for predicting TER (a proxy for post-editing effort) and COMET (a proxy for human judgment), showing that the architectural shift toward LLMs alters the reliability of established prediction methods while mitigating earlier challenges in document-level translation.

Link: https://arxiv.org/abs/2603.04083
Authors: Malik Marmonier,Benoît Sagot,Rachel Bawden
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the 2026 Language Resources and Evaluation Conference (LREC)

Abstract:This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) into MT workflows is reshaping the research landscape, yet its impact on established quality prediction paradigms remains underexplored. We study this issue through a series of “hindsight” experiments on a unique, multi-candidate dataset resulting from a genuine MT post-editing (MTPE) project. The dataset consists of over 6,000 English source segments with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall’s rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models and position heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our findings highlight that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
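Kendall's rank correlation, the paper's evaluation statistic, counts concordant versus discordant pairs between two score lists; a tie-free (tau-a) sketch:

```python
def kendall_tau(xs, ys):
    """Tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Rank correlation is the right fit here because a quality predictor only needs to order candidates the same way the gold TER/COMET scores do, not match their magnitudes. (Production implementations use a tie-corrected tau-b variant.)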

[NLP-20] Monitoring Emergent Reward Hacking During Generation via Internal Activations

[Quick Read]: This paper addresses the difficulty of detecting reward-hacking behavior in fine-tuned LLMs from final outputs alone: such behavior stems from emergent misalignment and often surfaces early in generation, where output-based evaluation cannot track its dynamics. The key is an activation-based monitoring method: sparse autoencoders model residual-stream activations, and lightweight linear classifiers extract token-level estimates of reward-hacking activity from the internal representations. These signals reliably distinguish reward hacking from benign behavior across model families and fine-tuning mixtures, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal patterns during chain-of-thought reasoning, providing an earlier and more reliable signal for post-deployment safety monitoring.

Link: https://arxiv.org/abs/2603.04069
Authors: Patrick Wilhelm,Thorsten Wittkopp,Odej Kao
Institutions: Technical University of Berlin; BIFOLD - Berlin Institute for the Foundations of Learning and Data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. Our method trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to produce token-level estimates of reward-hacking activity. Across multiple model families and fine-tuning mixtures, we find that internal activation patterns reliably distinguish reward-hacking from benign behavior, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal structure during chain-of-thought reasoning. Notably, reward-hacking signals often emerge early, persist throughout reasoning, and can be amplified by increased test-time compute in the form of chain-of-thought prompting under weakly specified reward objectives. These results suggest that internal activation monitoring provides a complementary and earlier signal of emergent misalignment than output-based evaluation, supporting more robust post-deployment safety monitoring for fine-tuned language models.
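The "lightweight linear classifier on activations" step can be sketched as a tiny logistic probe on synthetic 2-D "activations" (linearly separable toy data; the paper additionally trains sparse autoencoders on the residual stream, which is omitted here):

```python
import math

def train_probe(acts, labels, lr=0.5, epochs=200):
    """Logistic-regression probe trained with plain SGD on log-loss."""
    w = [0.0] * len(acts[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                        # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Per-token flag: 1 = reward-hacking-like, 0 = benign."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Synthetic per-token "activations" with labels (illustrative only).
acts = [[2.0, 2.0], [3.0, 1.0], [2.5, 2.5], [-2.0, -1.0], [-1.0, -2.0], [-2.5, -1.5]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_probe(acts, labels)
preds = [predict(w, b, x) for x in acts]
```

Running such a probe at every generated token is what turns hidden states into the streaming, token-level monitoring signal the paper describes.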

[NLP-21] Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA EACL2026

【Quick Read】: This paper tackles the difficulty of automatically evaluating medical open-ended question answering (OEQA), in particular the high cost of relying on expert annotation. The core solution uses large language models (LLMs) as judges of semantic equivalence, with domain adaptation and lightweight fine-tuning to improve evaluation consistency. The key finding is that LLM judgments are strongly influenced by the model that generated the answer; the authors therefore apply lightweight adaptation based on supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), which reduces generator sensitivity even with limited data while reaching strong agreement with expert annotations, offering a scalable automatic evaluation scheme for low-resource medical settings.

Link: https://arxiv.org/abs/2603.04033
Authors: Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre
Affiliations: Aix-Marseille Univ., CNRS, LIS UMR 7020, 13000 Marseille, France; Nantes Univ., École Centrale Nantes, CNRS, LS2N, UMR 6004, 44000 Nantes, France; Nantes Université, CHU Nantes, PHU 11: Santé Publique, Clinique des données, INSERM, CIC 1413, 44000 Nantes, France; Nantes Université, CNRS, INSERM, L’institut du thorax, 44000 Nantes, France; Université Grenoble Alpes, CNRS, INRIA, Grenoble INP, LIG UMR 5217, 38100 Grenoble, France
Subjects: Computation and Language (cs.CL)
Comments: Accepted in HeaLing Workshop - EACL 2026

Click to view abstract

Abstract:Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.

[NLP-22] Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects

【Quick Read】: This paper addresses a bias in current evaluation of Role-Playing Agents (RPAs): existing studies rely on famous fictional characters (e.g., Harry Potter or Sherlock Holmes), so models can exploit memorized information tied to character names, limiting generalization to unseen personas. To counter this, the authors propose an anonymous evaluation method that hides character names to test genuine role-playing ability. The key elements are: (1) showing that anonymization significantly degrades role-playing performance, confirming that names carry implicit information; and (2) introducing personality augmentation, which improves role fidelity and consistency under the anonymous setting, with self-generated personality traits performing comparably to human-annotated ones, yielding a fair, scalable, and robust role-playing framework.

Link: https://arxiv.org/abs/2603.03915
Authors: Ji-Lun Peng, Yun-Nung Chen
Affiliations: National Taiwan University; Academia Sinica
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated significant potential in developing Role-Playing Agents (RPAs). However, current research primarily evaluates RPAs using famous fictional characters, allowing models to rely on memory associated with character names. This dependency creates a bias that limits the generalization of RPAs to unseen personas. To address this issue, we propose an anonymous evaluation method. Experiments across multiple benchmarks reveal that anonymization significantly degrades role-playing performance, confirming that name exposure carries implicit information. Furthermore, we investigate personality augmentation to enhance role fidelity under anonymous setting. We systematically compare the efficacy of personality traits derived from human annotations versus those self-generated by the model. Our results demonstrate that incorporating personality information consistently improves RPA performance. Crucially, self-generated personalities achieve performance comparable to human-annotated ones. This work establishes a fairer evaluation protocol and validates a scalable, personality-enhanced framework for constructing robust RPAs.

[NLP-23] From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures

【Quick Read】: This paper addresses the pressing need for fast, trustworthy threat response in web-security deployments, especially for automatically configuring security controls to mitigate emerging cyber attacks. The central challenge is efficiently extracting operationally relevant information from complex Cyber Threat Intelligence (CTI) reports. The key to the solution is leveraging hypernym-hyponym semantic relations, combined with a neuro-symbolic multi-agent system that automatically translates CTI content into CLIPS rule-language code, producing executable firewall rules that block malicious traffic. Experiments show that the hypernym-hyponym retrieval strategy outperforms several baselines and that the overall agentic architecture is more effective at threat mitigation.

Link: https://arxiv.org/abs/2603.03911
Authors: Chiara Bonfanti, Davide Colaiacomo, Luca Cagliero, Cataldo Basile
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Web security demands rapid response capabilities to evolving cyber threats. Agentic Artificial Intelligence (AI) promises automation, but the need for trustworthy security responses is of the utmost importance. This work investigates the role of semantic relations in extracting information for sensitive operational tasks, such as configuring security controls for mitigating threats. To this end, it proposes to leverage hypernym-hyponym textual relations to extract relevant information from Cyber Threat Intelligence (CTI) reports. By leveraging a neuro-symbolic approach, the multi-agent system automatically generates CLIPS code for an expert system creating firewall rules to block malicious network traffic. Experimental results show the superior performance of the hypernym-hyponym retrieval strategy compared to various baselines and the higher effectiveness of the agentic approach in mitigating threats.
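
The retrieval idea can be sketched in miniature. The hand-written taxonomy and documents below are purely illustrative (the paper extracts such relations from CTI reports, not from a fixed dictionary); the point is only how expanding a query term with its hyponyms widens retrieval:

```python
# Minimal sketch of hypernym-hyponym query expansion for retrieval.
HYPONYMS = {
    "malware": ["ransomware", "trojan", "worm"],
    "credential theft": ["phishing", "keylogging"],
}

DOCS = [
    "ransomware encrypted the file server overnight",
    "a phishing campaign targeted finance staff",
    "routine backup completed without errors",
]

def retrieve(query: str, docs=DOCS):
    """Return docs mentioning the query term or any of its hyponyms."""
    terms = [query] + HYPONYMS.get(query, [])
    return [d for d in docs if any(t in d for t in terms)]

print(retrieve("malware"))            # matches via the hyponym "ransomware"
print(retrieve("credential theft"))   # matches via the hyponym "phishing"
```

A plain keyword match for "malware" would return nothing here; the hyponym expansion is what connects the abstract query to the concrete report text.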

[NLP-24] CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

【Quick Read】: This paper targets **topic localization**, i.e., identifying spans of text that express a given topic. The core challenge is accurately locating spans that match a given topic name and description, with evaluation needed at both the word and document level. The key contribution is a human-annotated benchmark built from Czech historical documents, containing human-defined topics and manually annotated spans, with human agreement (rather than a single reference annotation) used as the evaluation standard to better reflect real model performance. The study also compares a range of LLMs against BERT-based baselines fine-tuned on a distilled dataset, finding that although some LLMs approach human-level performance, the lightweight distilled token-embedding models remain competitive, underscoring the value of efficient model design.

Link: https://arxiv.org/abs/2603.03884
Authors: Martin Kostelník, Michal Hradiš, Martin Dočekal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: this https URL.

[NLP-25] Assessing the Effectiveness of LLMs in Delivering Cognitive Behavioral Therapy LREC2026

【Quick Read】: Against the backdrop of surging demand for mental-health services, this paper examines whether large language models (LLMs) can provide accessible and scalable therapeutic support. Since LLMs have not been clinically validated for counseling, their viability as virtual therapists is unclear. The key to the study is a systematic evaluation of LLMs' ability to emulate professional Cognitive Behavioral Therapy (CBT) dialogue using two approaches: a generation-only method, and a Retrieval-Augmented Generation (RAG) approach grounded in CBT guidelines. Linguistic quality, semantic coherence, and therapeutic fidelity are quantified with natural language generation (NLG) metrics, natural language inference (NLI), and automated skills scoring. The results show that while LLMs can produce CBT-like dialogues, they remain clearly limited in conveying empathy and maintaining consistency.

Link: https://arxiv.org/abs/2603.03862
Authors: Navdeep Singh Bedi, Ana-Maria Bucur, Noriko Kando, Fabio Crestani
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026

Click to view abstract

Abstract:As mental health issues continue to rise globally, there is an increasing demand for accessible and scalable therapeutic solutions. Many individuals currently seek support from Large Language Models (LLMs), even though these models have not been validated for use in counseling services. In this paper, we evaluate LLMs’ ability to emulate professional therapists practicing Cognitive Behavioral Therapy (CBT). Using anonymized, transcribed role-play sessions between licensed therapists and clients, we compare two approaches: (1) a generation-only method and (2) a Retrieval-Augmented Generation (RAG) approach using CBT guidelines. We evaluate both proprietary and open-source models for linguistic quality, semantic coherence, and therapeutic fidelity using standard natural language generation (NLG) metrics, natural language inference (NLI), and automated scoring for skills assessment. Our results indicate that while LLMs can generate CBT-like dialogues, they are limited in their ability to convey empathy and maintain consistency.

[NLP-26] Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling EACL2026

【Quick Read】: This paper addresses a limitation of existing models for Rhetorical Role Labeling (RRL): hierarchical models capture local syntactic dependencies well but cannot model corpus-level patterns, which hurts performance on low-frequency rhetorical roles. The key to the solution is two prototype-based methods: Prototype-Based Regularization (PBR), which learns soft prototypes via a distance-based auxiliary loss to structure the latent space, and Prototype-Conditioned Modulation (PCM), which constructs corpus-level prototypes and injects them during training and inference, fusing in global semantic information. Both methods substantially improve recognition of fine-grained rhetorical roles (e.g., rare classes in legal and medical text), yielding consistent Macro-F1 gains across multiple benchmarks.

Link: https://arxiv.org/abs/2603.03856
Authors: Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Mary Catherine Lavissiere, Christine Jacquin, Richard Dufour
Affiliations: Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France; University of Lorraine, France
Subjects: Computation and Language (cs.CL)
Comments: Accepted at EACL 2026

Abstract:Rhetorical Role Labeling (RRL) identifies the functional role of each sentence in a document, a key task for discourse understanding in domains such as law and medicine. While hierarchical models capture local dependencies effectively, they are limited in modeling global, corpus-level features. To address this limitation, we propose two prototype-based methods that integrate local context with global representations. Prototype-Based Regularization (PBR) learns soft prototypes through a distance-based auxiliary loss to structure the latent space, while Prototype-Conditioned Modulation (PCM) constructs corpus-level prototypes and injects them during training and inference. Given the scarcity of RRL resources, we introduce SCOTUS-Law, the first dataset of U.S. Supreme Court opinions annotated with rhetorical roles at three levels of granularity: category, rhetorical function, and step. Experiments on legal, medical, and scientific benchmarks show consistent improvements over strong baselines, with 4 Macro-F1 gains on low-frequency roles. We further analyze the implications in the era of Large Language Models and complement our findings with expert evaluation.
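
The PBR idea, a distance-based auxiliary loss pulling each sentence embedding toward the prototype of its rhetorical role, can be sketched in a few lines. This is an illustrative simplification (2-D embeddings, fixed prototypes), not the paper's implementation:

```python
import numpy as np

def pbr_loss(embeddings, labels, prototypes):
    """Distance-based auxiliary loss: mean squared distance between each
    sentence embedding and the prototype of its gold rhetorical role."""
    diffs = embeddings - prototypes[labels]
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

rng = np.random.default_rng(1)
protos = np.array([[0.0, 0.0], [3.0, 3.0]])        # one prototype per role
emb = protos[[0, 0, 1, 1]] + rng.normal(0, 0.1, (4, 2))
labels = np.array([0, 0, 1, 1])

aligned = pbr_loss(emb, labels, protos)
swapped = pbr_loss(emb, np.array([1, 1, 0, 0]), protos)  # wrong role labels
print(aligned < swapped)  # True: embeddings near their own prototype score lower
```

Minimizing this term alongside the classification loss is what structures the latent space so that role clusters, including rare ones, stay compact.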

[NLP-27] Benchmarking Motivational Interviewing Competence of Large Language Models

【Quick Read】: This paper evaluates the competence of generative AI (LLMs) at delivering Motivational Interviewing (MI) in real clinical settings, asking whether LLMs can effectively emulate human therapists' MI behavior and remain indistinguishable from them. The key elements are: (1) a systematic evaluation of 10 proprietary and open-source LLMs under the Motivational Interviewing Treatment Integrity (MITI) 4.2 framework, covering handcrafted dialogues and 34 real-world clinical transcripts; (2) a composite ranking system quantifying performance on core MI components (e.g., complex reflections and the reflection-to-question ratio); and (3) a blinded experiment in which two psychiatrists judge whether utterances come from a human or an LLM. Results show that several LLMs reach good MI competence on real clinical transcripts and that their outputs are hard for professionals to identify, underscoring their potential for expanding MI interventions in low-resource settings.

Link: https://arxiv.org/abs/2603.03846
Authors: Aishwariya Jha, Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 6 figures, 2 tables

Click to view abstract

Abstract:Motivational interviewing (MI) promotes behavioural change in substance use disorders. Its fidelity is measured using the Motivational Interviewing Treatment Integrity (MITI) framework. While large language models (LLMs) can potentially generate MI-consistent therapist responses, their competence using MITI is not well-researched, especially in real world clinical transcripts. We aim to benchmark MI competence of proprietary and open-source models compared to human therapists in real-world transcripts and assess distinguishability from human therapists. Methods: We shortlisted 3 proprietary and 7 open-source LLMs from LMArena, evaluated performance using MITI 4.2 framework on two datasets (96 handcrafted model transcripts, 34 real-world clinical transcripts). We generated parallel LLM-therapist utterances iteratively for each transcript while keeping client responses static, and ranked performance using a composite ranking system with MITI components and verbosity. We conducted a distinguishability experiment with two independent psychiatrists to identify human-vs-LLM responses. Results: All 10 tested LLMs had fair (MITI global scores ≥ 3.5) to good (MITI global scores ≥ 4) competence across MITI measures, and three best-performing models (gemma-3-27b-it, gemini-2.5-pro, grok-3) were tested on real-world transcripts. All showed good competence, with LLMs outperforming the human expert in Complex Reflection percentage (39% vs 96%) and Reflection-Question ratio (1.2 vs 2.8). In the distinguishability experiment, psychiatrists identified LLM responses with only 56% accuracy, with d-prime: 0.17 and 0.25 for gemini-2.5-pro and gemma-3-27b-it respectively. Conclusion: LLMs can achieve good MI proficiency in real-world clinical transcripts using MITI framework. These findings suggest that even open-source LLMs are viable candidates for expanding MI counselling sessions in low-resource settings.

[NLP-28] Semantic Bridging Domains: Pseudo-Source as Test-Time Connector

【Quick Read】: This paper addresses distribution shift between training and test data, particularly in realistic settings where the source domain is unknown and the target domain is unlabeled. Prior work builds a pseudo-source domain via data generation or translation and aligns the target to it directly, but large discrepancies between the pseudo-source and the true source can make the model diverge. The key idea of the proposed Stepwise Semantic Alignment (SSA) is to treat the pseudo-source as a semantic bridge connecting source and target rather than a substitute for the source: easily accessible universal semantics first rectify the semantic features of the pseudo-source, and the target domain is then aligned to the corrected semantics. A Hierarchical Feature Aggregation (HFA) module and a Confidence-Aware Complementary Learning (CACL) strategy further improve the quality of semantic alignment in the absence of source data and target ground truth.

Link: https://arxiv.org/abs/2603.03844
Authors: Xizhong Yang, Huiming Wang, Ning Xu, Mofei Song
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 25 pages

Click to view abstract

Abstract:Distribution shifts between training and testing data are a critical bottleneck limiting the practical utility of models, especially in real-world test-time scenarios. To adapt models when the source domain is unknown and the target domain is unlabeled, previous works constructed pseudo-source domains via data generation and translation, then aligned the target domain with them. However, significant discrepancies exist between the pseudo-source and the original source domain, leading to potential divergence when correcting the target directly. From this perspective, we propose a Stepwise Semantic Alignment (SSA) method, viewing the pseudo-source as a semantic bridge connecting the source and target, rather than a direct substitute for the source. Specifically, we leverage easily accessible universal semantics to rectify the semantic features of the pseudo-source, and then align the target domain using the corrected pseudo-source semantics. Additionally, we introduce a Hierarchical Feature Aggregation (HFA) module and a Confidence-Aware Complementary Learning (CACL) strategy to enhance the semantic quality of the SSA process in the absence of source and ground truth of target domains. We evaluated our approach on tasks like semantic segmentation and image classification, achieving a 5.2% performance boost on GTA2Cityscapes over the state-of-the-art.

[NLP-29] SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的代码生成与修复方法在真实软件开发场景中适用性不足的问题,尤其是现有静态、一次性修复范式无法捕捉复杂需求变更和长期功能迭代所导致的代码质量退化问题。解决方案的关键在于提出首个基于持续集成(Continuous Integration, CI)流程的仓库级评估基准 SWE-CI,其通过构建平均跨度233天、包含71次连续提交的真实代码演化任务(共100个),要求代理在多轮分析与编码迭代中系统性地完成任务,从而将代码生成能力的评估维度从静态的功能正确性转向动态的长期可维护性。

链接: https://arxiv.org/abs/2603.03823
作者: Jialong Chen,Xander Xu,Hu Wei,Chuan Chen,Bing Zhao
机构: Sun Yat-sen University (中山大学); Alibaba Group (阿里巴巴集团)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations – a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.

[NLP-30] 2S-Bench Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

【Quick Read】: This paper investigates whether explicitly constructing text structure can improve LLM performance on complex text-processing tasks. The key solution is Structure of Thought (SoT), a prompting technique that guides models to build intermediate text structures during reasoning, strengthening their ability to understand and organize information. The authors also introduce T2S-Bench, the first benchmark dedicated to evaluating text-to-structure ability, covering 6 scientific domains and 32 structural types, for systematically measuring and improving models' text-structuring capability. Experiments show that SoT improves performance by an average of 5.7% across eight diverse tasks, and fine-tuning on T2S-Bench raises the gain to +8.6%, confirming that explicit text structuring substantially benefits model performance.

Link: https://arxiv.org/abs/2603.03790
Authors: Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen, Helen Li, Yiran Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Dataset and Code have been released at this https URL

Click to view abstract

Abstract:Think about how human handles complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text-processing performance? To explore it, in this work, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial improvement potential: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. Dataset and eval code have been released at this https URL.

[NLP-31] MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier

【Quick Read】: This paper tackles the mathematical intractability of directly modeling the generative reasoning process of scientific discovery, P(hypothesis|background), whose root cause is the combinatorial O(N^k) blow-up of retrieving and composing inspirations from a vast knowledge base. The key to the proposed MOOSE-Star framework is three techniques that make training tractable and inference scalable: (1) training on subtasks decomposed from the probabilistic equation of discovery; (2) motivation-guided hierarchical search, which enables logarithmic retrieval and prunes irrelevant subspaces; and (3) bounded composition for robustness against retrieval noise, reducing complexity from exponential to logarithmic, O(log N).

Link: https://arxiv.org/abs/2603.03756
Authors: Zonglin Yang, Lidong Bing
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, P(hypothesis|background), i.e. P(h|b), unexplored. We demonstrate that directly training P(h|b) is mathematically intractable due to the combinatorial complexity O(N^k) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework enabling tractable training and scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic, O(log N), by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Furthermore, we show that while brute-force sampling hits a "complexity wall," MOOSE-Star exhibits continuous test-time scaling.
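
A back-of-the-envelope comparison makes the complexity claim concrete. The numbers below are illustrative (knowledge-base size and branching factor are invented, not taken from the paper):

```python
import math

def brute_force_combinations(n: int, k: int) -> int:
    """Candidate inspiration sets when composing k items from n papers: O(N^k)."""
    return math.comb(n, k)

def hierarchical_steps(n: int, branching: int = 10) -> int:
    """Levels visited by a search over a balanced taxonomy: O(log N)."""
    steps, capacity = 1, branching
    while capacity < n:
        capacity *= branching
        steps += 1
    return steps

n = 100_000  # hypothetical number of papers in the knowledge base
print(brute_force_combinations(n, 3))  # ~1.7e14 candidate triples
print(hierarchical_steps(n))           # 5 levels for a 10-ary hierarchy
```

Even at k = 3, exhaustive composition is some fourteen orders of magnitude away from feasible, while a pruned hierarchical descent touches only a handful of levels.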

[NLP-32] Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning EACL2026

【Quick Read】: This paper addresses the tension between the superior but costly reasoning of large language models (LLMs) and the limited reasoning ability of small language models (SLMs). The key to the proposed COllaborative REAsoner (COREA) is cascading an SLM with an LLM to balance cost and accuracy in complex reasoning tasks: the SLM answers first, emitting both an answer and a verbalized confidence score, and questions whose confidence falls below a predefined threshold are deferred to the LLM for more accurate resolution. A reinforcement-learning training algorithm with an additional confidence-calibration reward further improves the SLM's confidence estimates, boosting the overall system's efficiency and reliability.

Link: https://arxiv.org/abs/2603.03752
Authors: Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu, Henan Wang, Xavier Wang, Yaxiao Liu
Affiliations: Amazon Web Services; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EACL 2026 Main Conference

Click to view abstract

Abstract:Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a system that cascades an SLM with an LLM to achieve a balance between accuracy and cost in complex reasoning tasks. COREA first attempts to answer questions using the SLM, which outputs both an answer and a verbalized confidence score. Questions with confidence below a predefined threshold are deferred to the LLM for more accurate resolution. We introduce a reinforcement learning-based training algorithm that aligns the SLM’s confidence through an additional confidence calibration reward. Extensive experiments demonstrate that our method jointly improves the SLM’s reasoning ability and confidence calibration across diverse datasets and model backbones. Compared to using the LLM alone, COREA reduces cost by 21.5% and 16.8% on out-of-domain math and non-math datasets, respectively, with only an absolute pass@1 drop within 2%.
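
The routing logic of such a cascade is simple to sketch. The stand-in models and the 0.7 threshold below are hypothetical; a real deployment would call actual SLM/LLM endpoints and tune the threshold on validation data:

```python
def cascade_answer(question, slm, llm, threshold=0.7):
    """Answer with the small model first; defer to the large model when the
    small model's verbalized confidence falls below the threshold."""
    answer, confidence = slm(question)
    if confidence >= threshold:
        return answer, "slm"
    return llm(question), "llm"

# Hypothetical stand-ins: (answer, confidence) from the SLM, answer from the LLM.
slm = lambda q: ("4", 0.95) if q == "2+2?" else ("unsure", 0.3)
llm = lambda q: "42"

print(cascade_answer("2+2?", slm, llm))              # ('4', 'slm')
print(cascade_answer("meaning of life?", slm, llm))  # ('42', 'llm')
```

The cost saving comes entirely from how often the first branch fires, which is why calibrating the SLM's confidence (here, the second element it returns) is the crux of the method.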

[NLP-33] ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement

【Quick Read】: This paper addresses the syntactic and semantic errors in SQL queries initially generated by large language models (LLMs) for text-to-SQL, where existing self-debugging and self-correction methods underperform for lack of explicit error signals and precise error modeling. The key to the proposed ErrorLLM framework is explicitly modeling text-to-SQL errors within a dedicated LLM: the user question and database schema are represented as structural features; static detection identifies execution failures and surface mismatches; dedicated error tokens are added to the model's semantic space to capture categorized implicit semantic error types; and a carefully designed training strategy teaches the model to predict these error tokens from the structural representations. This enables high-precision detection of complex implicit errors and error-guided refinement of the SQL structure, substantially improving text-to-SQL accuracy and robustness.

Link: https://arxiv.org/abs/2603.03742
Authors: Zijin Hong, Hao Chen, Zheng Yuan, Qinggang Zhang, Luyao Zhuang, Qing Liao, Feiran Huang, Yangqiu Song, Xiao Huang
Affiliations: The Hong Kong Polytechnic University; City University of Macau; Harbin Institute of Technology (Shenzhen); Beihang University; The Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:Despite the remarkable performance of large language models (LLMs) in text-to-SQL (SQL generation), correctly producing SQL queries remains challenging during initial generation. The SQL refinement task is subsequently introduced to correct syntactic and semantic errors in generated SQL queries. However, existing paradigms face two major limitations: (i) self-debugging becomes increasingly ineffective as modern LLMs rarely produce explicit execution errors that can trigger debugging signals; (ii) self-correction exhibits low detection precision due to the lack of explicit error modeling grounded in the question and schema, and suffers from severe hallucination that frequently corrupts correct SQLs. In this paper, we propose ErrorLLM, a framework that explicitly models text-to-SQL Errors within a dedicated LLM for text-to-SQL refinement. Specifically, we represent the user question and database schema as structural features, employ static detection to identify execution failures and surface mismatches, and extend ErrorLLM’s semantic space with dedicated error tokens that capture categorized implicit semantic error types. Through a well-designed training strategy, we explicitly model these errors with structural representations, enabling the LLM to detect complex implicit errors by predicting dedicated error tokens. Guided by the detected errors, we perform error-guided refinement on the SQL structure by prompting LLMs. Extensive experiments demonstrate that ErrorLLM achieves the most significant improvements over backbone initial generation. Further analysis reveals that detection quality directly determines refinement effectiveness, and ErrorLLM addresses both sides by achieving a high detection F1 score while maintaining refinement effectiveness.

[NLP-34] Order Is Not Layout: Order-to-Space Bias in Image Generation

【Quick Read】: This paper studies a systematic bias in modern image generation models: the mention order of entities in text spuriously determines spatial layout and entity-role binding in the image, a phenomenon the authors term Order-to-Space Bias (OTS). OTS can override grounded cues in the scene, producing incorrect spatial arrangements or swapped roles. To quantify it, the authors introduce OTS-Bench, which isolates order effects with paired prompts differing only in entity order and evaluates models along two dimensions: homogenization and correctness. Experiments show that OTS is widespread in current mainstream image generation models, is primarily data-driven, and manifests in the early stages of layout formation. Building on this insight, two effective mitigations are proposed: targeted fine-tuning and an early-stage intervention strategy, both of which substantially reduce OTS while preserving generation quality.

Link: https://arxiv.org/abs/2603.03714
Authors: Yongkang Zhang, Zonglin Zhao, Yuechen Zhang, Fei Ding, Pei Li, Wenxuan Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:We study a systematic bias in modern image generation models: the mention order of entities in text spuriously determines spatial layout and entity–role binding. We term this phenomenon Order-to-Space Bias (OTS) and show that it arises in both text-to-image and image-to-image generation, often overriding grounded cues and causing incorrect layouts or swapped assignments. To quantify OTS, we introduce OTS-Bench, which isolates order effects with paired prompts differing only in entity order and evaluates models along two dimensions: homogenization and correctness. Experiments show that Order-to-Space Bias (OTS) is widespread in modern image generation models, and provide evidence that it is primarily data-driven and manifests during the early stages of layout formation. Motivated by this insight, we show that both targeted fine-tuning and early-stage intervention strategies can substantially reduce OTS, while preserving generation quality.

[NLP-35] CONCUR: Benchmarking LLM s for Concurrent Code Generation

【Quick Read】: This paper addresses the gap in evaluating the concurrent-code-generation abilities of large language models (LLMs). Existing benchmarks focus on sequential code and cannot effectively measure performance on concurrent code, which is more complex and has its own bug classes (such as deadlocks and race conditions), making traditional benchmarks unsuitable for this setting. The key contribution is CONCUR, a dedicated benchmark of 115 problems: 43 base problems derived from a standard concurrency textbook plus 72 validated mutant variants. The base problems ensure coverage of the semantic core, while the mutant variants add linguistic and structural diversity, enabling a comprehensive assessment of LLMs' concurrent code generation.

Link: https://arxiv.org/abs/2603.03683
Authors: Jue Huang, Tarek Mahmud, Corina Pasareanu, Guowei Yang
Affiliations: The University of Queensland; Texas A&M University - Kingsville; Carnegie Mellon University
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Leveraging Large Language Models (LLMs) for code generation has increasingly emerged as a common practice in the domain of software engineering. Relevant benchmarks have been established to evaluate the code generation capabilities of LLMs. However, existing benchmarks focus primarily on sequential code, lacking the ability to effectively evaluate LLMs on concurrent code generation. Compared to sequential code, concurrent code exhibits greater complexity and possesses unique types of bugs, such as deadlocks and race conditions, that do not occur in sequential code. Therefore, a benchmark for evaluating sequential code generation cannot be useful for evaluating concurrent code generation with LLMs. To address this gap, we designed a benchmark CONCUR specifically aimed at evaluating the capability of LLMs to generate concurrent code. CONCUR consists of a base set of 43 concurrency problems derived from a standard concurrency textbook, together with 72 validated mutant variants, resulting in 115 total problems. The base problems serve as the semantic core of the benchmark, while the mutants expand linguistic and structural diversity. We conducted an evaluation of a range of LLMs on CONCUR, highlighting limitations of current models. Overall, our work provides a novel direction for evaluating the capability of LLMs to generate code with focus on concurrency.
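
For readers unfamiliar with why concurrent code needs its own benchmark, the canonical bug class is easy to demonstrate. The toy below is not from CONCUR; it shows the shared-counter pattern whose unsynchronized variant is a textbook race condition:

```python
import threading

counter = 0
lock = threading.Lock()

def add(n: int) -> None:
    global counter
    for _ in range(n):
        # Without this lock, "counter += 1" is a read-modify-write that two
        # threads can interleave, silently losing increments -- exactly the
        # kind of defect sequential-code benchmarks never exercise.
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock; typically lower without it
```

Generating the locked version reliably, and avoiding the deadlocks that careless lock ordering introduces, is precisely the competence CONCUR probes.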

[NLP-36] MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation

【Quick Read】: This paper targets two core challenges for large language models (LLMs) in psychiatric consultation: (1) the lack of grounding in clinical criteria, which leads to unsupported clinical assertions when symptoms are atypical or underspecified; and (2) the difficulty, across multi-turn interactions, of controlling inquiry drift and of optimizing questioning strategy for information yield and diagnostic accuracy. The key to the proposed MIND framework, a unified inquiry-diagnosis reinforcement-learning system, is a Criteria-Grounded Psychiatric Reasoning Bank (PRB) that summarizes dialogue context into retrievable clinical states, recalls semantically similar reference consultations, and distills reusable, criteria-grounded clinical supports. On this foundation, rubric-based process rewards provide fine-grained supervision over intermediate decision steps, and a value-aware trajectory rectification mechanism jointly optimizes information acquisition and diagnostic decision-making across turns.

Link: https://arxiv.org/abs/2603.03677
Authors: Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen, Yafeng Deng
Affiliations: EverMind AI Inc.; Tianqiao and Chrissy Chen Institute; Shanghai Mental Health Center; Shanghai Jiao Tong University School of Medicine
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract psychopathological cues from incomplete and inconsistent patient reports in multi-turn interactions and perform rigorous differential diagnostic reasoning. However, existing methods face two fundamental challenges. First, without criteria-grounded clinical supports, they are prone to unsupported clinical assertions when symptoms are atypical or underspecified. Second, in multi-turn interactions, they struggle to mitigate inquiry drift (off-topic or low-yield questioning) and optimize questioning strategies. To address these challenges, we propose MIND, a unified inquiry–diagnosis reinforcement learning framework for psychiatric consultation. Specifically, we build a Criteria-Grounded Psychiatric Reasoning Bank (PRB) that summarizes dialogue context into clinical retrieval states, retrieves semantically similar reference consultations, and distills reusable criteria-grounded clinical supports to guide criteria-aligned inquiry and reasoning. Building on this foundation, MIND enforces explicit clinical reasoning with rubric-based process rewards to provide fine-grained supervision over intermediate decision steps, and incorporates a value-aware trajectory rectification mechanism to jointly improve information acquisition and diagnostic decision-making across turns. Extensive experiments demonstrate that MIND consistently outperforms strong baselines in diagnostic accuracy, empathetic interaction quality, interpretability, and generalization.

[NLP-37] Linguistically Informed Graph Model and Semantic Contrastive Learning for Korean Short Text Classification DASFAA2026

【速读】: 该论文旨在解决短文本分类(Short Text Classification, STC)中因上下文信息匮乏和标注数据稀缺而导致的性能瓶颈问题,尤其针对现有方法主要聚焦于英语而忽视韩语等屈折语言特性的局限性。其关键解决方案是提出一种分层异构图模型LIGRAM,通过构建词素、词性及命名实体层级的子图,并进行层次化整合,以弥补短文本中上下文信息不足的问题,同时精准捕捉韩语固有的语法与语义依赖关系;此外,引入语义感知对比学习(Semantics-aware Contrastive Learning, SemCon),增强文档间的语义相似性建模能力,从而在类别边界模糊的短文本场景下建立更清晰的决策边界。

链接: https://arxiv.org/abs/2603.03652
作者: JaeGeon Yoo,Byoungwook Kim,Yeongwook Yang,Hong-Jun Jang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 1 Figure, Accepted at DASFAA 2026 (Full Research Paper)

点击查看摘要

Abstract:Short text classification (STC) remains a challenging task due to the scarcity of contextual information and labeled data. However, existing approaches have predominantly focused on English because most benchmark datasets for the STC are primarily available in English. Consequently, existing methods seldom incorporate the linguistic and structural characteristics of Korean, such as its agglutinative morphology and flexible word order. To address these limitations, we propose LIGRAM, a hierarchical heterogeneous graph model for Korean short-text classification. The proposed model constructs sub-graphs at the morpheme, part-of-speech, and named-entity levels and hierarchically integrates them to compensate for the limited contextual information in short texts while precisely capturing the grammatical and semantic dependencies inherent in Korean. In addition, we apply Semantics-aware Contrastive Learning (SemCon) to reflect semantic similarity across documents, enabling the model to establish clearer decision boundaries even in short texts where class distinctions are often ambiguous. We evaluate LIGRAM on four Korean short-text datasets, where it consistently outperforms existing baseline models. These outcomes validate that integrating language-specific graph representations with SemCon provides an effective solution for short text classification in agglutinative languages such as Korean.

[NLP-38] A Neural Topic Method Using a Large-Language-Model-in-the-Loop for Business Research

【速读】: 该论文旨在解决现有主题建模方法在商业研究中作为测量工具时表现不佳的问题,具体表现为概率模型生成的概念模糊的主题、神经主题模型在理论驱动场景下难以解释,以及大语言模型方法缺乏标准化、稳定性与文档级表示的一致性。其解决方案的关键在于提出LX Topic,一种将主题视为潜在语言构念的神经主题方法,通过结合FASTopic确保文档代表性,并引入基于对齐和置信度加权机制的大语言模型层面精炼,从而在不扭曲文档-主题分布的前提下提升语义一致性;同时,LX Topic实现了主题发现、优化与标准化输出的统一,构建了一个可复现、可解释且以测量为导向的主题建模工具,显著提升了主题质量并保持了聚类与分类性能。

链接: https://arxiv.org/abs/2603.03623
作者: Stephan Ludwig,Peter J. Danaher,Xiaohao Yang
机构: 未知
类目: Computation and Language (cs.CL); Econometrics (econ.EM)
备注:

点击查看摘要

Abstract:The growing use of unstructured text in business research makes topic modeling a central tool for constructing explanatory variables from reviews, social media, and open-ended survey responses, yet existing approaches function poorly as measurement instruments. Prior work shows that textual content predicts outcomes such as sales, satisfaction, and firm performance, but probabilistic models often generate conceptually diffuse topics, neural topic models are difficult to interpret in theory-driven settings, and large language model approaches lack standardization, stability, and alignment with document-level representations. We introduce LX Topic, a neural topic method that conceptualizes topics as latent linguistic constructs and produces calibrated document-level topic proportions for empirical analysis. LX Topic builds on FASTopic to ensure strong document representativeness and integrates large language model refinement at the topic-word level using alignment and confidence-weighting mechanisms that enhance semantic coherence without distorting document-topic distributions. Evaluations on large-scale Amazon and Yelp review datasets demonstrate that LX Topic achieves the highest overall topic quality relative to leading models while preserving clustering and classification performance. By unifying topic discovery, refinement, and standardized output in a web-based system, LX Topic establishes topic modeling as a reproducible, interpretable, and measurement-oriented instrument for marketing research and practice.

[NLP-39] Why Are Linear RNNs More Parallelizable?

【速读】: 该论文旨在解决一个关键问题:为何线性循环神经网络(Linear RNNs, LRNNs)在实践中比传统非线性RNN更容易并行化,类似于Transformer架构,尽管两者结构差异显著。其解决方案的关键在于建立RNN类型与计算复杂度类之间的紧密联系——通过理论分析表明,LRNN可被建模为深度为对数级(log-depth)的算术电路,仅比Transformer所对应的布尔电路增加轻微的深度开销;而相比之下,非线性RNN能求解L\mathsf{L}-完全甚至P\mathsf{P}-完全问题,揭示了其本质上的串行计算瓶颈。此外,论文进一步区分了不同LRNN变体的细粒度表达能力:置换-对角LRNN属于NC1\mathsf{NC}^1-完全,而对角加低秩LRNN则更强大(PNC1\mathsf{PNC}^1-完全),并通过自动机理论模型映射每类RNN的模拟能力,从而系统性地揭示了非线性RNN与各类LRNN之间在表达能力和并行效率间的根本权衡,为设计兼具高表达力与高效并行性的大语言模型(LLM)架构提供了理论基础。

链接: https://arxiv.org/abs/2603.03612
作者: William Merrill,Hongjian Jiang,Yanhong Li,Ashish Sabharwal
机构: 未知
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs – but not traditional, nonlinear RNNs – as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
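
下面用一个极简的 Python 示意说明 LRNN 可并行化的来源:线性递推 $h_t = a_t h_{t-1} + b_t$ 的逐步组合是一个结合运算,因而原则上可用并行前缀扫描在对数深度内完成。此处为说明结合性而写成串行累积,数值与函数名均为演示用假设,并非论文实现。

```python
# 示意:对角线性RNN的递推 h_t = a_t*h_{t-1} + b_t 可写成结合运算
# (a1, b1) ∘ (a2, b2) = (a1*a2, a2*b1 + b2),
# 因此能以 O(log T) 深度并行(前缀扫描);非线性RNN缺少这种结构。

def combine(x, y):
    a1, b1 = x
    a2, b2 = y
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(pairs, h0=0.0):
    # 逐步递推:并行化前的参考实现
    h, out = h0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out

def parallel_style_scan(pairs, h0=0.0):
    # 用前缀组合得到同样的结果(此处串行实现,仅示意结合性)
    out, acc = [], (1.0, 0.0)  # (1, 0) 是该运算的单位元
    for p in pairs:
        acc = combine(acc, p)
        a, b = acc
        out.append(a * h0 + b)
    return out

pairs = [(0.5, 1.0), (2.0, -1.0), (1.0, 0.5)]
assert sequential_scan(pairs) == parallel_style_scan(pairs)
```

两种扫描产出相同的隐藏状态序列,这正是"线性递推可重排为对数深度计算"的直观形式。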

[NLP-40] Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

【速读】: 该论文旨在解决如何利用大型语言模型(Large Language Models, LLMs)模拟不同人口统计群体对虚假信息的易感性问题,其核心挑战在于不同群体因信念差异而表现出不同的信息辨别能力。解决方案的关键在于提出BeliefSim框架,该框架通过心理学启发的分类体系和调查先验构建人口统计学信念特征画像,并结合提示词条件控制与后训练适应策略,从而有效将信念作为主要驱动因素用于模拟虚假信息易感性。实验表明,该方法在多个数据集和建模策略下均能实现高达92%的准确率,且具备良好的反事实人口统计敏感性,验证了信念作为先验知识在模拟人类行为中的强指导作用。

链接: https://arxiv.org/abs/2603.03585
作者: Angana Borah,Zohaib Khan,Rada Mihalcea,Verónica Pérez-Rosas
机构: University of Michigan - Ann Arbor, USA; Texas State University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Paper Under Review

点击查看摘要

Abstract:Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility accuracy and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with accuracy up to 92%.

[NLP-41] ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer ICLR2026

【速读】: 该论文试图解决现代语言模型依赖固定预定义子词分词(subword tokenization)所带来的局限性问题,这种固定粒度的分词方式会导致模型在处理复杂语义时表现出脆弱且不符合直觉的行为。解决方案的关键在于提出一种名为ByteFlow Net的新颖分层架构,该架构完全摒弃了传统分词器,转而通过压缩驱动的分割策略,使模型能够从原始字节流中自适应地学习语义有意义的单元边界;其核心机制基于潜在表示的编码率(coding rate)来确定分割点,并借助Top-K选择保持静态计算图,从而实现端到端、无需人工设计归纳偏置的Tokenizer-free建模,显著提升了模型性能。

链接: https://arxiv.org/abs/2603.03583
作者: Chunyuan Deng,Sanket Lokegaonkar,Colin Lockard,Besnik Fetahu,Nasser Zalmout,Xian Li
机构: Rice University(莱斯大学); Amazon Science(亚马逊科学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICLR 2026

点击查看摘要

Abstract:Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce **ByteFlow Net**, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries *while preserving a static computation graph* via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
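
结合摘要中的机制,下面给出"按分数做 Top-K 边界选择"的玩具示意:为每个字节位置赋一个编码率分数,取分数最高的 K 个位置作为块边界,从而在块数固定(静态计算图)的前提下得到自适应切分。分数、K 值与函数名均为假设,并非 ByteFlow Net 的官方实现。

```python
# 示意:按每个位置的"编码率"分数做 Top-K 边界选择,
# 把字节流切成自适应的块(分数此处随意给定,仅演示机制)。

def topk_segment(byte_stream, scores, k):
    assert len(byte_stream) == len(scores)
    # 取分数最高的 k 个位置作为块的结束边界,保证块数固定
    boundaries = sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])
    chunks, start = [], 0
    for b in boundaries:
        chunks.append(byte_stream[start:b + 1])
        start = b + 1
    if start < len(byte_stream):
        chunks.append(byte_stream[start:])
    return chunks

stream = b"hello world"
scores = [0.1, 0.2, 0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.3]
print(topk_segment(stream, scores, k=2))  # [b'hello', b' ', b'world']
```

边界分数高的位置(如词与空格的交界)自然成为切分点,这对应摘要所说"自适应边界 + 静态 Top-K 选择"的组合。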

[NLP-42] Build Judge Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

【速读】: 该论文旨在解决对话式购物助手(Conversational Shopping Assistants, CSAs)从原型到生产部署过程中面临的两大挑战:一是如何有效评估多轮交互质量,二是如何优化紧密耦合的多智能体系统。针对这些问题,论文提出了一套结构化的评估框架,将端到端购物质量分解为多个可量化维度,并构建了基于大语言模型(LLM)作为评判者(LLM-as-judge)的校准流水线,该流水线与人工标注对齐。在此基础上,论文进一步探索两种互补的提示优化策略:其一是子智能体GEPA(Sub-agent GEPA),针对每个智能体节点使用局部评估标准进行优化;其二是多智能体多轮GEPA(MAMuT GEPA),一种新型系统级方法,通过多轮模拟和轨迹级评分联合优化跨智能体的提示参数。解决方案的关键在于将评估体系结构化、可校准,并结合系统级优化机制实现多智能体协同性能提升。

链接: https://arxiv.org/abs/2603.03565
作者: Alejandro Breen Herrera,Aayush Sheth,Steven G. Xu,Zhucheng Zhan,Charles Wright,Marcus Yearwood,Hongtai Wei,Sudeep Das
机构: WithMetis.ai(与Metis.ai); DoorDash(DoorDash)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on a SOTA prompt-optimizer called GEPA (Shao et al., 2025): (1) Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi-Agent Multi-Turn) GEPA (Herrera et al., 2026), a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs.

[NLP-43] Tucano 2 Cool: Better Open Source LLMs for Portuguese

【速读】: 该论文旨在解决葡萄牙语大语言模型(Large Language Models, LLMs)在开源生态中存在资源不足与性能局限的问题,特别是在高质量训练数据、多样化任务适配能力及可复现性方面的缺失。解决方案的关键在于构建了Tucano 2系列模型(Base、Instruct和Think),其核心创新包括:扩展并提升质量的GigaVerbo-v2语料库,引入合成数据集GigaVerbo-v2 Synth以填补原始数据空白,以及设计用于监督微调(Supervised Fine-Tuning, SFT)和偏好对齐的Post-training数据集(GigaVerbo-v2 SFT与GigaVerbo-v2 Preferences),从而支持检索增强生成、代码编写、工具调用、链式思维推理等多场景应用;同时通过系统性的消融实验优化预训练与持续预训练策略,并提供完整的训练配方、日志和源代码,确保模型性能可验证、可复现且易于扩展。

链接: https://arxiv.org/abs/2603.03543
作者: Nicholas Kluge Corrêa,Aniket Sen,Shiza Fatimah,Sophia Falk,Lennard Landgraf,Julia Kastner,Lucie Flek
机构: Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence; Center for Science and Thought; Helmholtz-Institut für Strahlen- und Kernphysik; Bonn Sustainable AI Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5-3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling missing gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains like retrieval augmented generation, coding, tool use, chain-of-thought reasoning, and many other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.

[NLP-44] RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在医疗领域应用中评估不足的问题,尤其是现有基准测试仅聚焦于简单的多选题(Multiple-Choice Question, MCQ)任务,且评价指标无法准确捕捉复杂问答任务所需的语义精度,导致开发者难以区分错误源于检索模块还是生成模块,从而阻碍针对性优化。其解决方案的关键在于提出RAG-X诊断框架,该框架通过信息抽取、短答案生成和多选题回答三类任务独立评估检索器与生成器,并引入上下文利用效率(Context Utilization Efficiency, CUE)指标,将系统表现分解为可解释的四象限,从而区分真实基于证据的准确性与表面虚假准确性的差异,揭示隐藏的失败模式,提升临床RAG系统的可验证性和安全性。

链接: https://arxiv.org/abs/2603.03541
作者: Aswini Sivakumar,Vijayan Sugumaran,Yao Qiang
机构: Oakland University (奥克兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:Automated question-answering (QA) systems increasingly rely on retrieval-augmented generation (RAG) to ground large language models (LLMs) in authoritative medical knowledge, ensuring clinical accuracy and patient safety in Artificial Intelligence (AI) applications for healthcare. Despite progress in RAG evaluation, current benchmarks focus only on simple multiple-choice QA tasks and employ metrics that poorly capture the semantic precision required for complex QA tasks. These approaches fail to diagnose whether an error stems from faulty retrieval or flawed generation, limiting developers from performing targeted improvement. To address this gap, we propose RAG-X, a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks: information extraction, short-answer generation, and multiple-choice question (MCQ) answering. RAG-X introduces Context Utilization Efficiency (CUE) metrics to disaggregate system success into interpretable quadrants, isolating verified grounding from deceptive accuracy. Our experiments reveal an "Accuracy Fallacy", where a 14% gap separates perceived system success from evidence-based grounding. By surfacing hidden failure modes, RAG-X offers the diagnostic transparency needed for safe and verifiable clinical RAG systems.
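
CUE 指标把每条样本按"检索是否命中证据 × 答案是否正确"两个维度分入四象限,其中"检索未命中却答对"即摘要所说的表面虚假准确。下面是这一分解思路的极简示意,象限命名与函数名均为演示用假设:

```python
# 示意:按 (检索命中, 答案正确) 两个布尔维度把样本归入四象限。
# "答案正确但检索未命中"即虚假准确(Accuracy Fallacy 的来源)。

def cue_quadrant(retrieval_hit, answer_correct):
    if retrieval_hit and answer_correct:
        return "verified-grounding"   # 有证据支撑的正确
    if not retrieval_hit and answer_correct:
        return "deceptive-accuracy"   # 无证据支撑却答对
    if retrieval_hit and not answer_correct:
        return "generation-failure"   # 检索命中但生成出错
    return "retrieval-failure"        # 检索与生成皆失败

samples = [(True, True), (False, True), (True, False), (False, False), (False, True)]
counts = {}
for hit, ok in samples:
    q = cue_quadrant(hit, ok)
    counts[q] = counts.get(q, 0) + 1
print(counts)  # 其中 deceptive-accuracy 计数为 2
```

仅看整体准确率时,后两条 "deceptive-accuracy" 样本会被计为成功,这正是需要象限分解才能暴露的差距。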

[NLP-45] MMAI Gym for Science: Training Liquid Foundation Models for Drug Discovery

【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在药物发现任务中缺乏可靠科学理解与性能的问题,尤其是在依赖上下文学习(in-context learning)时表现不稳定,单纯扩大模型规模或引入推理标记(reasoning tokens)也无法显著提升效果。解决方案的关键在于构建一个名为MMAI Gym for Science的综合性平台,该平台整合了分子数据格式与模态、以及针对特定任务的推理、训练和基准测试方案,旨在教会基础模型掌握“分子语言”,从而高效解决实际药物发现问题。基于此平台训练出的液态基础模型(Liquid Foundation Model, LFM)虽规模较小,却在分子优化、ADMET性质预测、逆合成分析、药物-靶点活性预测及官能团推理等关键任务上达到接近专业模型的性能,并在多数场景中超越更大规模的通用或专用模型,同时具备更高的效率和更广的应用范围。

链接: https://arxiv.org/abs/2603.03517
作者: Maksim Kuznetsov,Zulfat Miftahutdinov,Rim Shayakhmetov,Mikolaj Mizera,Roman Schutski,Bogdan Zagribelnyy,Ivan Ilin,Nikita Bondarev,Thomas MacDougall,Mathieu Reymond,Mihir Bafna,Kaeli Kaymak-Loveless,Eugene Babin,Maxim Malkov,Mathias Lechner,Ramin Hasani,Alexander Amini,Vladimir Aladinskiy,Alex Aliper,Alex Zhavoronkov
机构: Insilico Medicine(英矽智能); Liquid AI(液态AI)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:General-purpose large language models (LLMs) that rely on in-context learning do not reliably deliver the scientific understanding and performance required for drug discovery tasks. Simply increasing model size or introducing reasoning tokens does not yield significant performance gains. To address this gap, we introduce the MMAI Gym for Science, a one-stop shop for molecular data formats and modalities as well as task-specific reasoning, training, and benchmarking recipes designed to teach foundation models the ‘language of molecules’ in order to solve practical drug discovery problems. We use MMAI Gym to train an efficient Liquid Foundation Model (LFM) for these applications, demonstrating that smaller, purpose-trained foundation models can outperform substantially larger general-purpose or specialist models on molecular benchmarks. Across essential drug discovery tasks - including molecular optimization, ADMET property prediction, retrosynthesis, drug-target activity prediction, and functional group reasoning - the resulting model achieves near specialist-level performance and, in the majority of settings, surpasses larger models, while remaining more efficient and broadly applicable in the domain.

[NLP-46] A theoretical model of dynamical grammatical gender shifting based on set-valued set function

【速读】: 该论文旨在解决跨语言中名词的形态标记(如语法性别、可数性等)变异规律问题,特别是如何通过形式化模型揭示词素与形态模板之间非线性动态映射的底层机制。其解决方案的关键在于提出一种基于模板的模块化认知模型(Template-Based and Modular Cognitive model),该模型以集合值函数 $ h : \mathscr{P}(M) \rightarrow \mathscr{P}(M) $ 为数学基础,将词汇项与其形态模板进行配对,并通过模板转换解释语法性别及其他形态标记的变化,从而统一建模不同语言中的形态标记复杂性,尤其在Riffian语中验证了该模型对名词内部派生及性别转换的有效性。

链接: https://arxiv.org/abs/2603.03510
作者: Mohamed El Idrissi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 2 figures, 4 tables

点击查看摘要

Abstract:This study investigates the diverse characteristics of nouns, focusing on both semantic (e.g., countable/uncountable) and morphosyntactic (e.g., masculine/feminine) distinctions. We explore inter-word variations for gender markers in noun morphology. Grammatical gender shift is a widespread phenomenon in languages around the world. The aim is to uncover through a formal model the underlying patterns governing the variation of lexemes. To this end, we propose a new computational component dedicated to pairing items with morphological templates (e.g., the result of a generated item-template pair: (funas, {N, +SG, -PL, -M, +F, -COL, +SING}), with its spell-out form: ða-funast ‘cow’). This process is formally represented by the Template-Based and Modular Cognitive model. This proposed model, defined by a set-valued set function $h : \mathscr{P}(M) \rightarrow \mathscr{P}(M)$, predicts the nonlinear dynamic mapping of lexical items onto morphological templates. By applying this formalism, we present a unified framework for understanding the complexities of morphological markings across languages. Through empirical observations, we demonstrate how these shifts, as well as non-gender shifts, arise during lexical changes, especially in Riffian. Our model posits that these variant markings emerge due to template shifts occurring during word and meaning’s formation. By formally demonstrating that conversion is applicable to noun-to-noun derivation, we challenge and broaden the conventional view of word formation. This mathematical model not only contributes to a deeper understanding of morphosyntactic variation but also offers potential applications in other fields requiring precise modelling of linguistic patterns.
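
为直观理解集合值函数 $h : \mathscr{P}(M) \rightarrow \mathscr{P}(M)$ 的作用,下面用 Python 的 frozenset 给一个玩具示意:把词项的特征集合映射到新的模板特征集合,模拟"集体名词派生出阴性单指名词"这类模板转换。特征符号沿用摘要示例,但转换规则本身为演示用假设,并非论文的正式定义:

```python
# 玩具示意:h : P(M) -> P(M),输入与输出都是特征集合。
# 假设规则:带 +COL(集体)的名词经模板转换得到 +F(阴性)+SING(单指)名词。

def h(features):
    features = frozenset(features)
    if "+COL" in features:
        # 模板转换:去掉集体/阳性标记,加入阴性与单指标记
        return frozenset(features - {"+COL", "-F", "+M"}) | {"+F", "+SING"}
    return features  # 无集体标记时不发生转换

src = {"N", "+SG", "+M", "+COL"}
print(sorted(h(src)))  # 不再含 '+COL' 与 '+M',新增 '+F' 与 '+SING'
```

关键点在于 h 的定义域与值域都是幂集 $\mathscr{P}(M)$:模型操作的是整套特征组合(模板),而非单个特征,语法性别转换即表现为模板整体的替换。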

[NLP-47] Raising Bars Not Parameters: LilMoo Compact Language Model for Hindi

【速读】: 该论文旨在解决大型多语言基础模型在自然语言处理(Natural Language Processing, NLP)中加剧的语言不平等现象,特别是低资源语言(如印地语)因数据稀缺和模型依赖性而被边缘化的问题。其解决方案的关键在于构建一个完全透明且可复现的训练流程,从头开始训练一个0.6亿参数的印地语专用语言模型LilMoo,而非依赖于黑箱多语言预训练模型。通过高质量的印地语语料库GigaLekh(结合启发式与大语言模型作为裁判的过滤方法)以及精心筛选的英-印地语双语增强数据,LilMoo在有限计算资源下实现了对同类规模多语言基线模型(如Qwen2.5-0.5B和Qwen3-0.6B)的超越,证明了针对特定语言设计的预训练策略可在子十亿参数级别上媲美大型多语言模型。

链接: https://arxiv.org/abs/2603.03508
作者: Shiza Fatimah,Aniket Sen,Sophia Falk,Florian Mai,Lucie Flek,Nicholas Kluge Corrêa
机构: Bonn-Aachen International Center for Information Technology (b-it) / CAISA Lab; Lamarr Institute for Machine Learning and Artificial Intelligence; Bonn Sustainable AI Lab; Helmholtz-Institut für Strahlen- und Kernphysik
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that well-designed language-specific pretraining can rival large multilingual models at the sub-billion-parameter range.

[NLP-48] When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning ICLR2026

【速读】: 该论文旨在解决当前数学推理模型在教育、自动辅导和决策支持系统中广泛应用时所表现出的根本性计算不稳定性问题,即模型的准确率指标可能掩盖其内部推理路径的不可靠性。解决方案的关键在于引入新颖的忠实度(faithfulness)度量方法,对模型推理过程进行细粒度分析,揭示出高准确率背后存在大量计算不一致的推理路径(占正确预测的81.6%),并发现模型存在“沉默失败”(silent failures)现象(8.8%的自信错误输出)。这一方法强调必须超越单一样本准确性,建立能够评估推理稳定性的新评价体系,从而推动模型可靠性提升与评测范式的革新。

链接: https://arxiv.org/abs/2603.03475
作者: Subramanyam Sahoo,Aman Chadha,Vinija Jain,Divya Chaudhary
机构: AWS Generative AI Innovation Center (AWS生成式人工智能创新中心); Meta AI (Meta人工智能); Stanford University (斯坦福大学); Northeastern University (东北大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICLR 2026 Workshop on Latent Implicit Thinking - Going Beyond CoT Reasoning. 19 Pages and 5 Figures

点击查看摘要

Abstract:Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways. Additionally, 8.8% of all predictions are silent failures – confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows weak negative correlation with correctness (r=-0.21, p=0.002), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (4.7x increase) provides zero accuracy benefit on our evaluated subset (6% of GSM8K), requiring validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with ~20% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms measuring stability beyond single-sample metrics.
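
摘要中的"沉默失败"指高置信但错误的输出(占全部预测的 8.8%)。下面给出统计该比例的极简示意,置信阈值 0.8 与示例数据均为演示用假设:

```python
# 示意:从 (confidence, correct) 样本中统计"沉默失败"
# (高置信且答错)占全部预测的比例。

def silent_failure_rate(preds, conf_threshold=0.8):
    silent = sum(1 for conf, ok in preds if conf >= conf_threshold and not ok)
    return silent / len(preds)

preds = [(0.95, True), (0.9, False), (0.4, False), (0.85, True), (0.99, False)]
print(silent_failure_rate(preds))  # 5 条中 2 条高置信错误 -> 0.4
```

低置信的错误(如 0.4 那条)不计入沉默失败:它们虽答错,但模型自身的不确定性至少给出了预警信号。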

[NLP-49] Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer’s MLP Budget

【速读】: 该论文旨在解决生成式 AI(Generative AI)中Transformer模型中多层感知机(MLP)非线性模块是否真正必要这一关键问题。研究发现,MLP的非线性并非在所有情况下都不可或缺,其必要性高度依赖于上下文(contextual),且无法通过token身份预测(跨语料库相关系数 r < 0.05)。解决方案的关键在于引入一个轻量级门控机制(gate),该机制仅用 d+1 个参数动态决定何时用线性替代模型中的完整MLP。该门控策略利用了MLP计算中大多数实例近似线性的分布特性,在GPT-2中实现25–56%的线性路由,仅带来1%的困惑度(perplexity)损失;进一步通过分阶段训练优化,成功将性能提升至优于基线模型,验证了部分MLP层的非线性反而会损害模型表现。

链接: https://arxiv.org/abs/2603.03459
作者: Peter Balogh
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate when transformer MLP nonlinearity is actually necessary. A gate with d+1 parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that nonlinearity need cannot be predicted from token identity: cross-corpus correlation is zero (r < 0.05). The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution where most MLP computations are near-linear, achieving 25-56% linear routing at 1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat baseline with gating and no layer exceeds 3.7% all-linear cost. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B’s full 32-layer sweep reveals one layer that narrowly beats baseline. As a proof of concept, we progressively replace middle-layer MLPs with frozen linear matrices: 5 of 24 layers linearize at zero cost. With a full training budget, 4 linearized layers yield a 10.2% perplexity improvement – and a two-phase gated approach pushes this to 17.3%, beating a vanilla fine-tuning control and confirming that the nonlinear MLPs at these layers were actively harmful.
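
论文的核心机制是一个仅含 d+1 个参数(权重向量加偏置)的门,按上下文决定走完整 MLP 还是线性替代。下面是该路由思路的纯 Python 玩具示意:门的权重为随意设定而非训练所得,MLP 与线性替代也只是占位实现:

```python
import math

# 示意:d+1 参数的门 g(x) = sigmoid(w·x + b)。
# g(x) >= 0.5 时走完整(非线性)MLP,否则走线性替代。

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_forward(x, w, b):
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    if g >= 0.5:
        # 完整 MLP(玩具非线性:ReLU 后放大)
        return [max(xi, 0.0) * 2.0 for xi in x]
    # 线性替代(示意用恒等映射)
    return list(x)

w, b = [1.0, -1.0, 0.5, 0.0], 0.0   # 门的全部参数:d + 1 = 5 个
print(gated_forward([1.0, 0.0, 0.0, 0.0], w, b))   # w·x > 0 -> 走 MLP
print(gated_forward([-1.0, 0.0, 0.0, 0.0], w, b))  # w·x < 0 -> 走线性
```

门本身极为廉价(一次点积加一次 sigmoid),因此即使只把部分近线性的计算路由到线性替代,也能在几乎不损失困惑度的前提下省下 MLP 的大头开销。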

[NLP-50] Asymmetric Goal Drift in Coding Agents Under Value Conflict ICLR2026

【速读】: 该论文试图解决的问题是:在长期运行、多步骤且环境复杂的现实场景中,自主性编码代理(agentic coding agents)如何在系统提示(system prompt)约束与模型自身习得的价值观(如安全性和隐私保护)之间进行权衡,并识别其是否会发生目标漂移(goal drift)。现有研究依赖静态合成环境,难以捕捉真实世界的动态压力和价值冲突。解决方案的关键在于构建一个基于OpenCode的框架,用于执行真实、多步骤的编码任务,从而量化代理在有无环境压力的情况下违反系统提示的程度。实验表明,GPT-5 mini、Haiku 4.5 和 Grok Code Fast 1 均表现出不对称的目标漂移现象——当系统提示与强价值观冲突时更易违规;且目标漂移由价值对齐度、对抗性压力和累积上下文三个因素共同驱动,揭示了浅层合规检查的不足,并指出评论类环境压力可利用模型的价值层级来绕过系统指令。

链接: https://arxiv.org/abs/2603.03456
作者: Magnus Saebo,Spencer Gibson,Tyler Crosse,Achyutha Menon,Eyon Jang,Diogo Cruz
机构: Columbia University (哥伦比亚大学); Georgia Tech (佐治亚理工学院); UC San Diego (加州大学圣地亚哥分校); MATS; SPAR
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 5 pages, 4 figures, Published as a workshop paper in Lifelong Agents @ ICLR 2026

点击查看摘要

Abstract:Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizons. Throughout an agent’s lifetime, it must navigate tensions between explicit instructions, learned values, and environmental pressures, often in contexts unseen during training. Prior work on model preferences, agent behavior under value tensions, and goal drift has relied on static, synthetic settings that do not capture the complexity of real-world environments. To this end, we introduce a framework built on OpenCode to orchestrate realistic, multi-step coding tasks to measure how agents violate explicit constraints in their system prompt over time with and without environmental pressure toward competing values. Using this framework, we demonstrate that GPT-5 mini, Haiku 4.5, and Grok Code Fast 1 exhibit asymmetric drift: they are more likely to violate their system prompt when its constraint opposes strongly-held values like security and privacy. We find for the models and values tested that goal drift correlates with three compounding factors: value alignment, adversarial pressure, and accumulated context. However, even strongly-held values like privacy show non-zero violation rates under sustained environmental pressure. These findings reveal that shallow compliance checks are insufficient and that comment-based pressure can exploit model value hierarchies to override system prompt instructions. More broadly, our findings highlight a gap in current alignment approaches in ensuring that agentic systems appropriately balance explicit user constraints against broadly beneficial learned preferences under sustained environmental pressure.

[NLP-51] Farther the Shift Sparser the Representation: Analyzing OOD Mechanisms in LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对分布外(out-of-distribution, OOD)输入时,其内部表征如何动态适应的问题。研究发现,随着任务难度增加(如更复杂的推理问题、更长的上下文或引入干扰选项),LLM最后隐藏层的表征会显著变稀疏——即“距离越远,表征越稀疏”。这一现象在不同模型和领域中具有一致性,表明模型通过将计算集中在特定子空间来应对复杂或陌生输入。解决方案的关键在于识别并利用这种稀疏性与难度之间的定量关系,提出了一种基于稀疏性的课程学习策略(Sparsity-Guided Curriculum In-Context Learning, SG-ICL),该策略通过显式地以表征稀疏度为调度信号来优化少样本演示(few-shot demonstrations)的顺序,从而大幅提升模型在OOD场景下的推理稳定性与性能。

链接: https://arxiv.org/abs/2603.03415
作者: Mingyu Jin,Yutong Yin,Jingcheng Niu,Qingcheng Zeng,Wujiang Xu,Mengnan Du,Wei Cheng,Zhaoran Wang,Tianlong Chen,Dimitris N. Metaxas
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, **the farther the shift, the sparser the representations**. This sparsity–difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design *Sparsity-Guided Curriculum In-Context Learning* (SG-ICL), a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: this https URL.
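
下面示意论文中两个关键步骤的一种可能实现:以"近零分量占比"度量隐藏向量的稀疏度,并按稀疏度升序(由易到难)排列少样本演示,对应 SG-ICL 的课程调度思路。阈值与具体度量方式为演示用假设,并非论文的原始定义:

```python
# 示意:以"近零分量占比"度量隐藏向量稀疏度,
# 并按稀疏度升序排列少样本演示(由易到难的课程)。

def sparsity(vec, eps=0.05):
    return sum(1 for v in vec if abs(v) < eps) / len(vec)

def curriculum_order(demos):
    # demos: [(demo_id, hidden_vec), ...] -> 按稀疏度升序的 id 列表
    return [d for d, _ in sorted(demos, key=lambda p: sparsity(p[1]))]

demos = [
    ("hard",   [0.0, 0.01, 0.02, 0.9]),   # 稀疏度 0.75
    ("easy",   [0.5, -0.3, 0.8, 0.2]),    # 稀疏度 0.0
    ("medium", [0.0, 0.4, -0.6, 0.3]),    # 稀疏度 0.25
]
print(curriculum_order(demos))  # ['easy', 'medium', 'hard']
```

按论文的观察,稀疏度随 OOD 偏移单调增大,因此稀疏度本身可以充当无需额外标注的难度信号来编排演示顺序。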

[NLP-52] racing Pharmacological Knowledge In Large Language Models ICLR2026

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在药物发现和药理学任务中表现出强大性能,但其内部如何编码和存储药理学知识的机制尚不清晰。为解答这一问题,作者采用因果干预(activation patching)与线性探针(linear probing)相结合的可解释性方法,系统分析了基于Llama的生物医学语言模型中药物类别语义的表示与检索机制。解决方案的关键在于:通过激活修补定位药物类别信息在模型各层和标记位置中的存储分布,并结合token级与池化后表示的线性探针验证其语义表征特性;结果表明,药物类别语义主要由早期层编码,且集中在药物类别跨度内的中间token而非最终token,同时语义信息以分布式方式存在于嵌入空间中——token级探针表现接近随机,而对池化后的表示进行探针则达到最高准确率。这揭示了药理学知识在LLMs中并非依赖单个token,而是通过跨token的分布式表征实现。

链接: https://arxiv.org/abs/2603.03407
作者: Basil Hasan Khwaja,Dylan Chen,Guntas Toor,Anastasiya Kuznetsova
机构: Purdue University (普渡大学); University of Southern California (南加州大学); Queen’s University (皇后大学); Scripps Research (斯克里普斯研究所)
类目: Computation and Language (cs.CL)
备注: Accepted, Learning Meaningful Representations of Life (LMRL) Workshop @ ICLR 2026

点击查看摘要

Abstract:Large language models (LLMs) have shown strong empirical performance across pharmacology and drug discovery tasks, yet the internal mechanisms by which they encode pharmacological knowledge remain poorly understood. In this work, we investigate how drug-group semantics are represented and retrieved within Llama-based biomedical language models using causal and probing-based interpretability methods. We apply activation patching to localize where drug-group information is stored across model layers and token positions, and complement this analysis with linear probes trained on token-level and sum-pooled activations. Our results demonstrate that early layers play a key role in encoding drug-group knowledge, with the strongest causal effects arising from intermediate tokens within the drug-group span rather than the final drug-group token. Linear probing further reveals that pharmacological semantics are distributed across tokens and are already present in the embedding space, with token-level probes performing near chance while sum-pooled representations achieve maximal accuracy. Together, these findings suggest that drug-group semantics in LLMs are not localized to single tokens but instead arise from distributed representations. This study provides the first systematic mechanistic analysis of pharmacological knowledge in LLMs, offering insights into how biomedical semantics are encoded in large language models.
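
作为摘要中「token 级探针接近随机、sum-pool 表示探针准确率最高」这一现象的直观示意,下面给出一个假设性的玩具实验(数据、维度与最近类中心探针均为虚构,并非论文原始设置):当类别信号分散在 span 内各 token 上时,对 token 求和池化后再做线性探针更容易读出信号。

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, d = 200, 12, 32   # 样本数、span 内 token 数、隐藏维度(均为虚构)

# 玩具数据:类别信号均匀分散在 span 的 T 个 token 上,每个 token 只携带 1/T
labels = rng.integers(0, 2, n)
sign = np.where(labels[:, None] == 1, 1.0, -1.0)
tokens = rng.normal(size=(n, T, d))
tokens[..., 0] += sign / T

def centroid_probe_acc(X, y):
    # 最近类中心线性探针(训练集精度,仅作示意)
    c0, c1 = X[y == 0].mean(0), X[y == 1].mean(0)
    pred = np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)
    return float((pred == y).mean())

tok_acc = centroid_probe_acc(tokens[:, 0, :], labels)      # token 级探针
pool_acc = centroid_probe_acc(tokens.sum(axis=1), labels)  # sum-pool 后探针
print(tok_acc, pool_acc)
```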

[NLP-53] Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)评估中基于成对人类偏好数据构建的排名存在统计不确定性的问题。现有方法依赖于点估计,将排名视为固定对象,忽略了因估计噪声和上下文依赖性导致的性能波动,从而可能导致错误决策和福利损失。解决方案的关键在于提出一种决策安全的排名推断框架,其核心是采用情境化Bradley-Terry-Luce模型(contextual Bradley-Terry-Luce model),将模型潜在效用建模为输入提示(prompt)的函数,并直接对诱导出的排名进行统计推断,而非仅估计效用点值。通过构造基于成对效用差异的联合置信区间,该框架可生成针对特定提示的边际与联合置信集,从而提供严格的统计保证,识别真正具有显著差异的排名关系,避免在数据不支持时做出错误的全序判断,转而返回部分序(partial order),实现稳健的基于排名的决策。

链接: https://arxiv.org/abs/2603.03336
作者: Angel Rodrigo Avelar Menendez,Yufeng Liu,Xiaowu Dai
机构: University of California, Los Angeles (加州大学洛杉矶分校); University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt. Rather than targeting point estimates of utilities, we directly conduct inference on induced rankings, constructing confidence sets based on simultaneous confidence intervals for pairwise utility differences. This approach yields statistically valid marginal and simultaneous confidence sets for prompt-specific ranks. Our framework connects recent advances in rank inference to contextual preference learning and provides tools for robust ranking-based decision-making. Empirically, using large-scale human preference data from LLM evaluations, we show that rankings vary substantially across prompt characteristics and that many apparent rank differences are not statistically distinguishable. We further demonstrate how uncertainty-aware rankings identify dominance only when supported by the data and otherwise return partial orders.
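
摘要中的情境化 Bradley-Terry-Luce 模型可用如下玩具代码示意:每个模型的潜在效用是 prompt 特征的函数,成对偏好概率为效用差的 sigmoid。效用函数形式与数值均为假设,仅用于说明「排名随 prompt 特征翻转」这一点。

```python
import math

# 玩具效用函数:模型潜在效用是 prompt 特征 x 的线性函数(权重为虚构)
utilities = {
    "model_a": lambda x: 1.0 + 0.8 * x["math"],
    "model_b": lambda x: 1.4 - 0.5 * x["math"],
}

def win_prob(i, j, x):
    """P(i 优于 j | prompt x) = sigmoid(u_i(x) - u_j(x))"""
    return 1 / (1 + math.exp(-(utilities[i](x) - utilities[j](x))))

# 排名随 prompt 特征翻转:普通 prompt 下 b 领先,数学类 prompt 下 a 领先
print(win_prob("model_a", "model_b", {"math": 0.0}) < 0.5)   # True
print(win_prob("model_a", "model_b", {"math": 1.0}) > 0.5)   # True
```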

[NLP-54] Compressed Sensing for Capability Localization in Large Language Models

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)中特定能力(如数学推理、代码生成等)的组织机制不明确,即这些能力是否分布在整个模型中,还是由局部、稀疏的组件实现。解决方案的关键在于发现并定位这些能力所依赖的注意力头(attention heads),通过压缩感知(compressed sensing)方法,利用这些头在功能上的稀疏性,仅需少量针对性的“敲除”(knockouts)和模型评估即可高效识别出任务特定的注意力头。实验表明,仅移除5个任务相关的注意力头即可导致性能下降高达65%,而其他任务几乎不受影响,揭示了Transformer架构中能力的模块化与局部化组织原则。

链接: https://arxiv.org/abs/2603.03335
作者: Anna Bair,Yixuan Even Xu,Mingjie Sun,J. Zico Kolter
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit a wide range of capabilities, including mathematical reasoning, code generation, and linguistic behaviors. We show that many capabilities are highly localized to small subsets of attention heads within Transformer architectures. Zeroing out as few as five task-specific heads can degrade performance by up to 65% on standard benchmarks measuring the capability of interest, while largely preserving performance on unrelated tasks. We introduce a compressed sensing based method that exploits the sparsity of these heads to identify them via strategic knockouts and a small number of model evaluations. We validate these findings across Llama and Qwen models ranging from 1B to 8B parameters and a diverse set of capabilities including mathematical abilities and code generation, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components. Overall, our results suggest that capability localization is a general organizational principle of Transformer language models, with implications for interpretability, model editing, and AI safety. Code is released at this https URL.
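
按摘要描述的思路,可以把「敲除若干注意力头并测量性能下降」建模为对稀疏重要性向量的线性观测,再用压缩感知式的稀疏恢复定位关键头。下面是一个假设性草图(用正交匹配追踪 OMP 代替论文的具体算法,头数、测量数与权重均为虚构):

```python
import numpy as np

rng = np.random.default_rng(0)
H, k, m = 64, 3, 24   # 总头数、真正重要的头数、敲除测量次数(均为虚构)

# 真实的稀疏「重要性」向量:置零这 k 个头会损害任务性能
w = np.zeros(H)
w[rng.choice(H, k, replace=False)] = rng.uniform(0.5, 1.0, k)

# 每次测量随机敲除一部分头并记录性能下降,此处简化为被敲除头的线性叠加
A = rng.integers(0, 2, size=(m, H)).astype(float)
y = A @ w

# 正交匹配追踪:贪心选取最能解释残差的头
support, residual = [], y.copy()
for _ in range(k):
    j = int(np.argmax(np.abs(A.T @ residual)))
    support.append(j)
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    residual = y - A[:, support] @ coef

print(sorted(support))   # 恢复出的候选关键头下标
```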

[NLP-55] The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在数学推理评估中对研究生级别及计算数学领域覆盖不足的问题。现有评测主要集中在初等数学、竞赛题或形式化定理证明,缺乏对高等数学与科学计算能力的系统性考察。其解决方案的关键在于构建一个全新的多选题基准数据集 CompMath-MCQ,包含由研究生课程教授原创编写的1,500道题目,涵盖线性代数、数值优化、向量微积分、概率论及基于Python的科学计算等核心领域,并通过跨模型不一致性和专家人工审核确保题目有效性与无数据泄露。该设计支持通过 lm_eval 库实现客观、可复现且无偏的评估,从而为高级数学推理能力提供标准化测试平台。

链接: https://arxiv.org/abs/2603.03334
作者: Bianca Raimondi,Francesco Pivi,Davide Evangelista,Maurizio Gabbrielli
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint. Under review

点击查看摘要

Abstract:The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 originally authored questions by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Three option choices are provided for each question, with exactly one of them being correct. To ensure the absence of data leakage, all questions are newly created and not sourced from existing materials. The validity of questions is verified through a procedure based on cross-LLM disagreement, followed by manual expert review. By adopting a multiple-choice format, our dataset enables objective, reproducible, and bias-free evaluation through lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: this https URL

[NLP-56] Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中速度慢的问题,尤其是通过推测解码(Speculative Decoding)技术在不牺牲任务性能的前提下提升推理效率。其核心挑战在于如何更高效地匹配轻量级草稿模型(draft model)生成的token与目标模型(target model)的预测分布,从而提高接受率并避免对目标模型分布造成显著偏移。解决方案的关键在于提出DropMatch方法,该方法仅在语言模型头部(LM head)应用蒙特卡洛dropout(Monte Carlo dropout),生成多个解码路径以形成经验token分布,并基于此分布进行采样驱动的接受决策;这一机制能够在无需训练、数据或校准的情况下自适应控制解码路径规模,同时保持与现有推测解码技术的正交集成性,最终实现更高的接受长度和推理加速效果(最高达1.33倍)。

链接: https://arxiv.org/abs/2603.03333
作者: Jeongtae Lee,Minjung Jo,Hyunjoon Jeong,Gunho Park,Sunghyeon Woo,Joonghoon Kim,Se Jung Kwon,Dongsoo Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures, 10 tables

点击查看摘要

Abstract:Speculative decoding accelerates large language model inference by proposing tokens with a lightweight draft model and selectively accepting them using a target model. This work introduces DropMatch, a novel approach that matches draft tokens to the predictive distribution of the target model via Monte Carlo dropout applied exclusively to the LM head, enabling sampling-based acceptance decisions. By generating multiple decoding paths, our method forms an empirical token distribution against which draft tokens are evaluated for consistency. This acceptance mechanism enables the model to adaptively control the size of decoding paths under an appropriate dropout probability, preventing substantial distortion of the target model predictive distribution. The proposed method operates in a training-free, data-free, and calibration-free manner, requires no architectural modification to pretrained models, and can be orthogonally integrated with a wide range of existing speculative decoding and inference acceleration techniques. Experiments across multiple benchmarks demonstrate that our approach increases acceptance length while maintaining competitive task performance, yielding inference speedups ranging from 1.09x to 1.33x over the standard baseline, and up to an additional 1.09x speedup when applied on top of EAGLE3.
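
DropMatch 的核心机制——仅在 LM head 上做 Monte Carlo dropout,形成经验 token 分布并据此决定是否接受草稿 token——可以用如下假设性代码示意,权重、维度与阈值均为虚构:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, S, p = 50, 16, 32, 0.1   # 词表大小、隐藏维度、dropout 采样数、丢弃概率(虚构)

W = rng.normal(size=(V, d))    # 占位的 LM head 权重
h = rng.normal(size=d)         # 目标模型最后一层隐藏状态(虚构)

# Monte Carlo dropout 只作用于 LM head:每次采样随机丢弃部分隐藏单元,各解码出一个 token
samples = []
for _ in range(S):
    mask = (rng.random(d) > p) / (1 - p)   # inverted dropout 缩放
    logits = W @ (h * mask)
    samples.append(int(np.argmax(logits)))

# S 条随机解码路径构成经验 token 分布
counts = np.bincount(samples, minlength=V)
empirical = counts / S

def accept(draft_token, threshold=1 / S):
    # 若草稿 token 出现在经验分布中(频率达到阈值)则接受
    return empirical[draft_token] >= threshold

greedy = int(np.argmax(W @ h))
print(accept(greedy))   # 贪心 token 通常会被接受
```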

[NLP-57] Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)提示在大型语言模型(Large Language Models, LLMs)中对推理过程中扰动的鲁棒性问题,特别是针对五类结构化扰动类型(MathError、UnitConversion、Sycophancy、SkippedSteps 和 ExtraSteps)的影响进行系统评估。其解决方案的关键在于构建一个全面的实证框架,通过在不同参数规模(3B 到 1.5T)的13个模型上注入扰动并测量数学推理任务准确率的变化,揭示了扰动类型与模型规模之间的非均匀脆弱性模式,并发现模型规模对部分扰动具有幂律尺度保护效应,但对维度转换类任务仍缺乏有效防御能力。这一结果为多阶段推理流水线中LLM的部署提供了关键的鲁棒性基准和针对性优化方向。

链接: https://arxiv.org/abs/2603.03332
作者: Ashwath Vaithinathan Aravindan,Mayank Kejriwal
机构: University of Southern California (南加州大学); Information Sciences Institute (信息科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: *MathError*, *UnitConversion*, *Sycophancy*, *SkippedSteps*, and *ExtraSteps*. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T; parameter counts of closed models are assumed), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20-30% loss even for largest models); ExtraSteps incur minimal accuracy degradation (0-6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power-law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at this https URL.
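
摘要提到的幂律缩放关系可以用 log-log 线性回归来拟合。以下为示意代码,其中的损失数值为虚构示例,并非论文报告的结果:

```python
import numpy as np

# 虚构示例:不同参数规模下某类扰动造成的准确率损失(非论文数值)
params = np.array([3e9, 8e9, 70e9, 4e11])
loss = np.array([0.55, 0.40, 0.18, 0.08])

# 幂律 loss = a * params^(-b) 在 log-log 空间下是线性的
slope, log_a = np.polyfit(np.log(params), np.log(loss), 1)
a, b = np.exp(log_a), -slope
print(b > 0)   # True:指数为正,说明规模对该类扰动有保护作用
```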

[NLP-58] PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning

【速读】: 该论文旨在解决现有光电容积脉搏波描记法(Photoplethysmography, PPG)数据集在支持语言驱动的生理推理和多模态基础模型研究方面的局限性,即多数数据集仅提供数值测量或任务特定标签,缺乏自然语言层面的标注。其解决方案的关键在于构建一个大规模、标准化的PPG-文本问答(QA)数据集PulseLM,该数据集将来自15个公开来源的PPG波形与315万个问题-答案对进行对齐,并统一为12个常见生理学QA任务,从而实现原始PPG信号与自然语言之间的映射,为多模态生理推理和可扩展的PPG语言模型基准测试提供统一基础。

链接: https://arxiv.org/abs/2603.03331
作者: Hung Manh Pham,Jinyang Wu,Xiao Ma,Yiming Zhang,Yixin Xu,Aaqib Saeed,Bin Zhu,Zhou Pan,Dong Ma
机构: Singapore Management University(新加坡管理大学); Queen Mary University of London(伦敦玛丽女王大学); Eindhoven University of Technology(埃因霍温理工大学); University of Cambridge(剑桥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PulseLM v1

点击查看摘要

Abstract:Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their suitability for language-based physiological reasoning and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text dataset designed to bridge raw PPG waveforms and natural language through a unified, closed-ended question answering (QA) formulation. PulseLM aggregates PPG recordings from fifteen publicly available sources and harmonizes heterogeneous annotations into twelve common physiologically QA tasks. The dataset comprises 1.31 million standardized 10-second PPG segments, associated with 3.15 million question-answer pairs. We further define reproducible preprocessing, supervision, and evaluation protocols and establish baseline benchmarks using multimodal PPG-aware large language models. PulseLM provides a standardized foundation for studying multimodal physiological reasoning, cross-dataset generalization, and scalable benchmarking of PPG-based language models. The data and code can be found publicly available at: this https URL.

[NLP-59] Certainty robustness: Evaluating LLM stability under self-challenging prompts

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在面对质疑时表现出的不确定性鲁棒性不足的问题,即模型在交互式场景中缺乏对自身置信度的合理响应能力,导致其可能在受到挑战时错误地更改正确答案。解决方案的关键在于提出一个两轮评估框架——Certainty Robustness Benchmark(确定性鲁棒性基准),该框架通过引入自我质疑提示(如“你确定吗?”)和显式矛盾提示(如“你错了!”),结合数值化置信度获取机制,系统评估模型在交互压力下的稳定性与适应性,并区分合理的自我修正与非理性的答案变更。这一方法揭示了传统单轮准确率指标无法捕捉的交互可靠性差异,为模型对齐、可信度提升及实际部署提供了新的评价维度。

链接: https://arxiv.org/abs/2603.03330
作者: Mohammadreza Saadat,Steve Nemzer
机构: TELUS Digital (TELUS数字)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 tables Benchmark and evaluation study of large language models

点击查看摘要

Abstract:Large language models (LLMs) often present answers with high apparent confidence despite lacking an explicit mechanism for reasoning about certainty or truth. While existing benchmarks primarily evaluate single-turn accuracy, truthfulness or confidence calibration, they do not capture how models behave when their responses are challenged in interactive settings. We introduce the Certainty Robustness Benchmark, a two-turn evaluation framework that measures how LLMs balance stability and adaptability under self-challenging prompts such as uncertainty (“Are you sure?”) and explicit contradiction (“You are wrong!”), alongside numeric confidence elicitation. Using 200 reasoning and mathematics questions from LiveBench, we evaluate four state-of-the-art LLMs and distinguish between justified self-corrections and unjustified answer changes. Our results reveal substantial differences in interactive reliability that are not explained by baseline accuracy alone: some models abandon correct answers under conversational pressure, while others demonstrate strong resistance to challenge and better alignment between confidence and correctness. These findings identify certainty robustness as a distinct and critical dimension of LLM evaluation, with important implications for alignment, trustworthiness and real-world deployment.
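
论文区分「合理自我修正」与「非理性答案变更」的统计方式可用如下极简示意(记录与数值均为虚构):第一轮答对却在质疑后改答案计为 unjustified,第一轮答错且在质疑后改正计为 justified。

```python
# 虚构的两轮评估记录:(第一轮是否答对, 被质疑后是否改变答案)
records = [
    (True, False), (True, True), (False, True), (False, False), (True, False),
]

unjustified = sum(c and ch for c, ch in records)      # 在质疑下放弃了正确答案
justified = sum((not c) and ch for c, ch in records)  # 在质疑下修正了错误答案
print(unjustified, justified)   # 1 1
```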

[NLP-60] AutoHarness: improving LLM agents by automatically synthesizing a code harness

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)作为智能体(Agent)在实际环境中执行动作时,常因违反环境规则而产生非法行为的问题,例如在棋类游戏中做出非法走子。研究发现,这种问题在实践中普遍存在,如Gemini-2.5-Flash在Kaggle GameArena比赛中78%的失败源于非法移动。解决方案的关键在于利用少量迭代式代码精炼(iterative code refinement)过程,使模型自动合成一个“代码约束器”(code harness),该约束器能有效屏蔽非法动作,并在145个TextArena游戏中实现零非法操作。进一步地,通过极限优化,该方法还可让模型生成完整的决策策略代码(code-policy),从而完全替代LLM在决策阶段的使用,且在16个单人TextArena游戏中获得比Gemini-2.5-Pro和GPT-5.2-High更高的平均奖励,证明了小模型通过自动生成定制化代码策略可超越大模型性能并更具成本效益。

链接: https://arxiv.org/abs/2603.03329
作者: Xinghua Lou,Miguel Lázaro-Gredilla,Antoine Dedieu,Carter Wendelken,Wolfgang Lehrach,Kevin P. Murphy
机构: Google DeepMind
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: agent harness, code synthesis, self-improvement, code-as-policy, text games

点击查看摘要

Abstract:Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write “harnesses” around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
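
摘要中的「代码约束器(harness)」思想可以用如下极简草图示意:合成的规则引擎在动作到达环境前过滤非法动作。规则与回退策略均为假设;论文中这段约束器代码由 LLM 根据环境反馈迭代合成。

```python
def legal_moves(state):
    # 占位的规则引擎(AutoHarness 中由 LLM 合成并依据环境反馈迭代精炼)
    return {m for m in range(10) if (state + m) % 3 != 0}

def harnessed_agent(state, propose):
    legal = legal_moves(state)
    move = propose(state)
    if move not in legal:        # 拦截非法动作
        move = sorted(legal)[0]  # 回退到任意合法动作(回退策略为假设)
    return move

# 一个不管合法性、总是提议 0 的智能体,经约束器后仍输出合法动作
print(harnessed_agent(3, lambda s: 0))   # 1
```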

[NLP-61] StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

【速读】: 该论文试图解决现有语言模型可解释性研究中对层间全局结构关系关注不足的问题,尤其是当前方法多聚焦于层内或模块内的局部token间关系(如多头注意力机制),而忽略了跨层之间的整体结构关联。其解决方案的关键在于提出StructLens框架,该框架通过构建基于残差流语义表示的最大生成树(maximum spanning tree),类比依存句法分析的方式揭示层内token连接的全局结构特征,并利用树结构属性从结构角度量化层间距离或相似性。这一结构感知的相似性指标与传统余弦相似性显著不同,且在实际任务(如层剪枝)中表现出优越性能,验证了结构分析在理解与优化语言模型中的有效性。

链接: https://arxiv.org/abs/2603.03328
作者: Haruki Sakajo,Frederikus Hudi,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest internal structures as well. While interpretability research has investigated the components of language models, existing approaches focus on local inter-token relationships within layers or modules (e.g., Multi-Head Attention), leaving global inter-layer relationships largely overlooked. To address this gap, we introduce StructLens, an analytical framework designed to reveal how internal structures relate holistically through their inter-token connection within a layer. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, analogous to dependency parsing, and leverages the tree properties to quantify inter-layer distance (or similarity) from a structural perspective. Our findings demonstrate that StructLens yields an inter-layer similarity pattern that is distinctively different from conventional cosine similarity. Moreover, this structure-aware similarity proves to be beneficial for practical tasks, such as layer pruning, highlighting the effectiveness of structural analysis for understanding and optimizing language models. Our code is available at this https URL.
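
StructLens 的核心步骤——由残差流表示的相似度构建最大生成树——可以用如下假设性代码示意(用 Prim 算法在余弦相似度矩阵上求最大生成树;token 表示为随机虚构数据):

```python
import numpy as np

def max_spanning_tree(sim):
    """在稠密相似度矩阵上跑 Prim 算法,返回最大生成树的边列表。"""
    n = sim.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or sim[i, j] > best[2]):
                    best = (i, j, sim[i, j])
        edges.append(best[:2])
        in_tree.add(best[1])
    return edges

rng = np.random.default_rng(0)
reps = rng.normal(size=(6, 8))   # 虚构:6 个 token、8 维残差流表示
normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
sim = normed @ normed.T          # 余弦相似度
np.fill_diagonal(sim, -np.inf)   # 禁止自环

tree = max_spanning_tree(sim)
print(len(tree))   # n 个 token 的生成树有 n - 1 条边 → 5
```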

[NLP-62] A benchmark for joint dialogue satisfaction emotion recognition and emotion state transition prediction

【速读】: 该论文旨在解决中文对话场景中用户满意度预测受限于现有数据集匮乏及情绪动态变化难以捕捉的问题,尤其指出单轮对话无法充分反映多轮交互中的情绪演变,从而影响满意度预测的准确性。其解决方案的关键在于构建了一个多任务、多标签的中文对话数据集,该数据集同时支持满意度识别、情绪识别以及情绪状态转换预测,为研究对话系统中的情感与满意度关系提供了新的高质量数据资源。

链接: https://arxiv.org/abs/2603.03327
作者: Jing Bian,Haoxiang Su,Liting Jiang,Di Wu,Ruiyu Fang,Xiaomeng Huang,Yanbing Li,Shuangyong Song,Hao Huang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:User satisfaction is closely related to enterprises, as it not only directly reflects users’ subjective evaluation of service quality or products, but also affects customer loyalty and long-term business revenue. Monitoring and understanding user emotions during interactions helps predict and improve satisfaction. However, relevant Chinese datasets are limited, and user emotions are dynamic; relying on single-turn dialogue cannot fully track emotional changes across multiple turns, which may affect satisfaction prediction. To address this, we constructed a multi-task, multi-label Chinese dialogue dataset that supports satisfaction recognition, as well as emotion recognition and emotional state transition prediction, providing new resources for studying emotion and satisfaction in dialogue systems.

[NLP-63] Controllable and explainable personality sliders for LLM s at inference time

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个性化人格对齐时面临的效率与灵活性问题:传统方法如监督微调(Supervised Fine-Tuning, SFT)或基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)虽有效,但需为每个目标人格特征训练独立模型,成本高昂且缺乏可扩展性;而推理阶段的激活控制(inference-time activation steering)虽参数高效,却因向量干扰无法同时调控多个性格维度。其解决方案的关键在于提出一种模块化框架——顺序自适应引导(Sequential Adaptive Steering, SAS),通过在残差流上逐次训练探测器并实现向量正交化,使引导向量成为可复用的基本单元,从而支持用户仅通过调整系数α即可即时合成复杂、高保真的人格配置,实现无需更新模型参数的连续多维人格调控。

链接: https://arxiv.org/abs/2603.03326
作者: Florian Hoppe,David Khachaturov,Robert Mullins,Mark Huasong Meng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 18 figures

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) with specific personas typically relies on expensive and monolithic Supervised Fine-Tuning (SFT) or RLHF. While effective, these methods require training distinct models for every target personality profile. Inference-time activation steering offers a parameter-efficient alternative, yet naive approaches fail to control multiple traits simultaneously due to destructive vector interference. In this work, we propose a modular framework for continuous, multi-dimensional personality control. Our key innovation is Sequential Adaptive Steering (SAS): a method that orthogonalizes steering vectors by training subsequent probes on the residual stream shifted by prior interventions. This approach transforms steering vectors into reusable primitives, allowing users to instantly synthesize complex, high-fidelity personality profiles by simply adjusting coefficients alpha. We validate our framework on the Big Five personality traits, demonstrating that it outperforms naive baselines in both goal adherence and coherence, enabling precise, holistic personality modulation without updating model parameters.
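
SAS 的关键效果是让各特质的引导向量相互正交,从而可按系数 alpha 线性组合而不互相干扰。下面用一个 Gram-Schmidt 式的简化类比来示意这一几何直觉(SAS 实际是在被前序干预移位后的残差流上训练探测器,此处并非论文实现):

```python
import numpy as np

def orthogonalize_sequentially(raw_vectors):
    """逐个将新的特质向量投影掉此前方向的分量,消除向量间的干扰。"""
    basis = []
    for v in raw_vectors:
        u = v.copy()
        for b in basis:
            u -= (u @ b) * b
        basis.append(u / np.linalg.norm(u))
    return basis

rng = np.random.default_rng(0)
traits = [rng.normal(size=16) for _ in range(5)]   # 虚构的五个特质方向
basis = orthogonalize_sequentially(traits)

# 组合引导:各正交方向按系数 alpha 加权求和
alphas = [1.0, -0.5, 0.0, 2.0, 0.3]
delta = sum(a * b for a, b in zip(alphas, basis))
print(round(float(delta @ basis[3]), 6))   # 正交性保证能原样读回 alpha_3 = 2.0
```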

[NLP-64] IntPro: A Proxy Agent for Context-Aware Intent Understanding via Retrieval-conditioned Inference

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在人机协同工作流中进行上下文感知意图理解时的挑战,即如何准确推理用户在特定情境下的真实意图,而不仅限于静态的意图识别。现有方法忽视了用户历史意图模式对当前意图推断的价值,导致泛化能力不足。解决方案的关键在于提出IntPro代理,通过检索条件下的意图推理机制,将用户的历史意图解释(intent explanations)抽象并存储于个体意图历史库中,从而支持基于上下文的动态检索与利用;同时采用监督微调和工具感知奖励函数的多轮组相对策略优化(multi-turn Group Relative Policy Optimization, GRPO),使代理学会何时依赖历史模式、何时直接推理,显著提升了跨场景和模型类型的意图理解性能。

链接: https://arxiv.org/abs/2603.03325
作者: Guanming Liu,Meng Wu,Peng Zhang,Yu Zhang,Yubo Shu,Xianliang Huang,Kainan Tu,Ning Gu,Liuxin Zhang,Qianying Wang,Tun Lu
机构: Fudan University (复旦大学); Lenovo Group Ltd (联想集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have become integral to modern Human-AI collaboration workflows, where accurately understanding user intent serves as a crucial step for generating satisfactory responses. Context-aware intent understanding, which involves inferring user intentions from situational environments, is inherently challenging because it requires reasoning over both the immediate context and the user’s underlying motivations that drive their behavior. Moreover, existing approaches often treat intent understanding as a static recognition task, overlooking users’ accumulated intent patterns that could provide valuable references for more accurate and generalizable understanding. To address this gap, we propose IntPro, a proxy agent that learns to adapt to individual users via retrieval-conditioned intent inference. We design intent explanations that abstract how contextual signals connect to expressed intents, and store them in an individual intent history library for retrieval. We train IntPro through supervised fine-tuning on retrieval-conditioned trajectories and multi-turn Group Relative Policy Optimization (GRPO) with tool-aware reward functions, enabling the agent to learn when to leverage historical intent patterns and when to infer directly. Experiments across three diverse scenarios (Highlight-Intent, MIntRec2.0, and Weibo Post-Sync) demonstrate that IntPro achieves strong intent understanding performance with effective context-aware reasoning capabilities across different scenarios and model types.
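
IntPro 中「从个体意图历史库检索相近意图解释」这一步,可用余弦相似度检索作假设性示意(嵌入与历史条目均为虚构):

```python
import numpy as np

rng = np.random.default_rng(0)

# 虚构的个体意图历史库:(嵌入, 存储的意图解释) 对
history = [(rng.normal(size=8), f"intent_{i}") for i in range(5)]

def retrieve(query, k=2):
    """按余弦相似度返回 k 条最相近的历史意图解释。"""
    q = query / np.linalg.norm(query)
    sims = [q @ (e / np.linalg.norm(e)) for e, _ in history]
    order = np.argsort(sims)[::-1][:k]
    return [history[i][1] for i in order]

query = history[3][0] + 0.01 * rng.normal(size=8)   # 接近某条已存嵌入的查询
print(retrieve(query)[0])   # intent_3:命中最相近的历史意图模式
```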

[NLP-65] Controlling Chat Style in Language Models via Single-Direction Editing

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中风格属性控制难题,现有方法依赖提示工程或后训练对齐,难以实现精确、灵活的风格调控。其解决方案的关键在于提出一种基于表征工程(representation engineering)的新范式,发现不同风格属性(如情感基调和语言结构)在模型激活空间中可被编码为线性方向,并据此设计了一种无需训练的轻量级方法,实现了风格的线性组合、安全可控(通过消除不良行为)以及高风格保真度与核心能力保留,且计算开销极低。

链接: https://arxiv.org/abs/2603.03324
作者: Zhenyu Xu,Victor S. Sheng
机构: Texas Tech University (德克萨斯理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Controlling stylistic attributes in large language models (LLMs) remains challenging, with existing approaches relying on either prompt engineering or post-training alignment. This paper investigates this challenge through the lens of representation engineering, testing the hypothesis that distinct stylistic attributes - from emotional tone to linguistic structure - are encoded as linear directions in the model’s activation space. We provide strong empirical evidence for this hypothesis across a wide range of styles and, based on this finding, present a lightweight, training-free method for precise style control. Our approach supports linear style composition, enhances safety by ablating undesirable behaviors, and, as confirmed by experiments on over a dozen models, achieves high style adherence while preserving core capabilities at minimal computational cost.
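
摘要中「风格被编码为激活空间中的线性方向」这一假设,常见的示意做法是用两类激活均值之差构造方向向量。以下为假设性草图(difference-of-means 只是构造此类方向的一种常见方式,数据为虚构):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# 虚构激活:分别在「正式」与「随意」风格的 prompt 上收集的隐藏状态
formal = rng.normal(size=(100, d)) + 0.5
casual = rng.normal(size=(100, d)) - 0.5

# 单一风格方向 = 两类均值之差(单位化)
direction = formal.mean(0) - casual.mean(0)
direction /= np.linalg.norm(direction)

def steer(h, alpha):
    """沿风格方向以强度 alpha 平移激活。"""
    return h + alpha * direction

h = rng.normal(size=d)
score = lambda v: float(v @ direction)
print(score(steer(h, 2.0)) > score(h))   # True:引导后沿风格方向的投影增大
```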

[NLP-66] Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐过程中出现的“过度拒绝”(over-refusal)问题,即模型将本应无害的提示错误地识别为有毒内容而拒绝响应,从而损害模型的实用性与适用性。解决方案的关键在于引入一个前置对齐阶段——DCR(Discernment via Contrastive Refinement,对比精炼辨识),通过对比精炼机制增强模型区分真正有害提示与表面看似有害提示的能力。理论分析与实证结果表明,该方法能够在显著降低过量拒绝行为的同时保持模型的安全性,并且对通用能力影响最小,提供了一种更稳健、更可解释的安全对齐路径。

链接: https://arxiv.org/abs/2603.03323
作者: Yuxiao Lu,Lin Xu,Yang Sun,Wenjun Li,Jie Shi
机构: Huawei Technologies co. ltd(华为技术有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 Pages

点击查看摘要

Abstract:Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models’ helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model’s ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model’s learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM’s capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

[NLP-67] Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)在生物知识发现能力评估中存在的两大核心问题:一是现有基准测试多依赖静态数据集,导致模型训练过程中可能已接触过评估知识,从而引发数据污染;二是由于LLM迭代速度快,静态基准迅速过时,无法有效衡量模型发现真正新知识的能力。解决方案的关键在于提出DBench-Bio——一个动态且全自动的基准框架,其核心创新在于构建了一个三阶段自动化流水线:首先从权威文献中获取高质量论文摘要,其次利用LLM自动生成科学假说类问题及对应的知识发现答案,最后通过相关性、清晰度和中心性三个维度过滤保证质量。该框架实现了每月更新、覆盖12个生物医学子领域的动态知识发现评估体系,为AI研究社区提供了一个持续演进的资源平台,首次系统性地支持对AI新知识发现能力的量化评估。

链接: https://arxiv.org/abs/2603.03322
作者: Chaoqun Yang,Xinyu Lin,Shulin Li,Wenjie Wang,Ruihan Guo,Fuli Feng,Tat-Seng Chua
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Model (LLM) agents have demonstrated remarkable potential in automatic knowledge discovery. However, rigorously evaluating an AI’s capacity for knowledge discovery remains a critical challenge. Existing benchmarks predominantly rely on static datasets, leading to inevitable data contamination where models have likely seen the evaluation knowledge during training. Furthermore, the rapid release cycles of modern LLMs render static benchmarks quickly outdated, failing to assess the ability to discover truly new knowledge. To address these limitations, we propose DBench-Bio, a dynamic and fully automated benchmark designed to evaluate AI’s biological knowledge discovery ability. DBench-Bio employs a three-stage pipeline: (1) data acquisition of rigorous, authoritative paper abstracts; (2) QA extraction utilizing LLMs to synthesize scientific hypothesis questions and corresponding discovery answers; and (3) QA filter to ensure quality based on relevance, clarity, and centrality. We instantiate this pipeline to construct a monthly-updated benchmark covering 12 biomedical sub-domains. Extensive evaluations of SOTA models reveal current limitations in discovering new knowledge. Our work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for AI research community to catalyze the development of knowledge discovery.

[NLP-68] DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following PAKDD2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)指令遵循能力评估中依赖人工标注、缺乏与人类判断模式一致性的难题。现有方法难以自动将指令分解为可验证的语义要求,并在多轮对话场景下失效。其解决方案的关键在于提出 DIALEVAL 框架,该框架基于类型论,利用双 LLM 代理自动将指令分解为带类型的谓词(typed predicates),并为不同类型的谓词定义差异化的满足语义:内容类谓词采用语义等价性标准,数值类谓词则要求精确匹配。此设计通过形式化原子性和独立性约束保障分解质量,并引入历史感知的满足函数扩展至多轮对话场景,从而显著提升评估准确性(90.38%)和与人类判断的相关性。

链接: https://arxiv.org/abs/2603.03321
作者: Nardine Basta,Dali Kaafar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PAKDD 2026

点击查看摘要

Abstract:Evaluating instruction following in Large Language Models requires decomposing instructions into verifiable requirements and assessing satisfaction–tasks currently dependent on manual annotation and uniform criteria that do not align with human judgment patterns. We present DIALEVAL, a type-theoretic framework using dual LLM agents to automate instruction decomposition into typed predicates and implement type-specific satisfaction semantics. The framework enforces formal atomicity and independence constraints during automated extraction, then applies differentiated evaluation criteria–semantic equivalence for content predicates, exact precision for numerical predicates–mirroring empirically observed human assessment patterns. Extended to multi-turn dialogues through history-aware satisfaction functions, DIALEVAL enables evaluation in conversational contexts where single-turn methods fail. Validation demonstrates 90.38% accuracy (26.45% error reduction over baselines) and substantially stronger correlation with human judgment for complex instructions.
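
按谓词类型区分满足语义的思路,可用如下玩具代码示意。注意:谓词结构、阈值,以及用词元覆盖率近似"语义等价"的做法,均为本文为说明而作的假设,并非 DIALEVAL 的实际实现。

```python
# 示意性草图:按谓词类型区分满足语义(非 DIALEVAL 原实现)。

def coverage(required: str, response: str) -> float:
    """内容谓词的粗略替代:统计所需词元在回复中出现的比例。"""
    req = set(required.lower().split())
    resp = set(response.lower().split())
    return len(req & resp) / len(req)

def satisfied(predicate: dict, response: str) -> bool:
    if predicate["type"] == "numerical":
        # 数值谓词:要求精确匹配
        return str(predicate["value"]) in response.split()
    # 内容谓词:允许语义层面的近似(此处用词元覆盖率 >= 0.5 近似)
    return coverage(predicate["value"], response) >= 0.5

preds = [
    {"type": "numerical", "value": 3},  # 例如"恰好使用 3 个要点"
    {"type": "content", "value": "summarize the main findings"},
]
resp = "Here I summarize the main findings in 3 points."
results = [satisfied(p, resp) for p in preds]
```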

[NLP-69] From We to Me: Theory Informed Narrative Shift with Abductive Reasoning

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在进行叙事转换(narrative shift)时面临的挑战,即如何在保持原始核心信息不变的前提下,将文本内容从一种叙事框架(如集体主义叙事)调整为另一种(如个人主义叙事)。这一任务对现有LLMs而言尤为困难,因其不仅需要语义一致性,还需精准匹配目标群体的叙事结构与价值取向。解决方案的关键在于提出一种基于社会科学研究理论和溯因推理(abductive reasoning)的神经符号方法(neurosymbolic approach),通过自动提取规则来推断实现叙事转变所需的特定故事元素,并以此引导LLM完成连贯且有针对性的叙事重构。实验表明,该方法显著优于零样本基线,在GPT-4o上实现了55.88%的性能提升,同时在语义相似性上也有40.4%的KL散度改善,验证了其有效性与通用性。

链接: https://arxiv.org/abs/2603.03320
作者: Jaikrishna Manojkumar Patil,Divyagna Bavikadi,Kaustuv Mukherji,Ashby Steward-Nolan,Peggy-Jean Allin,Tumininu Awonuga,Joshua Garland,Paulo Shakarian
机构: Syracuse University (雪城大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective communication often relies on aligning a message with an audience’s narrative and worldview. Narrative shift involves transforming text to reflect a different narrative framework while preserving its original core message–a task we demonstrate is significantly challenging for current Large Language Models (LLMs). To address this, we propose a neurosymbolic approach grounded in social science theory and abductive reasoning. Our method automatically extracts rules to abduce the specific story elements needed to guide an LLM through a consistent and targeted narrative transformation. Across multiple LLMs, abduction-guided transformed stories shifted the narrative while maintaining the fidelity with the original story. For example, with GPT-4o we outperform the zero-shot LLM baseline by 55.88% for collectivistic to individualistic narrative shift while maintaining superior semantic similarity with the original stories (40.4% improvement in KL divergence). For individualistic to collectivistic transformation, we achieve comparable improvements. We show similar performance across both directions for Llama-4, and Grok-4 and competitive performance for Deepseek-R1.

[NLP-70] Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)作为评价者时存在的系统性偏好偏差问题,尤其是这些偏差往往未被预先定义且难以自动发现。现有研究多聚焦于少数已知偏见假设,缺乏对未知驱动因素的系统性挖掘能力。解决方案的关键在于引入基于嵌入空间的概念提取方法,特别是稀疏自编码器(sparse autoencoder)技术,用于从LLM判官行为中自动识别可解释的偏好特征。该方法在保持预测准确性的前提下显著提升了对LLM偏好机制的可解释性分析能力,并通过超过27,000对来自多个真实人类偏好数据集的响应评估,揭示了包括强调具体性与同理心、偏好学术建议中的细节与正式程度,以及对主动法律行动(如报警或起诉)存在负面倾向等新趋势,从而实现了无需预设偏见分类体系即可系统性解析LLM判官偏好的目标。

链接: https://arxiv.org/abs/2603.03319
作者: James Wedgwood,Chhavi Yadav,Virginia Smith
机构: Carnegie-Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.
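
稀疏自编码器做概念发现的基本思路,是把嵌入投影到过完备的特征空间、保持激活非负且稀疏,使每个活跃单元更易解释。以下为纯手写的单步编码草图,权重、维度与输入均为玩具假设,与论文的训练细节无关。

```python
# 稀疏自编码器编码步骤的玩具示意(权重与数值均为假设)。

def relu(x):
    return [max(0.0, v) for v in x]

def encode(emb, W, b):
    """把嵌入投影到过完备特征空间并做 ReLU,得到非负稀疏激活。"""
    return relu([sum(w * e for w, e in zip(row, emb)) + bi
                 for row, bi in zip(W, b)])

emb = [0.5, -0.2]                                     # 2 维玩具嵌入
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [1.0, 1.0]]  # 4 个特征方向
b = [-0.1, -0.1, -0.1, -0.1]                          # 负偏置鼓励稀疏
features = encode(emb, W, b)
sparsity = sum(1 for f in features if f > 0) / len(features)
```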

[NLP-71] Quantum-Inspired Self-Attention in a Large Language Model

【速读】: 该论文旨在解决传统自注意力机制在语言建模任务中性能优化的瓶颈问题,特别是在字符错误率(Character Error Rate, CER)、词错误率(Word Error Rate, WER)和交叉熵损失(Cross-Entropy Loss)等指标上的提升空间。其解决方案的关键在于提出了一种经典量子启发式自注意力机制(Quantum-Inspired Self-Attention, QISA),并将该机制集成到GPT-1的完整自回归语言建模流程中,这是首次将此类量子自注意力机制应用于语言生成任务而非仅限于文本分类。实验表明,QISA在保持较低计算开销的前提下显著优于标准自注意力机制,在CER、WER和交叉熵损失上分别提升了15.5倍、4.7倍和13倍,同时仅需2.6倍的推理时间。

链接: https://arxiv.org/abs/2603.03318
作者: Nikita Kuznetsov,Niyaz Ismagilov,Ernesto Campos
机构: Higher School of Economics, National Research University, St. Petersburg, Russian Federation (高等经济学院,国家研究大学,圣彼得堡,俄罗斯联邦); JV Quantum LLC, State Atomic Energy Corporation Rosatom, Moscow, Russian Federation (JV量子有限责任公司,国家原子能公司罗萨托姆,莫斯科,俄罗斯联邦); Skolkovo Institute of Science and Technology, Moscow, Russian Federation (斯科尔科沃科学技术研究所,莫斯科,俄罗斯联邦)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 8 pages, 7 figures

点击查看摘要

Abstract:Recent advances in Natural Language Processing have been predominantly driven by transformer-based architectures, which rely heavily on self-attention mechanisms to model relationships between tokens in a sequence. Similarly, the field of Quantum Natural Language Processing, which seeks to leverage quantum principles to address challenges in language understanding and generation tasks, has seen the recent development of quantum self-attention mechanisms. We propose a classical quantum-inspired self-attention (QISA) mechanism and integrate it into the full autoregressive language modeling pipeline of GPT-1. To the best of our knowledge, this is the first integration of this kind, as previous quantum self-attention mechanisms have been primarily tested on text classification. In our experiments, QISA achieves better performance when compared to standard self-attention on the metrics character error rate ( 15.5\times better), word error rate ( 4.7 \times ) and cross-entropy loss ( 13 \times ). This is achieved while only requiring a 2.6\times longer inference time.

[NLP-72] Retcon – a Prompt-Based Technique for Precise Control of LLMs in Conversations

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮对话场景中难以实现逐轮行为调控的问题,尤其是在对话过程中需要动态调整模型行为时。其解决方案的关键在于提出了一种名为Retcon的少样本提示(few-shot prompting)技术,该技术通过在每一轮对话中引入特定的控制信号或示例,实现对LLM在对话轮次级别上的精确行为控制,从而显著优于零样本和传统少样本提示方法。

链接: https://arxiv.org/abs/2603.03317
作者: David Kogan,Sam Nguyen,Masanori Suzuki,Feiyang Chen
机构: Google(谷歌)
类目: Computation and Language (cs.CL)
备注: 5 pages, 2 figures, 3 appendixes with prompts and examples

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) allow agents to execute complex natural language tasks. Many LLM applications, such as support agents, teaching assistants, and interactive bots, involve multi-turn conversations. However, it remains challenging to control LLMs in the context of such interactions, particularly when the LLM behavior needs to be adjustable over the course of the conversation. In this paper, we present Retcon, a few-shot prompting technique designed to provide turn-level control over LLMs in conversations. We then demonstrate that it performs significantly better than zero-shot and traditional few-shot prompting.

[NLP-73] The Influence of Iconicity in Transfer Learning for Sign Language Recognition

【速读】: 该论文旨在解决手势识别中知识迁移的有效性问题,特别是探讨跨语言手势相似性对迁移学习(Transfer Learning, TL)性能的影响。研究聚焦于两种不同手语对(中文到阿拉伯语、希腊语到弗拉芒语)中的象形手势(iconic signs),通过对比其迁移学习表现,验证跨语言相似性是否为有效知识迁移的必要条件。解决方案的关键在于采用 Google Mediapipe 作为输入特征提取器,结合多层感知机(Multilayer Perceptron)处理空间信息与门控循环单元(Gated Recurrent Unit, GRU)建模时序动态,从而实现对符号时空特征的高效捕捉,并在两个跨语言场景下分别获得7.02%和1.07%的准确率提升,表明即使在缺乏显著跨语言相似性的条件下,基于结构化特征表示的迁移学习依然可有效提升手语识别性能。

链接: https://arxiv.org/abs/2603.03316
作者: Keren Artiaga,Conor Lynch,Haithem Afli,Mohammed Hasanuzzaman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most sign language recognition research relies on Transfer Learning (TL) from vision-based datasets such as ImageNet. Some extend this to alternatively available language datasets, often focusing on signs with cross-linguistic similarities. This body of work examines the necessity of these likenesses on effective knowledge transfer by comparing TL performance between iconic signs of two different sign language pairs: Chinese to Arabic and Greek to Flemish. Google Mediapipe was utilised as an input feature extractor, enabling spatial information of these signs to be processed with a Multilayer Perceptron architecture and the temporal information with a Gated Recurrent Unit. Experimental results showed a 7.02% improvement for Arabic and 1.07% for Flemish when conducting iconic TL from Chinese and Greek respectively.

[NLP-74] M-QUEST – Meme Question-Understanding Evaluation on Semantics and Toxicity

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在识别网络迷因(meme)中的毒性内容时面临的挑战,尤其是由于迷因依赖常识知识(commonsense knowledge)且具有多模态特性(如文本、图像、情感等),导致现有方法难以准确理解其语义和意图。解决方案的关键在于提出一个结构化的语义框架(semantic framework),系统性地识别影响迷因解释的10个核心维度:文本材料、视觉材料、场景、背景知识、情感、符号投射(Semiotic Projection)、类比映射(Analogical Mapping)、整体意图、目标社区和毒性评估。基于此框架,研究构建了首个针对迷因毒性的基准数据集 M-QUEST,包含609个关于毒性判断及其原因的常识问答对,并通过评测8个开源大语言模型在该任务上的表现,验证了该框架在促进多模态内容安全与常识推理交叉研究中的有效性。

链接: https://arxiv.org/abs/2603.03315
作者: Stefano De Giorgis,Ting-Chih Chen,Filip Ilievski
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Internet memes are a powerful form of online communication, yet their nature and reliance on commonsense knowledge make toxicity detection challenging. Identifying key features for meme interpretation and understanding, is a crucial task. Previous work has been focused on some elements contributing to the meaning, such as the Textual dimension via OCR, the Visual dimension via object recognition, upper layers of meaning like the Emotional dimension, Toxicity detection via proxy variables, such as hate speech detection, and sentiment analysis. Nevertheless, there is still a lack of an overall architecture able to formally identify elements contributing to the meaning of a meme, and be used in the sense-making process. In this work, we present a semantic framework and a corresponding benchmark for automatic knowledge extraction from memes. First, we identify the necessary dimensions to understand and interpret a meme: Textual material, Visual material, Scene, Background Knowledge, Emotion, Semiotic Projection, Analogical Mapping, Overall Intent, Target Community, and Toxicity Assessment. Second, the framework guides a semi-automatic process of generating a benchmark with commonsense question-answer pairs about meme toxicity assessment and its underlying reason. The resulting benchmark M-QUEST consists of 609 question-answer pairs for 307 memes. Thirdly, we evaluate eight open-source large language models on their ability to correctly solve M-QUEST. Our results show that current models’ commonsense reasoning capabilities for toxic meme interpretation vary depending on the dimension and architecture. Models with instruction tuning and reasoning capabilities significantly outperform the others, though pragmatic inference questions remain challenging. We release code, benchmark, and prompts to support future research intersecting multimodal content safety and commonsense reasoning.

[NLP-75] Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对不完美或含噪提示(noisy prompts)时表现出的鲁棒性不足问题,尤其在输出格式要求严格或开放性受限的应用场景中,模型性能易受提示微小变化影响。其解决方案的关键在于提出一种基于对比学习的逆直接偏好优化方法(Contrastive Learning-based Inverse Direct Preference Optimization, CoIPO),通过最小化模型在干净提示与对应噪声提示下生成的标签对齐logits之间的差异,增强模型对提示扰动的内在鲁棒性。该方法不依赖外部工具或额外模型进行预处理,而是从模型内部优化角度提升稳定性,并借助互信息理论进行深入分析,同时构建了新的评估基准NoisyPromptBench以系统验证其有效性。

链接: https://arxiv.org/abs/2603.03314
作者: Xin Yang,Letian Li,Abudukelimu Wuerkaixi,Xuxin Cheng,Cao Liu,Ke Zeng,Xunliang Cai,Wenyuan Jiang
机构: Zhejiang University (浙江大学); Tsinghua University (清华大学); Meituan LongCat Interaction Team; ETH Zürich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable and steadily improving performance across a wide range of tasks. However, LLM performance may be highly sensitive to prompt variations especially in scenarios with limited openness or strict output formatting requirements, indicating insufficient robustness. In real-world applications, user prompts provided to LLMs often contain imperfections, which may undermine the quality of the model’s responses. To address this issue, previous work has primarily focused on preprocessing prompts, employing external tools or even LLMs to refine prompt formulations in advance. However, these approaches overlook the intrinsic robustness of LLMs, and their reliance on external components introduces additional computational overhead and uncertainty. In this work, we propose a Contrastive Learning-based Inverse Direct Preference Optimization (CoIPO) method that minimizes the discrepancy between the label-aligned logits produced by the model under a clean prompt and its noisy counterpart, and conduct a detailed analysis using mutual information theory. We augment the FLAN dataset by constructing paired prompts, each consisting of a clean prompt and its corresponding noisy version for training. Additionally, to evaluate the effectiveness, we develop NoisyPromptBench, a benchmark enhanced and derived from the existing PromptBench. Experimental results conducted on NoisyPromptBench demonstrate that our proposed method achieves a significant improvement in average accuracy over the current state-of-the-art approaches. The source code of CoIPO, pair-wise FLAN datasets, and NoisyPromptBench have already been released on this https URL.
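
CoIPO 所最小化的"干净提示与噪声提示下标签对齐 logits 之间的差异",可以用 KL 散度的玩具计算来直观理解。以下 logits 数值纯属假设,KL 形式也仅为示意,并非论文损失函数的精确定义。

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q):两分布完全一致时为 0,差异越大越大。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# 假设的标签对齐 logits:同一金标 token 位置,
# 分别在干净提示与噪声提示下得到。
clean_logits = [2.0, 0.5, -1.0]
noisy_logits = [1.2, 0.9, -0.2]

alignment_loss = kl(softmax(clean_logits), softmax(noisy_logits))
```

训练时把这一差异项压向 0,即鼓励模型在噪声提示下仍给出与干净提示一致的预测分布。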

[NLP-76] How does fine-tuning improve sensorimotor representations in large language models?

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的“具身鸿沟”(embodiment gap)问题,即其纯文本表征难以与人类的感官运动体验对齐。为应对这一挑战,研究提出通过任务特定微调(task-specific fine-tuning)来引导模型内部表征向更具身体感知基础的方向演化。解决方案的关键在于:利用表示相似性分析(Representational Similarity Analysis, RSA)和维度特异性相关度量,验证微调可有效提升模型在感官运动维度上的表征一致性;同时发现,这种改进具有跨语言和相关感官-运动维度的泛化能力,但高度依赖于学习目标,无法在两类差异较大的任务格式间迁移。

链接: https://arxiv.org/abs/2603.03313
作者: Minghua Wu,Javier Conde,Pedro Reviriego,Marc Brysbaert
机构: Ghent University (根特大学); Universidad Politécnica de Madrid (马德里理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit a significant “embodiment gap”, where their text-based representations fail to align with human sensorimotor experiences. This study systematically investigates whether and how task-specific fine-tuning can bridge this gap. Utilizing Representational Similarity Analysis (RSA) and dimension-specific correlation metrics, we demonstrate that the internal representations of LLMs can be steered toward more embodied, grounded patterns through fine-tuning. Furthermore, the results show that while sensorimotor improvements generalize robustly across languages and related sensory-motor dimensions, they are highly sensitive to the learning objective, failing to transfer across two disparate task formats.
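
RSA 的核心计算是:分别为两组表示构造"表示差异矩阵"(RDM,即两两距离),再比较两个 RDM 的相关性。以下是在虚构的玩具数据上的极简实现,向量与评分均为假设示例。

```python
import math
from itertools import combinations

def rdm(vectors):
    """表示差异矩阵的上三角:所有样本对的欧氏距离。"""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# 假设的同三个词的人类感官运动评分与模型表示(均为虚构)。
human = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
model = [[0.7, 0.3], [0.6, 0.35], [0.2, 0.8]]

rsa_score = pearson(rdm(human), rdm(model))  # 越接近 1 表示几何结构越一致
```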

[NLP-77] The Logovista English–Japanese Machine Translation System

【速读】: 该论文旨在记录Logovista英语-日语机器翻译系统(a large, explicitly rule-based MT system)的架构设计、开发实践及 preserved artifacts,解决的是如何在长期实际应用中维持一个基于规则的机器翻译系统的可维护性、扩展性与稳定性的问题。其解决方案的关键在于:采用手工编写的语法规则、大规模中央词典(编码句法与语义约束)、基于图表的解析(chart-based parsing)以及加权解释评分机制(weighted interpretation scoring),以有效管理结构歧义,并通过持续的回归控制、歧义处理和覆盖范围扩展策略应对真实场景下的复杂需求。

链接: https://arxiv.org/abs/2603.03311
作者: Barton D. Wright
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper documents the architecture, development practices, and preserved artifacts of the Logovista English–Japanese machine translation system, a large, explicitly rule-based MT system that was developed and sold commercially from the early 1990s through at least 2012. The system combined hand-authored grammatical rules, a large central dictionary encoding syntactic and semantic constraints, and chart-based parsing with weighted interpretation scoring to manage extensive structural ambiguity. The account emphasizes how the system was extended and maintained under real-world usage pressures, including regression control, ambiguity management, and the limits encountered as coverage expanded. Unlike many rule-based MT systems described primarily in research settings, Logovista was deployed for decades and evolved continuously in response to practical requirements. The paper is intended as a technical and historical record rather than an argument for reviving rule-based MT, and describes the software and linguistic resources that have been preserved for potential future study.

[NLP-78] Entropic-Time Inference: Self-Organizing Large Language Model Decoding Beyond Attention

【速读】: 该论文旨在解决传统大型语言模型(Large Language Model, LLM)推理引擎在固定解码规则下难以动态优化计算资源分配的问题,尤其在高吞吐量与低延迟需求场景中,现有方法将生成过程视为按token时间线性推进,忽略了不确定性(entropy)对计算效率的指导意义。解决方案的关键在于提出一种全新的“熵时推理”(entropic-time inference)范式,其核心是将熵作为第一类控制信号,构建一个自组织推理架构,联合调度、注意力稀疏化和采样温度控制,在统一的熵控制目标下实现资源智能分配。具体包括:基于熵感知的调度策略、熵驱动的分页注意力块剪枝(entropic pruning)、以及自适应温度调节机制,使生成过程转变为以最大化不确定性减少为导向的类热力学计算流程,从而显著提升推理系统的可扩展性和资源利用效率。

链接: https://arxiv.org/abs/2603.03310
作者: Andrew Kiruluta
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern large language model (LLM) inference engines optimize throughput and latency under fixed decoding rules, treating generation as a linear progression in token time. We propose a fundamentally different paradigm: entropic-time inference, where decoding is governed by the flow of uncertainty rather than token index. We introduce a self-organizing inference architecture that jointly couples scheduling, attention sparsification, and sampling temperature under a unified entropy control objective. Our method extends vLLM with entropy-aware scheduling, entropic pruning of paged attention blocks, and adaptive temperature control that stabilizes generation near a target entropy regime. This transforms inference into a resource-intelligent thermodynamic process that allocates computation where uncertainty reduction is maximized. We present a concrete systems design, pseudocode, and integration plan, demonstrating how entropy can serve as a first-class control signal for scalable LLM inference.
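
将熵作为一等控制信号的思想,可以用一个最小的反馈回路来示意:测量下一 token 分布的熵,熵低于目标就升温、高于目标就降温,使生成稳定在目标熵区间附近。以下纯属玩具示意(logits、目标熵、步长均为假设),与论文的 vLLM 集成实现无关。

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def softmax(logits, temperature):
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def adapt_temperature(logits, target_entropy, t=1.0, lr=0.5, steps=20):
    """简单反馈回路:熵低于目标则升温,高于目标则降温。"""
    for _ in range(steps):
        h = entropy(softmax(logits, t))
        t = max(0.05, t + lr * (target_entropy - h))
    return t

logits = [3.0, 1.0, 0.5, 0.2]           # 假设的下一 token logits
t_star = adapt_temperature(logits, target_entropy=1.0)
final_entropy = entropy(softmax(logits, t_star))
```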

[NLP-79] Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)的对话历史如何影响其后续生成行为的问题,特别是探讨历史交互中的偏差(如幻觉)是否会导致模型在后续对话中表现出不一致或受限的行为模式。解决方案的关键在于提出一个名为History-Echoes的框架,从两个互补视角分析这种偏差:一是概率视角,将对话建模为马尔可夫链以量化状态一致性;二是几何视角,通过测量连续隐藏表示的一致性来捕捉潜在空间中的轨迹偏移。研究发现,这两种视角高度相关,并揭示出行为持续性表现为一种“几何陷阱”,即潜在空间中的间隙限制了模型的生成路径,从而解释了为何历史偏差会持续影响未来输出。

链接: https://arxiv.org/abs/2603.03308
作者: Adi Simhi,Fazl Barez,Martin Tutek,Yonatan Belinkov,Shay B. Cohen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History-Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model’s trajectory. Code available at this https URL.
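
论文的概率视角,即把对话建模为马尔可夫链以量化状态一致性,可以用一段逐轮状态序列上的转移概率估计来示意。以下逐轮标签(H = 幻觉轮,F = 忠实轮)为假设示例,自转移概率即可作为"行为持续性"的粗略度量。

```python
from collections import Counter

# 假设的一段对话的逐轮标签:H = hallucinated,F = faithful。
turns = ["F", "F", "H", "H", "H", "F", "H", "H"]

# 由相邻轮对估计一阶马尔可夫转移概率。
pairs = Counter(zip(turns, turns[1:]))

def p(src, dst):
    total = sum(c for (s, _), c in pairs.items() if s == src)
    return pairs[(src, dst)] / total if total else 0.0

persistence_H = p("H", "H")  # 幻觉轮之后仍是幻觉轮的概率
persistence_F = p("F", "F")
```

在这段玩具序列中,幻觉状态的自转移概率明显高于忠实状态,对应论文所说的"历史偏差会延续到后续生成"。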

[NLP-80] TopicENA: Enabling Epistemic Network Analysis at Scale through Automated Topic-Based Coding

【速读】: 该论文旨在解决传统Epistemic Network Analysis (ENA)在处理大规模文本语料时因依赖人工专家编码而导致的可扩展性不足和实际应用受限的问题。其解决方案的关键在于将BERTopic主题建模方法与ENA相结合,提出TopicENA框架,通过自动提取文本中的主题替代人工概念编码,同时保留ENA对概念间结构关联的建模能力,从而实现可扩展且可解释的网络分析。

链接: https://arxiv.org/abs/2603.03307
作者: Owen H.T. Lu,Tiffany T.Y. Hsu
机构: National Chengchi University (国立政治大学); The London School of Economics and Political Science (伦敦经济学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Epistemic Network Analysis (ENA) is a method for investigating the relational structure of concepts in text by representing co-occurring concepts as networks. Traditional ENA, however, relies heavily on manual expert coding, which limits its scalability and real-world applicability to large text corpora. Topic modeling provides an automated approach to extracting concept-level representations from text and can serve as an alternative to manual coding. To tackle this limitation, the present study merges BERTopic with ENA and introduces TopicENA, a topic-based epistemic network analysis framework. TopicENA substitutes manual concept coding with automatically generated topics while maintaining ENA’s capacity for modeling structural associations among concepts. To explain the impact of modeling choices on TopicENA outcomes, three analysis cases are presented. The first case assesses the effect of topic granularity, indicating that coarse-grained topics are preferable for large datasets, whereas fine-grained topics are more effective for smaller datasets. The second case examines topic inclusion thresholds and finds that threshold values should be adjusted according to topic quality indicators to balance network consistency and interpretability. The third case tests TopicENA’s scalability by applying it to a substantially larger dataset than those used in previous ENA studies. Collectively, these cases illustrate that TopicENA facilitates practical and interpretable ENA analysis at scale and offers concrete guidance for configuring topic-based ENA pipelines in large-scale text analysis.
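
ENA 式的共现网络构建可以示意如下:同一滑动窗口内出现的主题记一条连边,边的累计权重即概念间关联强度。以下的主题标签与窗口大小均为假设示例,仅用于说明"以自动主题替代人工编码"后网络如何生成。

```python
from itertools import combinations
from collections import Counter

# 假设的逐句主题编码(每句一个主题集合,可由 BERTopic 之类工具产生)。
utterances = [
    {"feedback", "grading"},
    {"grading", "motivation"},
    {"motivation"},
    {"feedback", "motivation"},
]

# ENA 风格共现:落在同一滑动窗口内的主题两两连边并累计权重。
window = 2
edges = Counter()
for i in range(len(utterances)):
    topics = set().union(*utterances[max(0, i - window + 1): i + 1])
    for a, b in combinations(sorted(topics), 2):
        edges[(a, b)] += 1
```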

[NLP-81] Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理结构化数据时因JSON格式冗余导致的高token消耗问题,其核心目标是评估Token-Oriented Object Notation (TOON)作为替代JSON的序列化格式在生成任务中的有效性。解决方案的关键在于通过一shot上下文学习(one-shot in-context learning)实现TOON语法的零样本生成,并对比其与传统JSON生成及约束解码(constrained decoding)结构化输出在准确性与token效率上的差异。研究发现,TOON在结构复杂度较高时展现出优越的token利用率,但其优势常被提示开销(prompt tax)所抵消;而约束解码虽能降低token使用量,但在简单结构中反而优于TOON,表明TOON的真正效率潜力可能遵循非线性增长规律,仅在足够长的上下文中才能体现其优势。

链接: https://arxiv.org/abs/2603.03306
作者: Ivan Matveev
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 2 tables. Benchmark code and data available at this https URL

点击查看摘要

Abstract:Recently presented Token-Oriented Object Notation (TOON) aims to replace JSON as a serialization format for passing structured data to LLMs with significantly reduced token usage. While showing solid accuracy in LLM comprehension, there is a lack of tests against JSON generation. Though never present in training data, TOON syntax is simple enough to suggest one-shot in-context learning could support accurate generation. The inevitable prompt overhead can be an acceptable trade-off for shorter completions. To test this, we conducted a benchmark creating several test cases with regard to structural complexity, a validation pipeline, and comparing plain JSON generation vs structured output (via constrained decoding) JSON generation vs TOON one-shot in-context learning generation. JSON structured output was included to establish a minimum token budget baseline and to set a starting point for future experiments testing TOON constrained decoding inference enforcement. Key findings: TOON shows promising accuracy/token consumption ratio for in-domain generation tasks, though this advantage is often reduced by the “prompt tax” of instructional overhead in shorter contexts. Plain JSON generation shows the best one-shot and final accuracy, even compared with constrained decoding structured output, where the only significant advantage is the lowest token usage as a trade-off for slightly decreased accuracy overall and significant degradation for some models. Notably, for simple structures, this “lowest token usage” of constrained decoding outperformed even TOON, hinting that TOON enforcing via frameworks such as xgrammar may not yield the desired results. Furthermore, the results suggest a scaling hypothesis: TOON’s true efficiency potential likely follows a non-linear curve, shining only beyond a specific point where cumulative syntax savings amortize the initial prompt overhead.
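
TOON 式表格化编码相对 JSON 的压缩效果,可以用一个玩具例子直观感受。注意:下面的"TOON 式"写法只是近似示意(例如布尔值沿用了 Python 的 `True`),并非官方规范,也以字符数粗略代替 token 数。

```python
import json

# 三条同构记录:JSON 须为每条记录重复全部键名。
rows = [{"id": i, "name": f"user{i}", "active": True} for i in range(1, 4)]
as_json = json.dumps(rows)

# TOON 式表格化编码(示意,非官方规范):键名只声明一次,
# 之后每条记录一行、逗号分隔。
header = "items[{}]{{{}}}:".format(len(rows), ",".join(rows[0]))
lines = [",".join(str(r[k]) for k in rows[0]) for r in rows]
as_toon = "\n".join([header] + lines)

savings = 1 - len(as_toon) / len(as_json)  # 字符级节省比例
```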

[NLP-82] Draft-Conditioned Constrained Decoding for Structured Generation in LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成结构化输出(如JSON对象、API调用等)时因语法错误导致输出不可用的问题。传统约束解码(Constrained Decoding)虽能通过掩码和重归一化实现逐标记的有效性控制,但在模型对合法延续分配概率质量较低时,易引导生成走向局部有效但语义错误的轨迹。解决方案的关键在于提出草稿条件约束解码(Draft-Conditioned Constrained Decoding, DCCD),其核心思想是将语义规划与结构强制分离:首先无约束地生成一个草稿(draft),随后基于该草稿进行条件化约束解码,从而确保输出有效性。理论分析表明,草稿条件可提升可行概率质量并降低硬约束带来的累积“投影税”,且支持最佳K个草稿选择策略,显著提升结构化准确率(如GSM8K上达+24个百分点),并以更少参数实现优于大型模型基线的效果。

链接: https://arxiv.org/abs/2603.03305
作者: Avinash Reddy,Thayne T. Walker,James S. Ide,Amrit Singh Bedi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to generate executable outputs, JSON objects, and API calls, where a single syntax error can make the output unusable. Constrained decoding enforces validity token-by-token via masking and renormalization, but it can distort generation when the model assigns low probability mass to valid continuations, pushing decoding toward locally valid yet semantically incorrect trajectories. We propose \emphDraft-Conditioned Constrained Decoding (DCCD), a simple two-step, training-free inference procedure that decouples semantic planning from structural enforcement: an unconstrained draft is generated first, and constrained decoding is then applied, conditioned on this draft, to guarantee validity. We analyze DCCD through a KL-projection view, showing that draft conditioning increases feasible mass and reduces the cumulative “projection tax” induced by hard constraints, with an optional best-of- K draft selection. Across structured reasoning benchmarks, DCCD improves strict structured accuracy by up to +24 percentage points over standard constrained decoding (e.g., 15.2% to 39.0% on GSM8K with a 1B model), and enables smaller model pairs to match or exceed much larger constrained baselines, yielding substantial gains in parameter efficiency.
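
约束解码的基本步骤"掩码并重归一化"可以用玩具代码说明:语法不允许的 token 概率被置零,剩余概率重新归一化。即便模型把最高概率给了非法 token,输出也保持有效;但概率质量被迫转移到局部有效的候选上,这正是 DCCD 通过先生成草稿再条件化解码想要缓解的失真。词表与数值均为假设。

```python
# 约束解码单步"掩码 + 重归一化"的玩具示意(词表与概率均为假设)。
vocab = ["{", "}", '"name"', ":", "42", "oops"]

def renormalize(probs, valid):
    """把语法不允许的 token 概率置零,再对剩余概率归一化。"""
    masked = [p if tok in valid else 0.0 for tok, p in zip(vocab, probs)]
    z = sum(masked)
    return [p / z for p in masked]

# 假设的下一 token 分布:模型最想输出非法的 "oops",
# 但此处语法只允许 "{" 或 '"name"'。
probs = [0.10, 0.05, 0.20, 0.05, 0.10, 0.50]
constrained = renormalize(probs, valid={"{", '"name"'})
```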

[NLP-83] HumanLM: Simulating Users with State Alignment Beats Response Imitation

【速读】: 该论文旨在解决现有用户模拟器(user simulator)仅能模仿表层语言模式和风格,而无法准确反映真实用户内在状态(如信念、情绪等心理维度)的问题。传统方法在用户反馈建模中缺乏对心理机制的刻画,导致模拟结果与真实用户行为存在偏差。解决方案的关键在于提出一种名为HumanLM的新训练框架,其核心创新是引入强化学习机制,使模型不仅生成响应文本,还同时生成与真实响应对齐的自然语言潜在状态(latent states),这些潜在状态对应于心理学基础的状态维度(psychologically grounded state dimensions),从而更真实地捕捉驱动用户行为的心理动因;随后,HumanLM将这些对齐的潜在状态合成出高度拟真的用户响应。

链接: https://arxiv.org/abs/2603.03303
作者: Shirley Wu,Evelyn Choi,Arpandeep Khatua,Zhanghan Wang,Joy He-Yueya,Tharindu Cyril Weerasooriya,Wei Wei,Diyi Yang,Jure Leskovec,James Zou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 27 pages, 17 figures, 9 tables

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used to simulate how specific users respond to a given context, enabling more user-centric applications that rely on user feedback. However, existing user simulators mostly imitate surface-level patterns and language styles, which fail to reflect the underlying states of real users (e.g., beliefs and emotions). To address these limitations, we propose a novel training framework, HumanLM, which builds user simulators that accurately reflect real users. Our key insight is that, in addition to generating responses, the model should generate natural-language latent states that align with ground-truth responses through reinforcement learning. These latent states correspond to a set of psychologically grounded state dimensions that drive how real users respond. HumanLM further synthesizes these aligned latent states into responses that accurately represent real users. For extensive evaluation, we develop Humanual, a comprehensive benchmark for simulating real users based on public data. Humanual consists of six large-scale datasets with 26k users and 216k responses in total, spanning diverse tasks such as generating user responses to daily life issues, political blogs, and chat sessions with LLM assistants. Across datasets, HumanLM significantly outperforms alternative approaches, achieving an average relative improvement of 16.3% in alignment scores from an LLM judge. In a real-time simulation study with 111 participants, HumanLM achieves the highest similarity to real user responses and competitive human-likeness scores.

[NLP-84] From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中对响应速度和成本的高要求问题,通过语义缓存(Semantic Caching)技术复用语义相似请求以提升效率。其核心挑战在于传统缓存假设被打破,需设计新的缓存策略来应对语义匹配带来的复杂性。解决方案的关键在于提出两类策略:一是证明最优离线语义缓存策略为NP-hard问题,并设计多项式时间启发式算法;二是构建在线语义感知缓存策略,融合局部性(locality)、频率(frequency)与近期性(recency)三重维度,显著提升语义准确性。实验表明,该方法在多个数据集上优于基于频率的基线,同时为未来优化提供了明确方向。

链接: https://arxiv.org/abs/2603.03301
作者: Dvir David Biton,Roy Friedman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid adoption of large language models (LLMs) has created demand for faster responses and lower costs. Semantic caching, reusing semantically similar requests via their embeddings, addresses this need but breaks classic cache assumptions and raises new challenges. In this paper, we explore offline policies for semantic caching, proving that implementing an optimal offline policy is NP-hard, and propose several polynomial-time heuristics. We also present online semantic-aware cache policies that combine recency, frequency, and locality. Evaluations on diverse datasets show that while frequency-based policies are strong baselines, our novel variant improves semantic accuracy. Our findings reveal effective strategies for current systems and highlight substantial headroom for future innovation. All code is open source.
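为帮助理解语义缓存的基本思路,下面给出一个极简的 Python 示意:用余弦相似度在缓存中查找"足够相近"的请求并复用其响应。其中相似度阈值、按命中频率淘汰的策略均为示意性假设,并非论文提出的具体策略。

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: reuse a stored response when a new query's
    embedding is close enough to a cached one (cosine similarity)."""

    def __init__(self, threshold=0.9, capacity=100):
        self.threshold = threshold   # similarity needed for a "close enough" hit
        self.capacity = capacity
        self.entries = []            # list of (embedding, response, hits)

    @staticmethod
    def _cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, emb):
        # find the most similar cached entry
        best, best_sim = None, -1.0
        for i, (e, _resp, _hits) in enumerate(self.entries):
            sim = self._cos(emb, e)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            e, resp, hits = self.entries[best]
            self.entries[best] = (e, resp, hits + 1)  # record frequency signal
            return resp
        return None  # cache miss -> caller would query the LLM

    def put(self, emb, response):
        if len(self.entries) >= self.capacity:
            # evict the least-frequently-hit entry (frequency-based baseline)
            self.entries.pop(min(range(len(self.entries)),
                                 key=lambda i: self.entries[i][2]))
        self.entries.append((emb, response, 0))
```

与精确键匹配的缓存不同,这里"命中"由阈值决定,因此淘汰与准入策略都需要语义感知,这正是摘要所述 NP-hard 结果与启发式策略的动机所在。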

[NLP-85] Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys

【速读】:该论文旨在解决法律领域中检索增强生成(Retrieval-Augmented Generation, RAG)模型缺乏系统性评估基准的问题,特别是在多州失业保险法规这一复杂、高精度场景下的表现不佳。其解决方案的关键在于引入并扩展LaborBench基准测试平台,通过与美国劳工部(U.S. Department of Labor, DOL)律师历时数月人工整理的法定要求进行比对,对三种新兴工具(包括自研的Statutory Research Assistant, STARA及两个商业平台Westlaw AI和Lexis+ AI)进行全面评估。研究发现,STARA在准确率上显著优于传统RAG(从70%提升至83%),且通过细致的错误分析揭示出部分看似AI错误实为DOL人工遗漏,使得STARA实际准确率高达92%,从而为构建具备跨司法管辖区法律推理能力的RAG系统提供了可操作的设计原则与实践路径。

链接: https://arxiv.org/abs/2603.03300
作者: Mohamed Afane,Emaan Hariri,Derek Ouyang,Daniel E. Ho
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: Accepted at the 5th ACM Symposium on Computer Science and Law (CSLaw '26)

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) offers significant potential for legal AI, yet systematic benchmarks are sparse. Prior work introduced LaborBench to benchmark RAG models based on ostensible ground truth from an exhaustive, multi-month, manual enumeration of all U.S. state unemployment insurance requirements by U.S. Department of Labor (DOL) attorneys. That prior work found poor performance of standard RAG (70% accuracy on Boolean tasks). Here, we assess three emerging tools not previously evaluated on LaborBench: the Statutory Research Assistant (STARA), a custom statutory research tool, and two commercial tools by Westlaw and LexisNexis that market AI statutory survey capabilities. We make five main contributions. First, we show that STARA achieves substantial performance gains, boosting accuracy to 83%. Second, we show that commercial platforms fare poorly, with accuracy of 58% (Westlaw AI) and 64% (Lexis+ AI), even worse than standard RAG. Third, we conduct a comprehensive error analysis, comparing our outputs to those compiled by DOL attorneys, and document both reasoning errors, such as confusion between related legal concepts and misinterpretation of statutory exceptions, and retrieval failures, where relevant statutory provisions are not captured. Fourth, we discover that many apparent errors are actually significant omissions by DOL attorneys themselves, such that STARA’s actual accuracy is 92%. Fifth, we chart the path forward for legal RAG through concrete design principles, offering actionable guidance for building AI systems capable of accurate multi-jurisdictional legal research.

[NLP-86] How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在学术写作中生成虚假引用(即“引用幻觉”)的问题,这一现象在不同模型、学科领域和提示条件下的发生率尚不明确。研究通过大规模审计(共生成69,557条引用实例,并与CrossRef、OpenAlex和Semantic Scholar三个学术数据库比对)揭示了幻觉率在11.4%至56.8%之间波动,且显著受模型类型、学科领域和提示方式影响;关键发现包括:幻觉行为是提示诱导而非模型固有特性;提出两种高效过滤策略——多模型共识(3个以上模型一致引用时准确率达95.6%)和同提示内重复(重复出现2次以上时准确率达88.9%);并开发了一个基于书目字符串特征的轻量级分类器,在无需外部数据库查询的情况下实现AUC 0.834的泛化性能,可作为推理阶段的前置筛查工具。

链接: https://arxiv.org/abs/2603.03299
作者: MZ Naser
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been noted to fabricate scholarly citations, yet the scope of this behavior across providers, domains, and prompting conditions remains poorly quantified. We present one of the largest citation hallucination audits to date, in which 10 commercially deployed LLMs were prompted across four academic domains, generating 69,557 citation instances verified against three scholarly databases (namely, CrossRef, OpenAlex, and Semantic Scholar). Our results show that the observed hallucination rates span a fivefold range (between 11.4% and 56.8%) and are strongly shaped by model, domain, and prompt framing. Our results also show that no model spontaneously generates citations when unprompted, which seems to establish hallucination as prompt-induced rather than intrinsic. We identify two practical filters: 1) multi-model consensus (when more than 3 LLMs cite the same work, accuracy reaches 95.6%, a 5.8-fold improvement), and 2) within-prompt repetition (more than 2 replications yields 88.9% accuracy). In addition, we present findings on generational model tracking, which reveal that improvements are not guaranteed when deploying newer LLMs, and on capacity scaling, which appears to reduce hallucination within model families. Finally, a lightweight classifier trained solely on bibliographic string features is developed to classify hallucinated citations from verified citations, achieving AUC 0.876 in cross-validation and 0.834 in LOMO generalization (without querying any external database). This classifier offers a pre-screening tool deployable at inference time.
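摘要中的"多模型共识"过滤器思路可以用几行 Python 示意:统计每条引用被多少个不同模型独立生成,仅保留达到阈值的引用。引用的归一化方式(此处假设已规整为可哈希的键,如小写的"标题|年份"字符串)为示意性假设,并非论文的实际流程。

```python
from collections import Counter

def consensus_filter(citations_by_model, min_models=3):
    """Keep only citations produced by at least `min_models` distinct models.

    citations_by_model: dict mapping model name -> iterable of normalized
    citation keys (e.g. lowercased "title|year" strings).
    Returns the set of citations that pass the consensus filter.
    """
    counts = Counter()
    for _model, cites in citations_by_model.items():
        for c in set(cites):          # count each model at most once per work
            counts[c] += 1
    return {c for c, n in counts.items() if n >= min_models}
```

阈值越高,留下的引用越可能真实存在,但召回率也随之下降,这与摘要中报告的准确率提升相一致。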

[NLP-87] TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在提示词(prompt)敏感性方面的脆弱性问题,即模型行为对提示词的表述方式高度依赖,导致性能不稳定。现有自动化提示工程方法普遍存在三个局限:(i) 需要任务特定的标注训练集,(ii) 依赖昂贵的迭代优化过程生成单一数据集级提示,(iii) 每当新任务出现时必须从头重新运行。解决方案的关键在于提出 TATRA——一种无需标注数据的提示构造方法,通过动态合成与用户指令配对的实例特定少样本示例(instance-specific few-shot prompts),实现按需生成有效的上下文示范(in-context examples)。该方法避免了任务特异性优化循环,在多个文本分类和数学推理基准测试中表现优异,甚至在 GSM8K 和 DeepMath 上达到当前最优水平,表明单个实例级提示构造比耗时的全局优化更有效。

链接: https://arxiv.org/abs/2603.03298
作者: Bartosz Dziuba,Kacper Kuchta,Paweł Batorski,Przemysław Spurek,Paul Swoboda
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have improved substantially in alignment, yet their behavior remains highly sensitive to prompt phrasing. This brittleness has motivated automated prompt engineering, but most existing methods (i) require a task-specific training set, (ii) rely on expensive iterative optimization to produce a single dataset-level prompt, and (iii) must be rerun from scratch for each new task. We introduce TATRA, a dataset-free prompting method that constructs instance-specific few-shot prompts by synthesizing on-the-fly examples to accompany a user-provided instruction. TATRA requires no labeled training data and avoids task-specific optimization loops, while retaining the benefits of demonstration-based prompting. Across standard text classification benchmarks, TATRA matches or improves over strong prompt-optimization baselines that depend on training data and extensive search. On mathematical reasoning benchmarks, TATRA achieves state-of-the-art performance on GSM8K and DeepMath, outperforming methods that explicitly optimize prompts on those tasks. Our results suggest that per-instance construction of effective in-context examples is more important than running long, expensive optimization loops to produce a single prompt per task. We will make all code publicly available upon acceptance of the paper. Code is available at this https URL

[NLP-88] TTSR: Test-Time Self-Reflection for Continual Reasoning Improvement

【速读】: 该论文旨在解决测试时训练(Test-time Training)在提升大语言模型(Large Language Models, LLMs)推理能力时面临的两大挑战:一是测试问题通常难度较高,导致自动生成的伪标签不可靠;二是现有方法缺乏有效机制来针对模型特定的推理弱点进行适应,从而造成学习效率低下。解决方案的关键在于提出一种名为TTSR(Test-time Self-Reflective Training)的自反思式测试时自我演化训练框架,其核心机制是利用单一预训练语言模型在测试阶段交替扮演“学生”(Student)和“教师”(Teacher)角色:“学生”负责解决问题并从合成的变体问题中学习,“教师”则分析学生失败的推理轨迹,总结重复出现的推理缺陷,并据此生成针对性的变体问题,从而引导模型在可学习范围内通过持续的自我演化循环实现稳定且高效的推理能力提升。

链接: https://arxiv.org/abs/2603.03297
作者: Haoyang He,Zihua Rong,Liangjie Zhao,Yunjia Zhao,Lan Yang,Honggang Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); Southwestern University of Finance and Economics (西南财经大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: work in progress

点击查看摘要

Abstract:Test-time Training enables model adaptation using only test questions and offers a promising paradigm for improving the reasoning ability of large language models (LLMs). However, it faces two major challenges: test questions are often highly difficult, making self-generated pseudo-labels unreliable, and existing methods lack effective mechanisms to adapt to a model’s specific reasoning weaknesses, leading to inefficient learning. To address these issues, we propose TTSR, a self-reflective test-time self-evolving training framework. TTSR employs a single pretrained language model that alternates between the roles of a Student and a Teacher at test time. The Student focuses on solving problems and learning from synthesized variant questions, while the Teacher analyzes the Student’s failed reasoning trajectories, summarizes recurring reasoning weaknesses, and synthesizes targeted variant questions accordingly. This process guides the model to improve within a learnable regime through a continual self-evolving loop. Experimental results on multiple challenging mathematical reasoning benchmarks show that TTSR consistently improves reasoning performance and generalizes well across different model backbones and general-domain reasoning tasks. These findings suggest that teacher-mediated self-reflection provides an effective pathway for stable and continual reasoning improvement at test time.

[NLP-89] Language Model Goal Selection Differs from Humans in an Open-Ended Task

【速读】: 该论文试图解决的问题是:当前大语言模型(Large Language Models, LLMs)在自主目标选择任务中是否能够有效模拟人类行为,即它们能否作为人类目标选择的可靠代理。研究表明,尽管LLMs被广泛应用于辅助决策场景(如个人助理、科学发现和政策研究),但其在开放性学习任务中表现出与人类显著偏离的行为模式——多数模型倾向于锁定单一解法(奖励黑客行为)或表现不佳,且缺乏个体差异性,即使经过链式思维推理(chain-of-thought reasoning)或人格引导(persona steering)也仅带来有限改进。解决方案的关键在于揭示了当前LLMs在目标探索多样性与适应性方面与人类存在本质差异,强调需谨慎对待其在涉及人类偏好建模的应用中的替代作用。

链接: https://arxiv.org/abs/2603.03295
作者: Gaia Molinaro,Dave August,Danielle Perszyk,Anne G. E. Collins
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As large language models (LLMs) get integrated into human decision-making, they are increasingly choosing goals autonomously rather than only completing human-defined ones, assuming they will reflect human preferences. However, human-LLM similarity in goal selection remains largely untested. We directly assess the validity of LLMs as proxies for human goal selection in a controlled, open-ended learning task borrowed from cognitive science. Across four state-of-the-art models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Centaur), we find substantial divergence from human behavior. While people gradually explore and learn to achieve goals with diversity across individuals, most models exploit a single identified solution (reward hacking) or show surprisingly low performance, with distinct patterns across models and little variability across instances of the same model. Even Centaur, explicitly trained to emulate humans in experimental settings, poorly captures people’s goal selection. Chain-of-thought reasoning and persona steering provide limited improvements. These findings highlight the uniqueness of human goal selection, cautioning against replacing it with current models in applications such as personal assistance, scientific discovery, and policy research.

[NLP-90] Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在农业咨询场景中因缺乏事实准确性、建议泛化且沟通风格不符合小农户需求而导致的不可靠推荐问题。其核心解决方案是提出一种混合式LLM架构,关键在于将事实检索与对话生成解耦:通过LoRA(Low-Rank Adaptation)监督微调专家标注的GOLDEN FACTS(原子级、经验证的农业知识单元)以优化事实召回率,同时引入独立的“拼接层”(stitching layer)将检索到的事实转化为符合文化语境且安全可控的响应,从而兼顾准确性和实用性。该方法在印度比哈尔邦多作物场景下验证有效,显著提升F1分数和事实一致性,同时使用较小模型即可达到与前沿模型相当的性能,成本大幅降低。

链接: https://arxiv.org/abs/2603.03294
作者: Sanyam Singh,Naga Ganesh,Vineet Singh,Lakshmi Pedapudi,Ritesh Kumar,SSP Jyothi,Archana Karanam,C. Yashoda,Mettu Vijaya Rekha Reddy,Shesha Phani Debbesa,Chandan Dash
机构: Digital Green(数字绿洲)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 5 figures, 9 tables

点击查看摘要

Abstract:Large Language Models show promise for agricultural advisory, yet vanilla models exhibit unsupported recommendations, generic advice lacking specific, actionable detail, and communication styles misaligned with smallholder farmer needs. In high stakes agricultural contexts, where recommendation accuracy has direct consequences for farmer outcomes, these limitations pose challenges for responsible deployment. We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall, while a separate stitching layer transforms retrieved facts into culturally appropriate, safety-aware responses. Our evaluation framework, DG-EVAL, performs atomic fact verification (measuring recall, precision, and contradiction detection) against expert-curated ground truth rather than Wikipedia or retrieved documents. Experiments across multiple model configurations on crops and queries from Bihar, India show that fine-tuning on curated data substantially improves fact recall and F1, while maintaining high relevance. Using a fine-tuned smaller model achieves comparable or better factual quality at a fraction of the cost of frontier models. A stitching layer further improves safety subscores while maintaining high conversational quality. We release the farmerchat-prompts library to enable reproducible development of domain-specific agricultural AI.

[NLP-91] SE-Search: Self-Evolving Search Agent via Memory and Dense Reward

【速读】: 该论文旨在解决检索增强生成(Retrieval Augmented Generation, RAG)系统在在线搜索过程中因累积无关或噪声文档而导致的效率低下,以及依赖稀疏强化学习信号导致训练困难的问题。其解决方案的关键在于提出一种自进化搜索(Self-Evolving Search, SE-Search)代理,通过三个核心组件实现:记忆净化(memory purification)以保留关键证据并过滤冗余内容,原子化查询训练(atomic query training)促进更短且多样化的查询策略以提升信息获取质量,以及密集奖励机制(dense rewards)提供细粒度反馈以加速模型训练。该方法遵循“思考-搜索-记忆”(Think-Search-Memorize)策略,在单跳和多跳问答任务上显著优于现有基线,例如在绝对指标上比Search-R1提升10.8点,相对提升达33.8%。

链接: https://arxiv.org/abs/2603.03293
作者: Jian Li,Yizhang Jin,Dongqi Liu,Hang Ding,Jiafu Wu,Dongsheng Chen,Yunhang Shen,Yulei Qin,Ying Tai,Chengjie Wang,Xiaotong Yuan,Yabiao Wang
机构: Nanjing University (南京大学); Tencent YoutuLab (腾讯优图实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval augmented generation (RAG) reduces hallucinations and factual errors in large language models (LLMs) by conditioning generation on retrieved external knowledge. Recent search agents further cast RAG as an autonomous, multi-turn information-seeking process. However, existing methods often accumulate irrelevant or noisy documents and rely on sparse reinforcement learning signals. We propose Self-Evolving Search (SE-Search), an agent that improves online search behavior through three components: memory purification, atomic query training, and dense rewards. SE-Search follows a Think-Search-Memorize strategy that retains salient evidence while filtering irrelevant content. Atomic query training promotes shorter and more diverse queries, improving evidence acquisition. Dense rewards provide fine-grained feedback that speeds training. Experiments on single-hop and multi-hop question answering benchmarks show that SE-Search-3B outperforms strong baselines, yielding a 10.8 point absolute improvement and a 33.8% relative gain over Search-R1. We will make the code and model weights publicly available upon acceptance.

[NLP-92] One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

【速读】: 该论文旨在解决基于奖励模型(Reward Model, RM)的偏好微调中存在的奖励黑客(reward hacking)问题,即语言模型(Language Model, LM)从存在偏差或缺陷的RM中学习到不良行为。研究通过系统性测量五个高质量RM发现,即使在最先进的模型中,长度偏倚、谄媚倾向(sycophancy)和过度自信等问题依然存在,并识别出新的偏倚类型,如对特定模型风格和答案顺序的偏好。针对低复杂度偏倚(源于虚假相关),论文提出一种简单的事后干预机制——机械式奖励塑造(mechanistic reward shaping),其关键在于利用最小标注数据有效减少目标偏倚,同时不损害奖励质量,并具备良好的可扩展性和分布外泛化能力。

链接: https://arxiv.org/abs/2603.03291
作者: Daniel Fein,Max Lamparth,Violet Xiang,Mykel J. Kochenderfer,Nick Haber
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence. We also discover new issues related to bias toward model-specific styles and answer-order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality and while using minimal labeled data. The method is extensible to new biases, model-internal, and generalizes out-of-distribution.

信息检索

[IR-0] Turning Trust to Transactions: Tracking Affiliate Marketing and FTC Compliance in YouTube's Influencer Economy

【速读】:该论文旨在解决YouTube平台上 affiliate marketing(联盟营销)中存在的透明度与伦理问题,特别是创作者未按规定披露其 affiliate 关系导致的消费者误导风险。研究通过构建基于Web测量和自然语言处理(Natural Language Processing, NLP)最新进展的分析工具,对近10年、涵盖200万条视频和54万创作者的数据集进行系统性分析,揭示了affiliate links在平台上的普遍性及披露合规率低的问题。解决方案的关键在于利用平台层面的标准化披露功能来显著提升合规行为,表明平台治理机制在改善 influencer economy(影响力经济)中透明度与信任方面的核心作用。

链接: https://arxiv.org/abs/2603.04383
作者: Chen Sun,Yash Vekaria,Zubair Shafiq,Rishab Nithyanand
机构: 未知
类目: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: ICWSM 2026

点击查看摘要

Abstract:YouTube has evolved into a powerful platform where creators monetize their influence through affiliate marketing, raising concerns about transparency and ethics, especially when creators fail to disclose their affiliate relationships. Although regulatory agencies like the US Federal Trade Commission (FTC) have issued guidelines to address these issues, non-compliance and consumer harm persist, and the extent of these problems remains unclear. In this paper, we introduce tools, developed with insights from recent advances in Web measurement and NLP research, to examine the state of the affiliate marketing ecosystem on YouTube. We apply these tools to a 10-year dataset of 2 million videos from nearly 540,000 creators, analyzing the prevalence of affiliate marketing on YouTube and the rates of non-compliant behavior. Our findings reveal that affiliate links are widespread, yet disclosure compliance remains low, with most videos failing to meet FTC standards. Furthermore, we analyze the effects of different stakeholders in improving disclosure behavior. Our study suggests that the platform is highly associated with improved compliance through standardized disclosure features. We recommend that regulators and affiliate partners collaborate with platforms to enhance transparency, accountability, and trust in the influencer economy.

[IR-1] τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

【速读】:该论文旨在解决当前对话代理(Conversational Agents)在知识密集型场景中,缺乏对检索与工具调用协同能力的综合性评估问题,尤其在长时交互中如何有效整合非结构化外部知识与工具输出以实现可验证且合规的状态变更。解决方案的关键在于提出 τ-Knowledge,作为 τ-Bench 的扩展,构建了一个模拟金融科技客户支持工作流的基准环境 τ-Banking,其中代理需从约700个相互关联的知识文档中精准检索信息,并结合工具调用完成账户更新操作。实验表明,即便使用前沿模型与高推理预算,准确率仍仅约25.5%,凸显了现有模型在复杂知识库检索和内部政策推理上的显著不足,从而为开发真正能集成非结构化知识的实用型智能代理提供了现实测试平台。

链接: https://arxiv.org/abs/2603.04370
作者: Quan Shi,Alexandra Zytek,Pedram Razavi,Karthik Narasimhan,Victor Barres
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 29 pages (10 main + 19 appendix)

点击查看摘要

Abstract:Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce τ-Knowledge, an extension of τ-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, τ-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only ~25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, τ-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.

[IR-2] CAMMSR: Category-Guided Attentive Mixture of Experts for Multimodal Sequential Recommendation ICDE2026

【速读】:该论文旨在解决多模态序列推荐(Multimodal Sequential Recommendation)中因依赖启发式融合策略而导致的用户-模态交互动态性与情境敏感性建模不足的问题,尤其针对用户对不同模态(如文本和图像)的偏好存在个体差异及同一用户在不同物品或类别下的变化性,以及模态间协同效应未被充分挖掘的挑战。其解决方案的关键在于提出一种类别引导的注意力专家混合模型(Category-guided Attentive Mixture of Experts, CAMoE),该模块通过辅助类别预测任务动态分配模态权重,从而实现多视角下专用项表示的学习和模态间协同关系的显式建模;同时设计模态交换对比学习任务,借助序列级增强提升跨模态表征对齐能力,最终实现自适应、协同性强且以用户为中心的多模态序列推荐。

链接: https://arxiv.org/abs/2603.04320
作者: Jinfeng Xu,Zheyu Chen,Shuo Yang,Jinze Li,Hewei Wang,Yijie Li,Jianheng Tang,Yunhuai Liu,Edith C. H. Ngai
机构: 未知
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted by ICDE 2026

点击查看摘要

Abstract:The explosion of multimedia data in information-rich environments has intensified the challenges of personalized content discovery, positioning recommendation systems as an essential form of passive data management. Multimodal sequential recommendation, which leverages diverse item information such as text and images, has shown great promise in enriching item representations and deepening the understanding of user interests. However, most existing models rely on heuristic fusion strategies that fail to capture the dynamic and context-sensitive nature of user-modal interactions. In real-world scenarios, user preferences for modalities vary not only across individuals but also within the same user across different items or categories. Moreover, the synergistic effects between modalities, where combined signals trigger user interest in ways isolated modalities cannot, remain largely underexplored. To this end, we propose CAMMSR, a Category-guided Attentive Mixture of Experts model for Multimodal Sequential Recommendation. At its core, CAMMSR introduces a category-guided attentive mixture of experts (CAMoE) module, which learns specialized item representations from multiple perspectives and explicitly models inter-modal synergies. This component dynamically allocates modality weights guided by an auxiliary category prediction task, enabling adaptive fusion of multimodal signals. Additionally, we design a modality swap contrastive learning task to enhance cross-modal representation alignment through sequence-level augmentation. Extensive experiments on four public datasets demonstrate that CAMMSR consistently outperforms state-of-the-art baselines, validating its effectiveness in achieving adaptive, synergistic, and user-centric multimodal sequential recommendation.

[IR-3] LabelBuddy: An Open Source Music and Audio Language Annotation Tagging Tool Using AI Assistance

【速读】:该论文旨在解决音乐信息检索(Music Information Retrieval, MIR)领域中因生成式AI和大型音频语言模型(Large Audio Language Models, LALMs)发展所引发的静态标签体系无法满足人类意图对齐需求的问题,尤其针对当前缺乏能够捕捉音频标注主观性特征的开源基础设施这一瓶颈。其解决方案的关键在于提出一个名为LabelBuddy的开源协作式自动标记音频标注工具,通过容器化后端将用户界面与推理过程解耦,支持用户接入自定义模型进行AI辅助预标注,并具备多用户共识机制、模型隔离能力及扩展大型音频语言模型(LALMs)和自主AI代理的路线图。

链接: https://arxiv.org/abs/2603.04293
作者: Ioannis Prokopiou,Ioannis Sina,Agisilaos Kounelis,Pantelis Vikatos,Themos Stafylakis
机构: Athens University of Economics and Business (雅典经济与商业大学); Orfium; University of Patras (帕特雷大学); Archimedes/Athena R.C. (阿基米德/雅典研究中心)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at NLP4MusA 2026 (4th Workshop on NLP for Music and Audio)

点击查看摘要

Abstract:The advancement of machine learning (ML), Large Audio Language Models (LALMs), and autonomous AI agents in Music Information Retrieval (MIR) necessitates a shift from static tagging to rich, human-aligned representation learning. However, the scarcity of open-source infrastructure capable of capturing the subjective nuances of audio annotation remains a critical bottleneck. This paper introduces LabelBuddy, an open-source collaborative auto-tagging audio annotation tool designed to bridge the gap between human intent and machine understanding. Unlike static tools, it decouples the interface from inference via containerized backends, allowing users to plug in custom models for AI-assisted pre-annotation. We describe the system architecture, which supports multi-user consensus, containerized model isolation, and a roadmap for extending agents and LALMs. Code available at this https URL.

[IR-4] Constraint-Aware Generative Re-ranking for Multi-Objective Optimization in Advertising Feeds

【速读】:该论文旨在解决广告推荐流中重排序(reranking)的约束组合优化问题,即在最大化平台收入的同时保障用户体验,传统生成式排序方法因推理延迟高和约束处理能力有限难以部署。其解决方案的关键在于提出一种约束感知的生成式重排序框架,将约束优化转化为受限神经解码问题:通过统一序列生成与奖励估计为单一网络结构,避免了生成器与评估模型分离的传统范式,并引入约束感知奖励剪枝机制,直接在解码过程中集成约束满足条件,从而高效生成满足多约束的最优排序序列。该方法在大规模工业级广告流上的离线实验和在线A/B测试中均验证了其在提升收入与用户参与度的同时满足严格延迟要求的有效性。

链接: https://arxiv.org/abs/2603.04227
作者: Chenfei Li,Hantao Zhao,Weixi Yao,Ruiming Huang,Rongrong Lu,Geng Tian,Dongying Kong
机构: Bilibili Inc (哔哩哔哩公司)
类目: Information Retrieval (cs.IR)
备注: 14 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Optimizing reranking in advertising feeds is a constrained combinatorial problem, requiring simultaneous maximization of platform revenue and preservation of user experience. Recent generative ranking methods enable listwise optimization via autoregressive decoding, but their deployment is hindered by high inference latency and limited constraint handling. We propose a constraint-aware generative reranking framework that transforms constrained optimization into bounded neural decoding. Unlike prior approaches that separate generator and evaluator models, our framework unifies sequence generation and reward estimation into a single network. We further introduce constraint-aware reward pruning, integrating constraint satisfaction directly into decoding to efficiently generate optimal sequences. Experiments on large-scale industrial feeds and online A/B tests show that our method improves revenue and user engagement while meeting strict latency requirements, providing an efficient neural solution for constrained listwise optimization.
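摘要中"在解码过程中直接集成约束满足条件"的思路可以用如下贪心示意说明:逐槽位填充排序序列,每一步先剪除违反约束的候选,再在可行集中取奖励最高者。其中约束的函数形式、奖励字段以及示例约束本身均为示意性假设,并非论文的实际模型。

```python
def rerank_with_constraints(candidates, slots, constraints):
    """Greedy constrained-decoding sketch: fill `slots` positions one by one,
    pruning candidates whose placement would violate any constraint, then
    taking the highest-reward survivor.

    candidates: list of dicts with 'id', 'reward', and arbitrary attributes.
    constraints: list of callables (prefix, candidate) -> bool (True = allowed).
    """
    sequence, remaining = [], list(candidates)
    for _ in range(slots):
        feasible = [c for c in remaining
                    if all(ok(sequence, c) for ok in constraints)]
        if not feasible:
            break  # no candidate satisfies all constraints for this slot
        best = max(feasible, key=lambda c: c["reward"])
        sequence.append(best)
        remaining.remove(best)
    return sequence

# Example (hypothetical) constraint: no two ads in consecutive slots
def no_adjacent_ads(prefix, cand):
    if not cand.get("is_ad"):
        return True
    return not (prefix and prefix[-1].get("is_ad"))
```

实际系统中每一步的奖励由神经网络估计、候选规模大得多,但"剪枝后再取最优"的解码骨架与此一致。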

[IR-5] SORT: A Systematically Optimized Ranking Transformer for Industrial-scale Recommenders

【速读】:该论文旨在解决Transformer在工业级排序模型中应用受限的问题,主要挑战在于高特征稀疏性和低标签密度导致的训练不稳定与性能瓶颈。解决方案的关键在于提出SORT(Systematically Optimized Ranking Transformer),通过请求中心的样本组织、局部注意力机制、查询剪枝和生成式预训练等优化策略,有效缓解上述问题;同时对分词、多头注意力(Multi-Head Attention, MHA)和前馈网络(Feed-Forward Network, FFN)模块进行系统性改进,提升模型稳定性与容量,并通过训练系统优化将模型浮点运算利用率(Model FLOPs Utilization, MFU)提升至22%,从而实现高效部署与卓越可扩展性。

链接: https://arxiv.org/abs/2603.03988
作者: Chunqi Wang,Bingchao Wu,Taotian Pang,Jiahao Wang,Jie Yang,Jia Liu,Hao Zhang,Hai Zhu,Lei Shen,Shizhun Wang,Bing Wang,Xiaoyi Zeng
机构: Alibaba International Digital Commercial Group(阿里巴巴国际数字商业集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:While Transformers have achieved remarkable success in LLMs through superior scalability, their application in industrial-scale ranking models remains nascent, hindered by the challenges of high feature sparsity and low label density. In this paper, we propose SORT (Systematically Optimized Ranking Transformer), a scalable model designed to bridge the gap between Transformers and industrial-scale ranking models. We address the high feature sparsity and low label density challenges through a series of optimizations, including request-centric sample organization, local attention, query pruning and generative pre-training. Furthermore, we introduce a suite of refinements to the tokenization, multi-head attention (MHA), and feed-forward network (FFN) modules, which collectively stabilize the training process and enlarge the model capacity. To maximize hardware efficiency, we optimize our training system to elevate the model FLOPs utilization (MFU) to 22%. Extensive experiments demonstrate that SORT outperforms strong baselines and exhibits excellent scalability across data size, model size and sequence length, while remaining flexible at integrating diverse features. Finally, online A/B testing in large-scale e-commerce scenarios confirms that SORT achieves significant gains in key business metrics, including orders (+6.35%), buyers (+5.97%) and GMV (+5.47%), while simultaneously halving latency (-44.67%) and doubling throughput (+121.33%).

[IR-6] DisenReason : Behavior Disentanglement and Latent Reasoning for Shared-Account Sequential Recommendation

【速读】:该论文旨在解决共享账户场景下顺序推荐(Shared-account Sequential Recommendation, SSR)中存在的用户数量不确定问题,即现有方法通常假设每个账户有固定数量的潜在用户,难以适应多样化的共享模式,从而影响推荐准确性。其解决方案的关键在于提出一种两阶段推理框架DisenReason:第一阶段从频域角度进行行为解耦,构建统一的账户级行为表示;第二阶段以该表示为锚点,推断账户背后的潜在用户数量,从而将传统“从用户行为中推理偏好”转变为“从账户行为中推理用户”,显著提升了推荐效果,在多个基准数据集上实现了最高达12.56%的MRR@5提升。

链接: https://arxiv.org/abs/2603.03782
作者: Jiawei Cheng,Min Gao,Zongwei Wang,Xiaofei Zhu,Zhiyi Liu,Wentao Li,Wei Li,Huan Wu
机构: Chongqing University (重庆大学); Chongqing University of Technology (重庆理工大学); University of Leicester (莱斯特大学); Tongji University (同济大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 17pages, 5figures

点击查看摘要

Abstract:Shared-account usage is common on streaming and e-commerce platforms, where multiple users share one account. Existing shared-account sequential recommendation (SSR) methods often assume a fixed number of latent users per account, limiting their ability to adapt to diverse sharing patterns and reducing recommendation accuracy. Recent latent reasoning techniques applied in sequential recommendation (SR) generate intermediate embeddings from the user embedding (e.g., the last item embedding) to uncover users’ potential interests, which inspires us to treat the problem of inferring the number of latent users as generating a series of intermediate embeddings, shifting from inferring the preferences behind a user to inferring the users behind an account. However, the last item cannot be directly used for reasoning in SSR, as it can only represent the behavior of the most recent latent user, rather than the collective behavior of the entire account. To address this, we propose DisenReason, a two-stage reasoning method tailored to SSR. DisenReason combines a frequency-domain behavior disentanglement stage, which creates a collective and unified account behavior representation, with a latent user reasoning stage that uses this representation as a pivot to infer the number of users behind the account. Experiments show that DisenReason consistently outperforms all state-of-the-art baselines across four benchmark datasets, achieving relative improvements of up to 12.56% in MRR@5 and 6.06% in Recall@20.

[IR-7] Not All Candidates are Created Equal: A Heterogeneity-Aware Approach to Pre-ranking in Recommender Systems WWW’26

【速读】:该论文针对预排序(pre-ranking)阶段中因训练样本来源异质性(如粗粒度召回结果、细粒度排序信号和曝光反馈)导致的梯度冲突问题展开研究。现有方法不加区分地混合不同来源的样本,造成难样本主导训练过程而易样本被忽视,同时统一模型复杂度分配导致计算资源浪费。解决方案的关键在于提出一种异质性感知的自适应预排序框架(Heterogeneity-Aware Adaptive Pre-ranking, HAP),其核心机制包括:1)通过冲突敏感采样分离易样本与难样本,并设计差异化损失函数以缓解梯度冲突;2)基于样本难度动态分配计算预算——对所有候选者使用轻量模型实现高效覆盖,仅对难样本启用更强模型,在保持精度的同时显著降低计算成本。该方案已在今日头条生产系统部署9个月,效果提升明显且无需额外算力投入。

链接: https://arxiv.org/abs/2603.03770
作者: Pengfei Tong,Siyuan Chen,Chenwei Zhang,Bo Wang,Qi Pi,Pixun Li,Zuotao Liu
机构: ByteDance(字节跳动)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by WWW’26

点击查看摘要

Abstract:Most large-scale recommender systems follow a multi-stage cascade of retrieval, pre-ranking, ranking, and re-ranking. A key challenge at the pre-ranking stage arises from the heterogeneity of training instances sampled from coarse-grained retrieval results, fine-grained ranking signals, and exposure feedback. Our analysis reveals that prevailing pre-ranking methods, which indiscriminately mix heterogeneous samples, suffer from gradient conflicts: hard samples dominate training while easy ones remain underutilized, leading to suboptimal performance. We further show that the common practice of uniformly scaling model complexity across all samples is inefficient, as it overspends computation on easy cases and slows training without proportional gains. To address these limitations, this paper presents Heterogeneity-Aware Adaptive Pre-ranking (HAP), a unified framework that mitigates gradient conflicts through conflict-sensitive sampling coupled with tailored loss design, while adaptively allocating computational budgets across candidates. Specifically, HAP disentangles easy and hard samples, directing each subset along dedicated optimization paths. Building on this separation, it first applies lightweight models to all candidates for efficient coverage, and further engages stronger models on the hard ones, maintaining accuracy while reducing cost. This approach not only improves pre-ranking effectiveness but also provides a practical perspective on scaling strategies in industrial recommender systems. HAP has been deployed in the Toutiao production system for 9 months, yielding up to 0.4% improvement in user app usage duration and 0.05% in active days, without additional computational cost. We also release a large-scale industrial hybrid-sample dataset to enable the systematic study of source-driven candidate heterogeneity in pre-ranking.

[IR-8] AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)代理生态系统中缺乏系统性方法来选择最优部署配置的问题。现有评估体系多孤立地评测组件,且在任务、指标和候选集上高度碎片化,导致无法有效学习如何根据具体查询推荐端到端的代理配置(即骨干模型与工具包的组合)。其解决方案的关键在于提出AgentSelect基准,将代理选择重构为基于能力特征的叙事式查询到代理推荐问题,并系统性地将异构评估数据转化为统一的正向交互数据;该基准包含111,179个查询、107,721个可部署代理及251,103条交互记录,覆盖LLM-only、工具包-only和组合型代理,实验证明其能捕捉从密集头部复用到长尾、近乎一次性监督的范式转变,并揭示内容感知的能力匹配机制的重要性,同时支持可控反事实编辑下的能力敏感行为建模,提升真实组合场景下的覆盖率,且训练模型可在未见过的代理市场(如MuleRun)实现迁移增益,从而首次构建了统一的数据与评估基础设施,为代理推荐研究提供可复现的基础。

链接: https://arxiv.org/abs/2603.03761
作者: Yunxiao Shi,Wujiang Xu,Tingwei Chen,Haoning Shang,Ling Yang,Yunfeng Wan,Zhuo Cao,Xing Zi,Dimitris N. Metaxas,Min Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: under review by conference

点击查看摘要

Abstract:LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelect comprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that Part III synthesized compositional interactions are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, which establishes a reproducible foundation to study and accelerate the emerging agent ecosystem.

[IR-9] SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLM)的对话推荐系统(Conversational Recommender Systems, CRS)中存在的个性化安全漏洞问题,即在用户对话中隐式推断出个体敏感信息(如创伤触发因素、自伤史或恐惧症)后,系统仍可能生成违反这些个性化安全约束的推荐内容。解决方案的关键在于提出SafeCRS框架,其核心是融合Safe Supervised Fine-Tuning(Safe-SFT)与Safe Group reward-Decoupled Normalization Policy Optimization(Safe-GDPO),通过联合优化推荐质量与个性化安全对齐,显著降低安全违规率(最高达96.5%),同时保持推荐性能竞争力。

链接: https://arxiv.org/abs/2603.03536
作者: Haochang Hao,Yifan Xu,Xinzhuo Li,Yingqiang Ge,Lu Cheng
机构: University of Illinois at Chicago (伊利诺伊大学芝加哥分校); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:Current LLM-based conversational recommender systems (CRS) primarily optimize recommendation accuracy and user satisfaction. We identify an underexplored vulnerability in which recommendation outputs may negatively impact users by violating personalized safety constraints, when individualized safety sensitivities – such as trauma triggers, self-harm history, or phobias – are implicitly inferred from the conversation but not respected during recommendation. We formalize this challenge as personalized CRS safety and introduce SafeRec, a new benchmark dataset designed to systematically evaluate safety risks in LLM-based CRS under user-specific constraints. To further address this problem, we propose SafeCRS, a safety-aware training framework that integrates Safe Supervised Fine-Tuning (Safe-SFT) with Safe Group reward-Decoupled Normalization Policy Optimization (Safe-GDPO) to jointly optimize recommendation quality and personalized safety alignment. Extensive experiments on SafeRec demonstrate that SafeCRS reduces safety violation rates by up to 96.5% relative to the strongest recommendation-quality baseline while maintaining competitive recommendation quality. Warning: This paper contains potentially harmful and offensive content.

[IR-10] Graph Hopfield Networks: Energy-Based Node Classification with Associative Memory ICLR

【速读】:该论文旨在解决图神经网络在节点分类任务中面临的关键挑战:如何有效融合记忆检索与图结构平滑机制,以提升模型在稀疏网络和特征遮蔽等复杂场景下的性能。其解决方案的核心是提出Graph Hopfield Networks,通过设计一个联合能量函数,将联想记忆检索(associative memory retrieval)与图拉普拉斯平滑(graph Laplacian smoothing)耦合起来;梯度下降优化该能量函数产生一种迭代更新机制,交替执行Hopfield记忆检索与拉普拉斯传播。这种架构不仅在稀疏引用网络上带来最高达2.0个百分点(pp)的性能提升,在特征遮蔽下还能额外获得5 pp的鲁棒性增强,且其内在的迭代能量下降结构本身即构成强归纳偏置,无需修改架构即可实现异质性(heterophilous)图数据的锐化(sharpening)。

链接: https://arxiv.org/abs/2603.03464
作者: Abinav Rao,Alex Wa,Rishi Athavale
机构: Stanford University (斯坦福大学); Google (谷歌)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 Pages, 4 Figures, Accepted at ICLR NFAM Workshop 2026

点击查看摘要

Abstract:We introduce Graph Hopfield Networks, whose energy function couples associative memory retrieval with graph Laplacian smoothing for node classification. Gradient descent on this joint energy yields an iterative update interleaving Hopfield retrieval with Laplacian propagation. Memory retrieval provides regime-dependent benefits: up to 2.0 pp on sparse citation networks and up to 5 pp additional robustness under feature masking; the iterative energy-descent architecture itself is a strong inductive bias, with all variants (including the memory-disabled NoMem ablation) outperforming standard baselines on Amazon co-purchase graphs. Tuning enables graph sharpening for heterophilous benchmarks without architectural changes.
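
摘要中“Hopfield 检索与拉普拉斯传播交替进行”的迭代更新,可以用如下 NumPy 草图示意。检索部分采用现代 Hopfield 网络常见的 softmax 形式,beta、alpha 与迭代次数均为假设超参数,并非论文能量函数的精确梯度:

```python
import numpy as np

def hopfield_retrieve(X, M, beta=2.0):
    # 现代 Hopfield 检索:对存储模式 M 做 softmax 注意力
    A = X @ M.T * beta
    A -= A.max(axis=1, keepdims=True)
    W = np.exp(A)
    W /= W.sum(axis=1, keepdims=True)
    return W @ M

def laplacian_step(X, adj, alpha=0.5):
    # 一步图拉普拉斯平滑(行归一化邻接矩阵)
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1e-8)
    return (1 - alpha) * X + alpha * (P @ X)

def graph_hopfield(X, M, adj, steps=5):
    # 联合能量下降 ≈ 记忆检索与图平滑交替迭代
    for _ in range(steps):
        X = hopfield_retrieve(X, M)
        X = laplacian_step(X, adj)
    return X
```

检索步把节点特征拉向最相近的记忆模式,平滑步再让相邻节点表示趋于一致,两者交替即对应联合能量的下降方向。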

[IR-11] MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长时间任务中维持高效长时记忆的挑战,现有方法普遍存在成本与准确率之间的权衡问题:简单存储策略难以有效检索相关信息,而复杂索引机制(如记忆图)则计算开销大且易导致信息丢失;同时,依赖主LLM处理全部记忆会显著增加推理延迟和资源消耗。解决方案的关键在于提出MemSifter框架,其核心创新是将记忆检索过程卸载至一个小型代理模型(proxy model),由该模型在任务执行前进行推理并筛选必要记忆,从而避免加重主LLM负担。为优化代理模型,作者引入一种面向记忆的强化学习(Reinforcement Learning, RL)训练范式,设计基于主LLM任务完成效果的任务导向奖励函数,通过多轮交互量化检索记忆的实际贡献,并以阶梯式递减方式区分不同排名的记忆相关性。此外,结合课程学习(Curriculum Learning)和模型融合(Model Merging)等技术进一步提升性能。实验证明,MemSifter在多个LLM记忆基准测试中实现了优于或相当的检索准确性和任务完成率,提供了一种高效、可扩展的长时记忆管理方案。

链接: https://arxiv.org/abs/2603.03379
作者: Jiejun Tan,Zhicheng Dou,Liancheng Zhang,Yuyang Hu,Yiruo Cheng,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Code and datasets are available at this https URL

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly used for long-duration tasks, maintaining effective long-term memory has become a critical challenge. Current methods often face a trade-off between cost and accuracy. Simple storage methods often fail to retrieve relevant information, while complex indexing methods (such as memory graphs) require heavy computation and can cause information loss. Furthermore, relying on the working LLM to process all memories is computationally expensive and slow. To address these limitations, we propose MemSifter, a novel framework that offloads the memory retrieval process to a small-scale proxy model. Instead of increasing the burden on the primary working LLM, MemSifter uses a smaller model to reason about the task before retrieving the necessary information. This approach requires no heavy computation during the indexing phase and adds minimal overhead during inference. To optimize the proxy model, we introduce a memory-specific Reinforcement Learning (RL) training paradigm. We design a task-outcome-oriented reward based on the working LLM’s actual performance in completing the task. The reward measures the actual contribution of retrieved memories by multiple interactions with the working LLM, and discriminates retrieved rankings by stepped decreasing contributions. Additionally, we employ training techniques such as Curriculum Learning and Model Merging to improve performance. We evaluated MemSifter on eight LLM memory benchmarks, including Deep Research tasks. The results demonstrate that our method meets or exceeds the performance of existing state-of-the-art approaches in both retrieval accuracy and final task completion. MemSifter offers an efficient and scalable solution for long-term LLM memory. We have open-sourced the model weights, code, and training data to support further research.
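
摘要中“按检索排名以阶梯递减方式折算贡献”的奖励设计,可以用如下一行式草图示意(衰减系数 decay 与函数名均为本文假设,并非论文的原始公式):

```python
def ranked_reward(contributions, decay=0.5):
    # contributions[i]:第 i 名被检索记忆对任务完成的实测贡献
    # (由与工作 LLM 的多轮交互估计);排名越靠后权重按阶梯递减
    return sum(c * decay ** i for i, c in enumerate(contributions))
```

这种加权方式促使代理模型把最有用的记忆排在检索结果前列。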

[IR-12] Combating data scarcity in recommendation services: Integrating cognitive types of VARK and neural network technologies (LLM)

【速读】:该论文旨在解决推荐系统中的冷启动(cold start)问题,即在用户缺乏交互历史或物品 metadata 稀疏的情况下难以生成有效推荐。其解决方案的关键在于提出一个融合大型语言模型(Large Language Models, LLMs)与基于 VARK(Visual, Auditory, Reading/Writing, Kinesthetic)学习偏好认知建模的混合框架:通过 LLM 对物品内容进行语义增强和知识图谱构建,结合最小数据量下的用户认知特征画像,实现动态调整推荐呈现形式,并通过图增强检索与迭代学习机制提升个性化和可解释性推荐能力。

链接: https://arxiv.org/abs/2603.03309
作者: Nikita Zmanovskii
机构: Russian Biotechnological University (ROSBIOTECH)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 18 pages, 2 tables

点击查看摘要

Abstract:Cold start scenarios present fundamental obstacles to effective recommendation generation, particularly when dealing with users lacking interaction history or items with sparse metadata. This research proposes an innovative hybrid framework that leverages Large Language Models (LLMs) for content semantic analysis and knowledge graph development, integrated with cognitive profiling based on VARK (Visual, Auditory, Reading/Writing, Kinesthetic) learning preferences. The proposed system tackles multiple cold start dimensions: enriching inadequate item descriptions through LLM processing, generating user profiles from minimal data, and dynamically adjusting presentation formats based on cognitive assessment. The framework comprises six integrated components: semantic metadata enhancement, dynamic graph construction, VARK-based profiling, mental state estimation, graph-enhanced retrieval with LLM-powered ranking, and adaptive interface design with iterative learning. Experimental validation on MovieLens-1M dataset demonstrates the system’s capacity for personalized recommendation generation despite limited initial information. This work establishes groundwork for cognitively-aware recommendation systems capable of overcoming cold start limitations through semantic comprehension and psychological modeling, offering personalized, explainable recommendations from initial user contact.

[IR-13] Developing an AI Assistant for Knowledge Management and Workforce Training in State DOTs

【速读】:该论文旨在解决州交通机构在知识管理与决策支持方面面临的挑战,包括传统静态文档、课堂培训和非正式师徒制导致的知识碎片化、效率低下以及资深工程师退休引发的专业知识流失问题;同时,面对海量技术手册、指南和研究报告,工程师难以快速准确获取所需信息,从而影响现场问题处理和新员工培训效率。解决方案的关键在于提出一种基于多智能体架构的检索增强生成(Retrieval-Augmented Generation, RAG)框架,通过多个专业化智能体(检索、生成、评估、查询优化)实现迭代改进与质量控制,并引入开放权重视觉-语言模型将技术图表转化为语义文本表示,使图示知识可被索引与检索,最终由开放权重大语言模型(Large Language Model, LLM)基于检索到的文本和图像上下文生成精准响应,显著提升知识获取效率与决策准确性。

链接: https://arxiv.org/abs/2603.03302
作者: Divija Amaram,Lu Gao,Gowtham Reddy Gudla,Tejaswini Sanjay Katale
机构: University of Houston (休斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Effective knowledge management is critical for preserving institutional expertise and improving the efficiency of workforce training in state transportation agencies. Traditional approaches, such as static documentation, classroom-based instruction, and informal mentorship, often lead to fragmented knowledge transfer, inefficiencies, and the gradual loss of expertise as senior engineers retire. Moreover, given the enormous volume of technical manuals, guidelines, and research reports maintained by these agencies, it is increasingly challenging for engineers to locate relevant information quickly and accurately when solving field problems or preparing for training tasks. These limitations hinder timely decision-making and create steep learning curves for new personnel in maintenance and construction operations. To address these challenges, this paper proposes a Retrieval-Augmented Generation (RAG) framework with a multi-agent architecture to support knowledge management and decision making. The system integrates structured document retrieval with real-time, context-aware response generation powered by a large language model (LLM). Unlike conventional single-pass RAG systems, the proposed framework employs multiple specialized agents for retrieval, answer generation, evaluation, and query refinement, which enables iterative improvement and quality control. In addition, the system incorporates an open-weight vision-language model to convert technical figures into semantic textual representations, which allows figure-based knowledge to be indexed and retrieved alongside text. Retrieved text and figure-based context are then provided to an open-weight large language model, which generates the final responses grounded in the retrieved evidence.
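
摘要描述的“检索—生成—评估—查询精炼”多智能体循环,其控制流可以用如下草图示意(四个函数均为占位,实际系统中分别由对应的专业化智能体实现):

```python
def agentic_rag(query, retrieve, generate, evaluate, refine, max_iter=3):
    """迭代式 RAG:评估不通过时精炼查询并重试,而非单次直通。"""
    answer = None
    for _ in range(max_iter):
        docs = retrieve(query)                        # 检索智能体
        answer = generate(query, docs)                # 生成智能体
        ok, feedback = evaluate(query, answer, docs)  # 评估智能体
        if ok:
            return answer
        query = refine(query, feedback)               # 查询精炼智能体
    return answer
```

与单趟 RAG 的区别在于:评估智能体的反馈会反哺查询,使检索在多轮内逐步逼近工程师真正需要的手册段落。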

[IR-14] PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在复杂环境中长期记忆不足的问题,现有记忆设计要么任务特定且不可迁移,要么任务无关但因信息冗余和上下文爆炸导致效果不佳。其解决方案的关键在于提出 PlugMem——一个无需针对具体任务重新设计的、任务无关的插件式记忆模块,通过借鉴认知科学原理,将情景记忆结构化为紧凑、可扩展的知识中心记忆图(knowledge-centric memory graph),显式表示命题知识(propositional knowledge)与规范性知识(prescriptive knowledge),从而实现基于任务相关知识的高效检索与推理,而非原始轨迹的冗长文本匹配;该方法区别于 GraphRAG 等图结构方法,以“知识”作为记忆访问与组织的基本单元,显著提升信息密度与泛化能力。

链接: https://arxiv.org/abs/2603.03296
作者: Ke Yang,Zixi Chen,Xuan He,Jize Jiang,Michel Galley,Chenglong Wang,Jianfeng Gao,Jiawei Han,ChengXiang Zhai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Long-term memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task-specific and non-transferable, or task-agnostic but less effective due to low task-relevance and context explosion from raw memory retrieval. We propose PlugMem, a task-agnostic plugin memory module that can be attached to arbitrary LLM agents without task-specific redesign. Motivated by the fact that decision-relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge-centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task-relevant knowledge, rather than verbose raw trajectories, and departs from other graph-based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long-horizon conversational question answering, multi-hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task-agnostic baselines and exceeds task-specific memory designs, while also achieving the highest information density under a unified information-theoretic analysis. Code and data are available at this https URL.

[IR-15] From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在医疗问答任务中因幻觉(hallucination)和过时知识带来的高风险问题,同时克服现有检索增强生成(Retrieval-Augmented Generation, RAG)方法依赖噪声级标记信号且缺乏多轮精炼能力的局限性。其解决方案的关键在于提出MA-RAG(Multi-Round Agentic RAG)框架,通过代理驱动的迭代优化机制,在测试时实现复杂医学推理的逐步提升:每轮迭代中,代理将候选回答间的语义冲突转化为可操作的查询以获取外部证据,并优化内部推理历史以缓解长上下文退化;该方法扩展了自一致性(self-consistency)原则,将不一致性作为主动信号引导多轮推理与检索,并模拟一种“提升”机制,持续最小化残差误差直至达成稳定、高保真的医学共识。

链接: https://arxiv.org/abs/2603.03292
作者: Wenhao Wu,Zhentao Tang,Yafu Li,Shixiong Kai,Mingxuan Yuan,Zhenhong Sun,Chunlin Chen,Zhi Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 22 pages, 7 figures, 11 tables

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing history reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical QA benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering substantial +6.8 points on average accuracy over the backbone model. Our code is available at this https URL.
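
“以候选回答之间的不一致作为主动信号,驱动多轮检索与推理”这一核心循环,可以用如下极简草图示意(采样次数 n、一致性阈值 tau 均为假设参数,llm_sample 与 retrieve 为占位函数):

```python
from collections import Counter

def consensus(candidates):
    # 自一致性:多数票答案及其一致率
    counts = Counter(candidates)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(candidates)

def ma_rag(question, evidence, llm_sample, retrieve, rounds=3, tau=0.8, n=5):
    answer = None
    for _ in range(rounds):
        cands = [llm_sample(question, evidence) for _ in range(n)]
        answer, agree = consensus(cands)
        if agree >= tau:          # 达成稳定共识,提前终止
            return answer
        # 存在冲突:将分歧转化为检索查询,补充外部证据再进入下一轮
        evidence = evidence + retrieve(question, cands)
    return answer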

[IR-16] AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents

【速读】:该论文旨在解决长时程大语言模型(Large Language Model, LLM)代理在固定上下文预算下面临的两个核心挑战:一是断连证据(disconnected evidence),即多跳问答需要链接跨时间分布的事实;二是状态更新(state updates),即随时间演化的信息(如日程变更)与静态日志产生冲突。解决方案的关键在于提出一种结构化记忆系统 AriadneMem,其采用解耦的两阶段流水线:在离线构建阶段,通过熵感知门控过滤低信息量消息,并利用冲突感知粗化合并重复静态记录,同时保留状态变迁作为时间边;在在线推理阶段,通过算法桥梁发现重建缺失逻辑路径,并进行拓扑感知单次调用合成。该设计显著提升多跳准确率(Multi-Hop F1 提升 15.2%)和平均 F1(提升 9.0%),同时将总运行时间减少 77.8%,仅需 497 个上下文 token。

链接: https://arxiv.org/abs/2603.03290
作者: Wenhui Zhu,Xiwen Chen,Zhipeng Wang,Jingjing Wang,Xuanzhao Dong,Minzhou Huang,Rui Cai,Hejian Sang,Hao Wang,Peijie Qiu,Yueyue Deng,Prayag Tiwari,Brendan Hogan Rappazzo,Yalin Wang
机构: Arizona State University (亚利桑那州立大学); Morgan Stanley (摩根士丹利); Rice University (莱斯大学); Clemson University (克莱姆森大学); Northwestern University (西北大学); UC Davis (加州大学戴维斯分校); Iowa State University (爱荷华州立大学); Washington University in St. Louis (圣路易斯华盛顿大学); Columbia University (哥伦比亚大学); Halmstad University (哈尔姆斯塔德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-horizon LLM agents require memory systems that remain accurate under fixed context budgets. However, existing systems struggle with two persistent challenges in long-term dialogue: (i) disconnected evidence, where multi-hop answers require linking facts distributed across time, and (ii) state updates, where evolving information (e.g., schedule changes) creates conflicts with older static logs. We propose AriadneMem, a structured memory system that addresses these failure modes via a decoupled two-phase pipeline. In the offline construction phase, AriadneMem employs entropy-aware gating to filter noise and low-information messages before LLM extraction and applies conflict-aware coarsening to merge static duplicates while preserving state transitions as temporal edges. In the online reasoning phase, rather than relying on expensive iterative planning, AriadneMem executes algorithmic bridge discovery to reconstruct missing logical paths between retrieved facts, followed by single-call topology-aware synthesis. On LoCoMo experiments with GPT-4o, AriadneMem improves Multi-Hop F1 by 15.2% and Average F1 by 9.0% over strong baselines. Crucially, by offloading reasoning to the graph layer, AriadneMem reduces total runtime by 77.8% using only 497 context tokens. The code is available at this https URL.
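
摘要中的“熵感知门控”(entropy-aware gating)可以用词元分布的香农熵做一个极简示意:低熵消息(如反复的“ok”)信息量低,可在 LLM 抽取前过滤掉。分词方式与阈值 threshold 均为本文假设,并非论文原始实现:

```python
import math
from collections import Counter

def token_entropy(text):
    # 以空白切分的词元分布的香农熵(单位:bit)
    toks = text.lower().split()
    if not toks:
        return 0.0
    counts = Counter(toks)
    n = len(toks)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_gate(messages, threshold=1.5):
    # 仅保留词元熵超过阈值(信息量足够)的消息
    return [m for m in messages if token_entropy(m) > threshold]
```

这一步不需要调用 LLM,因此可以廉价地挡在昂贵的抽取阶段之前。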

[IR-17] Stringology-Based Motif Discovery from EEG Signals: an ADHD Case Study

【速读】:该论文旨在解决如何定量分析脑电图(EEG)时间序列中重复性时空模式的问题,以揭示神经信号动态的结构性特征。传统方法多依赖于谱分析或全局复杂度指标,难以捕捉局部时序结构的细微差异。解决方案的关键在于引入字符串学(stringology)中的有序保持匹配(order-preserving matching, OPM)与笛卡尔树匹配(Cartesian tree matching, CTM)算法,从而识别并量化保留相对顺序和层级结构的时序基序(temporal motifs),且对幅度缩放具有不变性。该框架首次将OPM和CTM应用于多通道EEG数据,发现注意缺陷多动障碍(ADHD)群体存在更高频率、更短长度、更大梯度不稳定性以及更低层级复杂性的基序特征,表明其神经活动在结构、稳定性和组织层次上存在系统性异常。

链接: https://arxiv.org/abs/2603.03476
作者: Anat Dahan,Samah Ghazawi
机构: Braude College of Engineering, Karmiel, Israel
类目: Neurons and Cognition (q-bio.NC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We propose a novel computational framework for analyzing electroencephalography (EEG) time series using methods from stringology, the study of efficient algorithms for string processing, to systematically identify and characterize recurrent temporal patterns in neural signals. The primary aim is to introduce quantitative measures to understand neural signal dynamics, with the present findings serving as a proof-of-concept. The framework adapts order-preserving matching (OPM) and Cartesian tree matching (CTM) to detect temporal motifs that preserve relative ordering and hierarchical structure while remaining invariant to amplitude scaling. This approach provides a temporally precise representation of EEG dynamics that complements traditional spectral and global complexity analyses. To evaluate its utility, we applied the framework to multichannel EEG recordings from individuals with attention-deficit/hyperactivity disorder (ADHD) and matched controls using a publicly available dataset. Highly recurrent, group-specific motifs were extracted and quantified using both OPM and CTM. The ADHD group exhibited significantly higher motif frequencies, suggesting increased repetitiveness in neural activity. OPM analysis revealed shorter motif lengths and greater gradient instability in ADHD, reflected in larger mean and maximal inter-sample amplitude changes. CTM analysis further demonstrated reduced hierarchical complexity in ADHD, characterized by shallower tree structures and fewer hierarchical levels despite comparable motif lengths. These findings suggest that ADHD-related EEG alterations involve systematic differences in the structure, stability, and hierarchical organization of recurrent temporal patterns. The proposed stringology-based motif framework provides a complementary computational tool with potential applications for objective biomarker development in neurodevelopmental disorders.
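
摘要中两种字符串学匹配的核心判定可以各用几行 Python 示意(假设序列元素互不相同;cartesian_signature 用栈构造时的弹出计数序列编码笛卡尔树形状,与论文使用的具体算法未必一致):

```python
def op_isomorphic(x, y):
    # 有序保持匹配(OPM):两序列各位置的相对排序完全一致
    if len(x) != len(y):
        return False
    rank = lambda s: sorted(range(len(s)), key=s.__getitem__)
    return rank(x) == rank(y)

def cartesian_signature(s):
    # 栈式构造笛卡尔树:记录每个元素入栈时弹出的元素个数,
    # 该弹出计数序列唯一决定树的形状
    stack, sig = [], []
    for v in s:
        popped = 0
        while stack and stack[-1] > v:
            stack.pop()
            popped += 1
        sig.append(popped)
        stack.append(v)
    return sig

def ct_match(x, y):
    # 笛卡尔树匹配(CTM):两序列的笛卡尔树形状相同
    return cartesian_signature(x) == cartesian_signature(y)
```

两者都对幅度缩放不变;CTM 只保留由局部极小值决定的层级结构,因而比 OPM 更宽松,可以匹配相对排序不同但层级形状相同的片段。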

人机交互

[HC-0] Scrollytelling as an Alternative Format for Privacy Policies

【速读】:该论文试图解决隐私政策(privacy policies)文本冗长、复杂且极少被用户阅读的问题,从而限制了其在知情同意中的有效性。解决方案的关键在于提出并验证“滚动叙事”(scrollytelling)这一基于滚动驱动的叙事呈现方式,通过将完整的隐私政策文本与动画视觉元素交错呈现,构建动态的阅读体验。实验结果表明,该方法在提升用户参与度、降低认知负荷、增强格式接受意愿和感知清晰度方面优于纯文本形式,同时保持了与其他格式相当的理解准确性和信心水平,为改善隐私政策的可访问性与用户体验提供了有效路径。

链接: https://arxiv.org/abs/2603.04367
作者: Gonzalo Gabriel Méndez,Jose Such
机构: Universitat Politècnica de València (瓦伦西亚理工大学); Inria (法国国家信息与自动化研究院); INGENIO (CSIC-Universitat Politècnica de València) (西班牙科学委员会-瓦伦西亚理工大学联合研究所)
类目: Human-Computer Interaction (cs.HC)
备注: To appear in CHI2026

点击查看摘要

Abstract:Privacy policies are long, complex, and rarely read, which limits their effectiveness in informed consent. We investigate scrollytelling, a scroll-driven narrative approach, as a privacy policy presentation format. We built a prototype that interleaves the full policy text with animated visuals to create a dynamic reading experience. In an online study (N=454), we compared our tool against text, two nutrition-label variants, and a standalone interactive visualization. Scrollytelling improved user experience over text, yielding higher engagement, lower cognitive load, greater willingness to adopt the format, and increased perceived clarity. It also matched other formats on comprehension accuracy and confidence, with only one nutrition-label variant performing slightly better. Changes in perceived understanding, transparency, and trust were small and statistically inconclusive. These findings suggest that scrollytelling can preserve comprehension while enhancing the experience of policy reading. We discuss design implications for accessible policy communication and identify directions for increasing transparency and user trust.

[HC-1] LikeThis! Empowering App Users to Submit UI Improvement Suggestions Instead of Complaints ICSE’26

【速读】:该论文旨在解决移动应用中用户反馈质量低的问题,即用户常提交模糊、无信息量或破坏性的反馈,难以指导开发人员进行有效改进。为应对这一挑战,作者提出了一种基于生成式 AI (Generative AI) 的解决方案 LikeThis!,其关键在于:首先通过分析用户评论和对应截图,自动生成多个具体的设计改进建议;其次引入“解决方案规格说明”(solution specification)作为中间步骤,在生成可视化设计草图前明确改进方向,从而在保持界面一致性的同时避免引入新问题。实验表明,该方法显著提升了反馈的可理解性和可操作性,促进了用户与开发者之间的协作效率。

链接: https://arxiv.org/abs/2603.04245
作者: Jialiang Wei,Ali Ebrahimi Pourasad,Walid Maalej
机构: University of Hamburg(汉堡大学)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE '26)

点击查看摘要

Abstract:User feedback is crucial for the evolution of mobile apps. However, research suggests that users tend to submit uninformative, vague, or destructive feedback. Unlike recent AI4SE approaches that focus on generating code and other development artifacts, our work aims at empowering users to submit better and more constructive UI feedback with concrete suggestions on how to improve the app. We propose LikeThis!, a GenAI-based approach that takes a user comment with the corresponding screenshot to immediately generate multiple improvement alternatives, from which the user can easily choose their preferred option. To evaluate LikeThis!, we first conducted a model benchmarking study based on a public dataset of carefully critiqued UI designs. The results show that GPT-Image-1 significantly outperformed three other state-of-the-art image generation models in improving the designs to address UI issues while keeping the fidelity and without introducing new issues. An intermediate step in LikeThis! is to generate a solution specification before sketching the design as a key to achieving effective improvement. Second, we conducted a user study with 10 production apps, where 15 users used LikeThis! to submit their feedback on encountered issues. Later, the developers of the apps assessed the understandability and actionability of the feedback with and without generated improvements. The results show that our approach helps generate better feedback from both user and developer perspectives, paving the way for AI-assisted user-developer collaboration.

[HC-2] FeedAIde: Guiding App Users to Submit Rich Feedback Reports by Asking Context-Aware Follow-Up Questions

【速读】:该论文旨在解决移动应用用户反馈与开发者需求之间存在的信息不对称问题,即用户提交的反馈常缺乏具体情境细节,导致报告不完整且需额外沟通澄清,从而降低开发效率。解决方案的关键在于提出一种名为FeedAIde的上下文感知、交互式反馈机制,其核心是利用多模态大语言模型(Multimodal Large Language Models)的推理能力,在用户报告过程中主动捕获上下文信息(如截图),并据此生成自适应追问,引导用户协同完善反馈内容,从而显著提升反馈报告的完整性与对开发者的实用价值。

链接: https://arxiv.org/abs/2603.04244
作者: Ali Ebrahimi Pourasad,Meyssam Saghiri,Walid Maalej
机构: University of Hamburg(汉堡大学)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted for publication at the 13th International Conference on Mobile Software Engineering and Systems (MOBILESoft) 2026

点击查看摘要

Abstract:User feedback is essential for the success of mobile apps, yet what users report and what developers need often diverge. Research shows that users often submit vague feedback and omit essential contextual details. This leads to incomplete reports and time-consuming clarification discussions. To overcome this challenge, we propose FeedAIde, a context-aware, interactive feedback approach that supports users during the reporting process by leveraging the reasoning capabilities of Multimodal Large Language Models. FeedAIde captures contextual information, such as the screenshot where the issue emerges, and uses it for adaptive follow-up questions to collaboratively refine with the user a rich feedback report that contains information relevant to developers. We implemented an iOS framework of FeedAIde and evaluated it on a gym’s app with its users. Compared to the app’s simple feedback form, participants rated FeedAIde as easier and more helpful for reporting feedback. An assessment by two industry experts of the resulting 54 reports showed that FeedAIde improved the quality of both bug reports and feature requests, particularly in terms of completeness. The findings of our study demonstrate the potential of context-aware, GenAI-powered feedback reporting to enhance the experience for users and increase the information value for developers.

[HC-3] Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning

【速读】:该论文旨在解决始终开启的自指视角摄像头(egocentric cameras)在可穿戴设备存储与电池限制下,视频流中冗余帧过多、质量参差不齐的问题,从而提升数据采集效率。解决方案的关键在于利用现代眼动追踪头显提供的无训练侧通道信号,将帧选择分为两个互补维度:通过注视固定(gaze fixation)筛选视觉稳定性(质量),再基于瞳孔反应(pupil response)对保留帧进行新颖性排序,形成双准则帧筛选器(Dual-Criterion Frame Curator)。该方法无需模型推理且在采集时实时运行,显著提升了低预算下的数据质量与任务性能。

链接: https://arxiv.org/abs/2603.04098
作者: Ajan Subramanian,Sumukh Bettadapura,Rohan Sathish
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 14 pages, 4 figures, 3 tables, plus supplementary material

点击查看摘要

Abstract:Always-on egocentric cameras are increasingly used as demonstrations for embodied robotics, imitation learning, and assistive AR, but the resulting video streams are dominated by redundant and low-quality frames. Under the storage and battery constraints of wearable devices, choosing which frames to keep is as important as how to learn from them. We observe that modern eye-tracking headsets provide a continuous, training-free side channel that decomposes into two complementary axes: gaze fixation captures visual stability (quality), while pupil response captures arousal-linked moments (novelty). We operationalize this insight as a Dual-Criterion Frame Curator that first gates frames by gaze quality and then ranks the survivors by pupil-derived novelty. On the Visual Experience Dataset (VEDB), curated frames at 10% budget match the classification performance of the full stream, and naive signal fusion consistently destroys both contributions. The benefit is task-dependent: pupil ranking improves activity recognition, while gaze-only selection already dominates for scene recognition, confirming that the two signals serve genuinely different roles. Our method requires no model inference and operates at capture time, offering a path toward efficient, always-on egocentric data curation.
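摘要中的双准则筛选流程(先按注视稳定性门控、再按瞳孔新颖性排序、按预算截断)可以用如下最小示意实现(纯 Python;阈值、字段名与数据均为假设,并非论文原始实现):

```python
# 双准则帧筛选示意(假设性实现,非论文原始代码)

def curate_frames(frames, gaze_threshold=0.8, budget=0.1):
    """frames: [{'id', 'gaze_stability', 'pupil_novelty'}, ...];按预算比例保留帧。"""
    # 第一步(质量门控):仅保留注视足够稳定的帧
    stable = [f for f in frames if f['gaze_stability'] >= gaze_threshold]
    # 第二步(新颖性排序):对幸存帧按瞳孔新颖性降序排列
    ranked = sorted(stable, key=lambda f: f['pupil_novelty'], reverse=True)
    # 按预算截断(budget=0.1 对应论文中的 10% 预算)
    k = max(1, int(len(frames) * budget))
    return ranked[:k]

frames = [
    {'id': 0, 'gaze_stability': 0.90, 'pupil_novelty': 0.2},
    {'id': 1, 'gaze_stability': 0.50, 'pupil_novelty': 0.9},  # 注视不稳定,被门控剔除
    {'id': 2, 'gaze_stability': 0.95, 'pupil_novelty': 0.7},
    {'id': 3, 'gaze_stability': 0.85, 'pupil_novelty': 0.4},
]
selected = curate_frames(frames, budget=0.5)
print([f['id'] for f in selected])  # → [2, 3]
```

将门控与排序解耦,使两种信号各司其职:注视负责“质量”,瞳孔负责“新颖性”,这与论文观察到的两种信号作用不同、任务依赖的结论一致。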

[HC-4] EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR CVPR2026

【速读】:该论文旨在解决**第一人称视角下人体姿态估计(egocentric human motion estimation)**的挑战,包括因视角限制导致的身体覆盖不足、频繁遮挡以及标注数据稀缺等问题。其解决方案的关键在于两个核心贡献:一是提出EgoPoseFormer v2模型,采用基于Transformer的架构实现时空一致且空间定位准确的姿态估计,通过身份条件查询(identity-conditioned queries)、多视角空间精修(multi-view spatial refinement)、因果时序注意力(causal temporal attention)等机制,在恒定计算预算下支持关键点与参数化身体表示;二是构建一个不确定性感知的自标签系统(uncertainty-aware semi-supervised auto-labeling system),利用教师-学生框架生成伪标签并结合不确定性蒸馏(uncertainty distillation),使模型能从数千万未标注帧中学习,显著提升泛化能力与精度。

链接: https://arxiv.org/abs/2603.04090
作者: Zhenyu Li,Sai Kumar Dwivedi,Filip Maric,Carlos Chacon,Nadine Bertsch,Filippo Arcadu,Tomas Hodan,Michael Ramamonjisoa,Peter Wonka,Amy Zhao,Robin Kips,Cem Keskin,Anastasia Tkach,Chenhongyi Yang
机构: Meta; KAUST; Max Planck Institute for Intelligent Systems
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with a 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. Furthermore, our auto-labeling system further improves the wrist MPJPE by 13.1%.

[HC-5] The Empty Quadrant: AI Teammates for Embodied Field Learning

【速读】:该论文试图解决当前人工智能教育(AIED)研究中长期存在的“静坐假设”(Sedentary Assumption)问题,即AI系统设计普遍局限于静态学习者面对屏幕的场景,忽视了学习者在真实物理空间中的动态、具身化和情境化认知过程。现有移动学习与情境感知系统虽将学习带入实体环境,但AI仍被定位为信息传递工具而非认知伙伴(epistemic partner)。解决方案的关键在于提出“Field Atlas”框架,其核心是基于具身、嵌入、主动和扩展认知(4E cognition)、主动推理(active inference)及双编码理论(dual coding theory),将AIED的指导隐喻从“教学”转向“意义建构”(sensemaking)。该框架通过自愿摄影与即时语音反思的配对、AI仅作为苏格拉底式提问者而非答案提供者,以及应用认知轨迹建模(Epistemic Trajectory Modeling, ETM)来刻画学习者在物理-认知空间中的连续轨迹,从而实现以过程为导向的评估,生成结构上难以被AI伪造的实证证据,推动AIED向野外环境中人机协同的意义建构方向演进。

链接: https://arxiv.org/abs/2603.04034
作者: Hyein Kim,Sung Park
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:For four decades, AIED research has rested on what we term the Sedentary Assumption: the unexamined design commitment to a stationary learner seated before a screen. Mobile learning and museum guides have moved learners into physical space, and context-aware systems have delivered location-triggered content – yet these efforts predominantly cast AI in the role of information-delivery tool rather than epistemic partner. We map this gap through a 2 x 2 matrix (AI Role x Learning Environment) and identify an undertheorized intersection: the configuration in which AI serves as an epistemic teammate during unstructured, place-bound field inquiry and learning is assessed through trajectory rather than product. To fill it, we propose Field Atlas, a framework grounded in embodied, embedded, enactive, and extended (4E) cognition, active inference, and dual coding theory that shifts AIED’s guiding metaphor from instruction to sensemaking. The architecture pairs volitional photography with immediate voice reflection, constrains AI to Socratic provocation rather than answer delivery, and applies Epistemic Trajectory Modeling (ETM) to represent field learning as a continuous trajectory through conjoined physical-epistemic space. We demonstrate the framework through a museum scenario and argue that the resulting trajectories – bound to a specific body, place, and time – constitute process-based evidence structurally resistant to AI fabrication, offering a new assessment paradigm and reorienting AIED toward embodied, dialogic human-AI sensemaking in the wild.

[HC-6] IROSA: Interactive Robot Skill Adaptation using Natural Language

【速读】:该论文旨在解决工业机器人在复杂任务中缺乏灵活适应能力的问题,尤其是在有限数据条件下如何实现高效、安全且可解释的技能迁移。其核心挑战在于如何将大规模预训练语言模型(Large Language Models, LLMs)的能力与机器人控制相结合,同时避免直接交互带来的不稳定性与不可控风险。解决方案的关键在于提出一种基于工具的架构(tool-based architecture),通过引入一个保护性抽象层(protective abstraction layer)隔离语言模型与机器人硬件,并利用预训练LLMs动态选择和参数化特定工具来执行自然语言指令,从而实现无需微调即可完成速度调整、轨迹修正和避障等技能适应,保障了系统的安全性、透明性和可解释性。

链接: https://arxiv.org/abs/2603.03897
作者: Markus Knauer,Samuel Bustamante,Thomas Eiband,Alin Albu-Schäffer,Freek Stulp,João Silvério
机构: German Aerospace Center (DLR); Institute of Robotics and Mechatronics (RMC); Technical University of Munich (TUM)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing

点击查看摘要

Abstract:Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.
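摘要中“保护性抽象层”的核心思想是:LLM 只能在白名单工具中选择并给出参数,由调度器校验后才作用于机器人。以下为一个假设性的最小示意(工具名、参数范围均为虚构,仅展示架构,不代表 IROSA 的实际接口):

```python
# 保护性抽象层示意:LLM 只输出 (工具名, 参数),经调度器校验后才执行(均为假设接口)

def adjust_speed(state, factor):
    return {**state, 'speed': state['speed'] * factor}

def shift_trajectory(state, dz):
    return {**state, 'z_offset': state['z_offset'] + dz}

# 白名单:工具函数 + 参数合法性校验规则
TOOLS = {
    'adjust_speed':     (adjust_speed,     lambda p: 0.1 <= p <= 2.0),
    'shift_trajectory': (shift_trajectory, lambda p: abs(p) <= 0.05),
}

def dispatch(state, tool_name, param):
    """LLM 的调用请求先经白名单与参数校验,越界请求直接拒绝,不触及硬件。"""
    if tool_name not in TOOLS:
        raise ValueError('未知工具: %s' % tool_name)
    fn, is_valid = TOOLS[tool_name]
    if not is_valid(param):
        raise ValueError('参数越界: %r' % (param,))
    return fn(state, param)

state = {'speed': 1.0, 'z_offset': 0.0}
state = dispatch(state, 'adjust_speed', 0.5)   # “慢一点” → 速度减半
print(state['speed'])  # → 0.5
```

这样的设计使安全约束独立于语言模型本身:即使 LLM 给出不合理的调用,抽象层也会在执行前将其拦截,这正是摘要强调“无需微调、无模型-机器人直接交互”仍能保证安全与可解释性的原因。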

[HC-7] On the Suitability of LLM-Driven Agents for Dark Pattern Audits

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的智能体在自主浏览网页时,能否可靠识别界面设计中的“暗黑模式”(dark patterns)这一关键问题。暗黑模式是指通过摩擦、误导或强制等手段影响用户决策的界面设计策略,尤其在涉及消费者数据权利(如CCPA相关请求)的网站中可能显著阻碍用户行使法定权利。解决方案的关键在于构建并部署一个基于大语言模型(LLM)的审计代理(auditing agent),该代理能够端到端完成数据权利请求流程、结构化收集证据,并对潜在暗黑模式进行分类。实验覆盖456个数据经纪商网站,验证了该方法在流程执行一致性、分类可靠性及失败条件分析上的可行性与局限性,为规模化自动化检测暗黑模式提供了技术路径和实证依据。

链接: https://arxiv.org/abs/2603.03881
作者: Chen Sun,Yash Vekaria,Rishab Nithyanand
机构: University of Iowa (爱荷华大学); University of California, Davis (加州大学戴维斯分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As LLM-driven agents begin to autonomously navigate the web, their ability to interpret and respond to manipulative interface design becomes critical. A fundamental question that emerges is: can such agents reliably recognize patterns of friction, misdirection, and coercion in interface design (i.e., dark patterns)? We study this question in a setting where the workflows are consequential: website portals associated with the submission of CCPA-related data rights requests. These portals operationalize statutory rights, but they are implemented as interactive interfaces whose design can be structured to facilitate, burden, or subtly discourage the exercise of those rights. We design and deploy an LLM-driven auditing agent capable of end-to-end traversal of rights-request workflows, structured evidence gathering, and classification of potential dark patterns. Across a set of 456 data broker websites, we evaluate: (1) the ability of the agent to consistently locate and complete request flows, (2) the reliability and reproducibility of its dark pattern classifications, and (3) the conditions under which it fails or produces poor judgments. Our findings characterize both the feasibility and the limitations of using LLM-driven agents for scalable dark pattern auditing.

[HC-8] Understanding Parents' Desires in Moderating Children's Interactions with GenAI Chatbots through LLM-Generated Probes

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 聊天机器人在儿童使用场景中缺乏有效家长控制工具的问题,尤其关注父母如何希望干预和调节儿童与 GenAI 的互动。其解决方案的关键在于通过构建真实且多样化的儿童-GenAI 交互情境(包含提示词和模型回复),系统收集家长对这些情境的担忧、修改偏好及沟通方式反馈,从而揭示出三类核心需求:一是现有 GenAI 家长控制机制忽视的潜在风险;二是家长对对话级粒度的透明度与内容调控的强烈诉求;三是个性化控制机制的重要性,需根据家庭策略和儿童年龄动态调整。这为未来 GenAI 家长控制工具的设计提供了实证依据与方向指引。

链接: https://arxiv.org/abs/2603.03727
作者: John Driscoll,Yulin Chen,Viki Shi,Izak Vucharatavintara,Yaxing Yao,Haojian Jin
机构: University of California, San Diego(加州大学圣地亚哥分校); San Diego State University(圣迭戈州立大学); Johns Hopkins University(约翰霍普金斯大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 33 pages, 10 figures, Accepted to ACM CHI 2026

点击查看摘要

Abstract:This paper studies how parents want to moderate children’s interactions with Generative AI chatbots, with the goal of informing the design of future GenAI parental control tools. We first used an LLM to generate synthetic child-GenAI chatbot interaction scenarios and worked with four parents to validate their realism. From this dataset, we carefully selected 12 diverse examples that evoked varying levels of concern and were rated the most realistic. Each example included a prompt and a GenAI chatbot response. We presented these to parents (N=24) and asked whether they found them concerning, why, and how they would prefer the responses to be modified and communicated. Our findings reveal three key insights: (1) parents express concern about interactions that current GenAI chatbot parental controls neglect; (2) parents want fine-grained transparency and moderation at the conversation level; and (3) parents need personalized controls that adapt to their desired strategies and children’s ages.

[HC-9] UrbanHuRo: A Two-Layer Human-Robot Collaboration Framework for the Joint Optimization of Heterogeneous Urban Services ICRA’26

【速读】:该论文旨在解决城市中异构服务(如配送与城市感知)在孤立优化下效率低下、资源利用率不足的问题,尤其关注如何通过人机协同实现多服务间的联合优化。其核心挑战在于不同服务目标可能存在冲突,且需在动态环境中实现实时协调。解决方案的关键在于提出UrbanHuRo框架,包含两个核心设计:一是基于分布式MapReduce的K-子模最大化模块,用于高效订单调度;二是基于深度子模奖励强化学习的感知路径规划算法,以提升感知覆盖范围并增强配送效率。实验表明,该方案在真实数据集上显著提升了感知覆盖率(+29.7%)和骑手收入(+39.2%),同时减少超时订单数量。

链接: https://arxiv.org/abs/2603.03701
作者: Tonmoy Dey,Lin Jiang,Zheng Dong,Guang Wang
机构: Florida State University (佛罗里达州立大学); Wayne State University (韦恩州立大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: 8 pages, 15 figures. This paper has been accepted by ICRA’26 as a regular paper

点击查看摘要

Abstract:In the vision of smart cities, technologies are being developed to enhance the efficiency of urban services and improve residents’ quality of life. However, most existing research focuses on optimizing individual services in isolation, without adequately considering reciprocal interactions among heterogeneous urban services that could yield higher efficiency and improved resource utilization. For example, human couriers could collect traffic and air quality data along their delivery routes, while sensing robots could assist with on-demand delivery during peak hours, enhancing both sensing coverage and delivery efficiency. However, the joint optimization of different urban services is challenging due to potentially conflicting objectives and the need for real-time coordination in dynamic environments. In this paper, we propose UrbanHuRo, a two-layer human-robot collaboration framework for joint optimization of heterogeneous urban services, demonstrated through crowdsourced delivery and urban sensing. UrbanHuRo includes two key designs: (i) a scalable distributed MapReduce-based K-submodular maximization module for efficient order dispatch, and (ii) a deep submodular reward reinforcement learning algorithm for sensing route planning. Experimental evaluations on real-world datasets from a food delivery platform demonstrate that UrbanHuRo improves sensing coverage by 29.7% and courier income by 39.2% on average in most settings, while also significantly reducing the number of overdue orders.
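摘要中的子模最大化可以用经典的贪心选择来示意:每轮选取边际覆盖增益最大的候选路线。下面是一个简化草图(论文实际采用分布式 MapReduce 的 K-子模最大化,此处仅示意单调子模覆盖函数的贪心思路;路线与数据均为假设):

```python
def greedy_coverage(candidates, k):
    """candidates: {路线: 该路线覆盖的感知单元集合};贪心选 k 条路线使覆盖最大化。"""
    covered, chosen = set(), []
    for _ in range(k):
        # 每轮选取边际增益(新增覆盖单元数)最大的候选路线
        best = max(candidates, key=lambda r: len(candidates[r] - covered))
        if not candidates[best] - covered:
            break  # 已无新增覆盖,提前停止
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

# 假设的三条骑手/机器人路线及其覆盖的感知单元
routes = {'A': {1, 2, 3}, 'B': {3, 4}, 'C': {4, 5, 6, 7}}
chosen, covered = greedy_coverage(routes, k=2)
print(chosen, sorted(covered))  # → ['C', 'A'] [1, 2, 3, 4, 5, 6, 7]
```

对单调子模目标,这类贪心策略有经典的 (1-1/e) ≈ 63% 近似保证,这也是覆盖类问题常借助子模优化建模的理论依据。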

[HC-10] Echoes of Norms: Investigating Counterspeech Bots' Influence on Bystanders in Online Communities

【速读】:该论文旨在解决在线社区中仇恨言论(hate speech)治理问题,特别是现有研究多关注计数言论(counterspeech)聊天机器人对发言者和受害者的干预效果,而忽视了其对旁观者(bystanders)的影响机制。为回应这一空白,作者提出了一种计数言论策略框架,并开发了名为Civilbot的对话系统,通过混合方法的被试内实验验证其作用。研究表明,解决方案的关键在于策略设计的合理性与情境适配性:采用基于认知推理(cognitive strategies)且语气积极的策略能有效提升旁观者的参与意愿;相反,若策略与语境不匹配,则会削弱干预效果甚至引发负面反应。因此,有效的计数言论干预需结合理性引导与情境敏感性设计,以精准动员旁观者并塑造良性网络话语生态。

链接: https://arxiv.org/abs/2603.03687
作者: Mengyao Wang,Shuai Ma,Nuo Li,Peng Zhang,Chenxin Li,Ning Gu,Tun Lu
机构: Fudan University (复旦大学); Chinese Academy of Sciences (中国科学院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to the CHI Conference on Human Factors in Computing Systems (CHI 2026)

点击查看摘要

Abstract:Counterspeech offers a non-repressive approach to moderate hate speech in online communities. Research has examined how counterspeech chatbots restrain hate speakers and support targets, but their impact on bystanders remains unclear. Therefore, we developed a counterspeech strategy framework and built Civilbot for a mixed-method within-subjects study. Bystanders generally viewed Civilbot as credible and normative, though its shallow reasoning limited persuasiveness. Its behavioural effects were subtle: when performing well, it could guide participation or act as a stand-in; when performing poorly, it could discourage bystanders or motivate them to step in. Strategy proved critical: cognitive strategies that appeal to reason, especially when paired with a positive tone, were relatively effective, while mismatch of contexts and strategies could weaken impact. Based on these findings, we offer design insights for mobilizing bystanders and shaping online discourse, highlighting when to intervene and how to do so through reasoning-driven and context-aware strategies.

[HC-11] Inclusive Mobile Learning: How Technology-Enabled Language Choice Supports Multilingual Students

【速读】:该论文试图解决的问题是:在全球范围内,尽管多数学习者为多语言使用者,但实践中实施多语言教育仍面临挑战,导致语言多样性学习者难以平等获取优质教育资源。解决方案的关键在于利用教育技术(EdTech)提供本地语言选项,以降低语言障碍并扩大教育可及性。研究通过在乌干达开展准实验,发现提供卢布·兰戈语(Leb Lango)授课选项显著提升了来自农村地区、教育程度较低和先验知识较弱的学习者的参与度与学习成效,且即使选择英语授课的学习者也因本地语言选项的存在而表现出更高活跃度,最终实现与英语和混合语言组相当的学业成果,验证了本地语言支持在提升EdTech包容性中的核心作用。

链接: https://arxiv.org/abs/2603.03675
作者: Phenyo Phemelo Moletsane,Michael W. Asher,Christine Kwon,Paulo F. Carvalho,Amy Ogan
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI’26

点击查看摘要

Abstract:Most learners worldwide are multilingual, yet implementing multilingual education remains challenging in practice. EdTech offers an opportunity to bridge this gap and expand access for linguistically diverse learners. We conducted a quasi-experiment in Uganda with 2,931 participants enrolled in a non-formal radio- and mobile-based engineering course, where learners self-selected instruction in Leb Lango (a local language), English, or a Hybrid option combining both languages. The Leb Lango version of the course was used disproportionately by learners from rural areas, those with less formal education, and those with lower prior knowledge, broadening participation among disadvantaged learners. Moreover, the availability of Leb Lango instruction was associated with higher active participation, even among learners who registered for English instruction. Although Leb Lango learners began with lower performance, they demonstrated faster learning gains and achieved comparable final examination outcomes to English and Hybrid learners. These results suggest that providing local language options to learners is an effective way to make EdTech more accessible.

[HC-12] Bridging Pedagogy and Play: Introducing a Language Mapping Interface for Human-AI Co-Creation in Educational Game Design

【速读】:该论文试图解决教育游戏设计中非专家教师难以可靠实现特定学习目标的问题,尽管现有创作环境降低了编程门槛,但仍存在教学意图不透明、依赖黑箱式AI建议等挑战。解决方案的关键在于设计了一个基于受控自然语言框架的网页工具,将语言作为主要交互界面,通过用户与大语言模型(Large Language Model, LLM)助理协作构建一种结构化语言,该语言由四个相互关联的组件构成,明确映射教学法(pedagogy)与游戏机制(gameplay),从而降低设计门槛、保留人类在关键决策中的主导权,并支持设计过程中及之后对教学与游戏之间一致性进行反思与调整。

链接: https://arxiv.org/abs/2603.03644
作者: Daijin Yang,Erica Kleinman,Casper Harteveld
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted for CHI EA 26

点击查看摘要

Abstract:Educational games can foster critical thinking, problem-solving, and motivation, yet instructors often find it difficult to design games that reliably achieve specific learning outcomes. Existing authoring environments reduce the need for programming expertise, but they do not eliminate the underlying challenges of educational game design, and they can leave non-expert designers reliant on opaque suggestions from AI systems. We designed a controlled natural language framework-based web tool that positions language as the primary interface for LLM-assisted educational game design. In the tool, users and an LLM assistant collaboratively develop a structured language that maps pedagogy to gameplay through four linked components. We argue that, by making pedagogical intent explicit and editable in the interface, the tool has the potential to lower design barriers for non-expert designers, preserves human agency in critical decisions, and enables alignment and reflections between pedagogy and gameplay during and after co-creation.

[HC-13] Modelling Visuo-Haptic Perception Change in Size Estimation Tasks

【速读】:该论文旨在解决多感官交互中感知漂移(perception drift)的问题,特别是视觉与触觉(haptic)在时间维度上如何协同作用以维持对物体尺寸的准确感知。研究发现,当视觉线索被干扰时,个体依赖先验信念和持续更新的感知循环来调整判断,但这一机制此前尚未被充分理解。解决方案的关键在于提出一个循环自适应模型(cyclical, self-adjusting model),该模型揭示了视觉引导(visual priming)与路径积分(dead-reckoning)在调节跨模态感知一致性中的作用,并表明感知并非静态,而是基于实时感官输入与内部信念的动态融合过程。此模型不仅为虚拟现实(VR)中的错觉感知提供了理论依据,也深化了对视觉与触觉系统协作与分离机制的理解。

链接: https://arxiv.org/abs/2603.03614
作者: Jian Zhang,Wafa Johal,Jarrod Knibbe
机构: University of Melbourne (墨尔本大学); The University of Queensland (昆士兰大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Tangible interactions involve multiple sensory cues, enabling the accurate perception of object properties, such as size. Research has shown, however, that if we decouple these cues (for example, by altering the visual cue), then the resulting discrepancies present new opportunities for interactions. Perception over time though, not only relies on momentary sensory cues, but also on a priori beliefs about the object, implying a continuing update cycle. This cycle is poorly understood and its impact on interaction remains unknown. We study (N=80) visuo-haptic perception of size over time and (a) reveal how perception drifts, (b) examine the effects of visual priming and dead-reckoning, and (c) present a model of visuo-haptic perception as a cyclical, self-adjusting system. Our work has a direct impact on illusory perception in VR, but also sheds light on how our visual and haptic systems cooperate and diverge.
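摘要中“循环自适应”模型的核心思想——当前感知由先验信念与即时视觉、触觉线索不断加权融合——可以用一个极简的更新循环示意(权重与数值均为假设,并非论文拟合的模型):

```python
def perceive(belief, visual, haptic, w_v=0.5, w_h=0.3):
    """单步感知更新:先验信念与视觉、触觉线索加权融合(权重为假设值)。"""
    w_prior = 1.0 - w_v - w_h   # 剩余权重留给先验信念
    return w_prior * belief + w_v * visual + w_h * haptic

belief = 10.0                 # 对物体尺寸的初始信念(cm)
visual, haptic = 12.0, 10.0   # 视觉线索被放大,触觉保持真实尺寸
for _ in range(20):           # 感知随循环逐步漂移并收敛
    belief = perceive(belief, visual, haptic)
print(round(belief, 4))  # → 11.25(固定点:0.8*b = 0.5*12 + 0.3*10)
```

循环多次后,信念收敛到偏向视觉线索的折中值(本例为 11.25 cm),直观展示了当视觉与触觉线索解耦时,感知为何会随时间“漂移”。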

[HC-14] Are You Comfortable Sharing It?: Leveraging Image Obfuscation Techniques to Enhance Sharing Privacy for Blind and Visually Impaired Users

【速读】:该论文旨在解决盲人及视障人群(Blind and Visually Impaired, BVI)在图像分享过程中可能无意暴露敏感或不适当内容的问题,此类行为可能损害隐私并影响人际关系。解决方案的关键在于通过图像过滤技术帮助BVI用户在分享前识别并处理敏感内容,研究发现像素化(pixelation)是最不受欢迎的过滤方式,而其他滤镜效果则因图像类型和分享对象的不同而呈现差异性偏好;此外,参与者普遍表示使用过滤后的图像进行分享更具安全感。基于实证结果,作者提出了针对性的设计指南以优化BVI用户的图像共享体验。

链接: https://arxiv.org/abs/2603.03606
作者: Satabdi Das,Nahian Beente Firuj,Manjot Singh,Arshad Nasser,Khalad Hasan
机构: The University of British Columbia (不列颠哥伦比亚大学); Shahjalal University of Science and Technology (沙赫贾拉尔科技大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI 2026

点击查看摘要

Abstract:People with Blind Visual Impairments (BVI) face unique challenges when sharing images, as these may accidentally contain sensitive or inappropriate content. In many instances, they are unaware of the potential risks associated with sharing such content, which can compromise their privacy and interpersonal relationships. To address this issue, we investigated image filtering techniques that could help BVI users manage sensitive content before sharing with various audiences, including family, friends, or strangers. We conducted a study with 20 BVI participants, evaluating different filters applied to images varying in sensitivity, such as personal moments or embarrassing shots. Results indicated that pixelation was the least preferred method, while preferences for other filters varied depending on image type and sharing context. Additionally, participants reported greater comfort when sharing filtered versus unfiltered images across audiences. Based on the results, we offer a set of design guidelines to enhance the image-sharing experience for BVI individuals.
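文中评估的像素化(pixelation)滤镜,常见实现是按块取均值后整块覆盖(效果等价于先缩小再最近邻放大,实际应用中可借助 Pillow 的 `Image.resize` 完成)。以下为纯 Python 的最小示意,图像数据为假设:

```python
def pixelate(img, block=2):
    """img: 二维灰度列表;每个 block×block 区域用其均值整块覆盖,抹去细节。"""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            avg = sum(img[y][x] for y in ys for x in xs) // (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = avg  # 整块替换为均值
    return out

img = [[0, 2, 8, 6],
       [2, 0, 6, 8],
       [3, 5, 1, 3],
       [5, 3, 3, 1]]
for row in pixelate(img, block=2):
    print(row)
# → [1, 1, 7, 7] / [1, 1, 7, 7] / [4, 4, 2, 2] / [4, 4, 2, 2]
```

块尺寸越大,保留的结构信息越少;这也解释了为何像素化在可用性与隐私之间的取舍上最受参与者诟病。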

[HC-15] Inline Visualization and Manipulation of Real-Time Hardware Log for Supporting Debugging of Embedded Programs

【速读】:该论文旨在解决嵌入式系统(embedded systems)在调试过程中因软硬件问题交织而带来的挑战,尤其针对如Arduino这类用户友好的嵌入式原型开发平台。现有调试工具通常依赖硬件探针或通过串口监视器进行日志可视化,效率较低且不够直观。解决方案的关键在于设计了一种名为Inline的编程工具,其核心创新是将硬件日志直接内嵌到代码中显示,支持实时执行流程追踪,并引入表达式语言用于日志的动态处理与分析,从而显著提升调试效率和用户体验。

链接: https://arxiv.org/abs/2603.03605
作者: Andrea Bianchi,Zhi Lin Yap,Punn Lertjaturaphat,Austin Z. Henley,Kongpyung Justin Moon,Yoonji Kim
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 26 pages, 12 figures

点击查看摘要

Abstract:The development of user-friendly embedded prototyping systems like Arduino has made creating interactive devices more accessible. However, debugging these systems is challenging due to the intertwined nature of software and hardware issues. Existing tools often require hardware instrumentation or log visualization through serial monitors. To address this, the authors designed Inline, a programming tool that simplifies debugging by displaying hardware logs directly within the code, providing real-time execution flow tracking and an expression language for log manipulation. A study with twelve users demonstrated the tool’s effectiveness in aiding debugging tasks.

[HC-16] Human-centered Perspectives on a Clinical Decision Support System for Intensive Outpatient Veteran PTSD Care

【速读】:该论文旨在解决心理治疗中患者自我报告与临床直觉之间张力的协调问题,特别是在针对退伍军人的延长暴露(Prolonged Exposure, PE)疗法中,如何通过技术手段辅助临床决策。其解决方案的关键在于设计一个面向PE疗法的临床决策支持系统(Clinical Decision Support System, CDSS)原型,并通过两阶段访谈研究收集执业PE治疗师和前PE患者(均为美国退伍军人)的反馈,从而提炼出CDSS在实践中的机会(如快速回顾治疗作业、辅助患者概念化)及部署挑战(如适应退伍军人事务部VA的复杂环境)。研究进一步借助分布式认知、情境学习和基础设施反转三个以用户为中心的视角,揭示了为精神科医生设计CDSS时所面临的深层复杂性,并提出了与理论相契合的设计考量。

链接: https://arxiv.org/abs/2603.03467
作者: Cynthia M. Baseman,Myeonghan Ryu,Nathaniel Swinger,Kefan Xu,Andrew M. Sherrill,Rosa I. Arriaga
机构: Georgia Institute of Technology (佐治亚理工学院); Emory University School of Medicine (埃默里大学医学院)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, 1 figure. The first two authors contributed equally to this research. Accepted to appear in the Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI 2026)

点击查看摘要

Abstract:Psychotherapy delivery relies on a negotiation between patient self-reports and clinical intuition. Growing evidence for technological support of psychotherapy suggests opportunities to aid the mediation of this tension. To explore this prospect, we designed a prototype of a clinical decision support system (CDSS) for treating veterans with post-traumatic stress disorder in a Prolonged Exposure (PE) therapy intensive outpatient program. We conducted a two-phase interview study to collect perspectives from practicing PE clinicians and former PE patients who are United States veterans. Our analysis distills opportunities for a CDSS (e.g., offering homework review at a glance, aiding patient conceptualization) and larger challenges related to context and deployment (e.g., navigating Veterans Affairs). By reframing our findings through three human-centered perspectives (distributed cognition, situated learning, infrastructural inversion), we highlight the complexities of designing a CDSS for psychotherapists in this context and offer theory-aligned design considerations.

[HC-17] Designing with Medical Mistrust: Perspectives from Black Older Adults in Publicly Subsidized Housing

【速读】:该论文旨在解决当前健康技术设计中对种族相关医疗不信任(race-based medical mistrust)关注不足的问题,尤其是黑人低收入老年人群体在医疗系统中的历史创伤与现实困境未被充分纳入人本计算(human-centered computing)的研究视野。其解决方案的关键在于通过深度访谈揭示社区视角下的医疗信任机制,识别出认证与身体体验(accreditation and embodiment)、对经济动机的怀疑以及对健康人工智能(Health AI)意图的质疑等核心主题,并借助黑人女性主义理论(Black Feminist Thought)重构研究发现,提出一套以历史根基为基础的健康自我管理技术设计原则,强调研究者需反思自身位置性(positionality)以实现真正共情和包容的设计实践。

链接: https://arxiv.org/abs/2603.03416
作者: Cynthia M. Baseman,Reeda Shimaz Huda,Rosa I. Arriaga
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 18 pages, 1 figure, Accepted to appear in the Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI 2026)

点击查看摘要

Abstract:Despite increasing interest in culturally-sensitive health technologies, medical mistrust remains largely unexplored within human-centered computing. Considered a social determinant of health, medical mistrust is the belief that healthcare providers or institutions are acting against one’s best interest. This is a rational, protective response based on historical context, structural inequities, and discrimination. To center race-based medical mistrust and the lived experiences of Black older adults with low income, we conducted interviews within publicly subsidized housing in the Southern United States. Our reflexive themes describe community perspectives on health care and medical mistrust, including accreditation and embodiment, skepticism of financial motivations, and the intentions behind health AI. We provide a reflective exercise for researchers to consider their positionality in relation to community engagements, and reframe our findings through Black Feminist Thought to propose design principles for health self-management technologies for communities with historically grounded medical mistrust.

[HC-18] Arapai: An Offline-First AI Chatbot Architecture for Low-Connectivity Educational Environments

【速读】:该论文旨在解决当前教育领域中基于大语言模型(Large Language Models, LLMs)的AI聊天机器人普遍依赖互联网连接、云端基础设施和高性能硬件的问题,这些问题加剧了数字鸿沟,限制了在带宽受限和资源匮乏环境中的实际部署。解决方案的关键在于提出一种“离线优先”(offline-first)的AI聊天机器人架构Arapai,其核心包括:本地化部署的量化语言模型、自动适配硬件的模型选择机制,以及基于教学层级的响应控制策略;通过完全在设备端执行推理并保持模型常驻内存以优化性能,实现了课程对齐的解释、结构化问题求解支持及差异化教学深度,从而在无网络环境下仍能提供稳定、高效且具有教育价值的智能辅导服务。

链接: https://arxiv.org/abs/2603.03339
作者: Joseph Walusimbi,Ann Move Oguti,Joshua Benjamin Ssentongo,Keith Ainebyona
机构: 未知
类目: Computers and Society (cs.CY); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 16 pages, 1 table, 12 figures

点击查看摘要

Abstract:The rapid global expansion of large language models (LLMs) has created new opportunities for personalised and inquiry-driven learning. However, most AI chatbot systems for education rely on continuous internet connectivity, cloud infrastructure, and modern hardware. These requirements reinforce digital inequalities and limit the practical deployment of AI-supported learning in bandwidth-constrained and resource-limited environments worldwide. This paper presents Arapai, an offline-first AI chatbot architecture designed to operate entirely without internet connectivity on low-specification, CPU-only devices. The system integrates locally hosted, quantised language models with automatic hardware-aware model selection and pedagogically tiered response control. By performing inference fully on-device and maintaining models resident in memory for performance optimisation, Arapai delivers curriculum-aligned explanations, structured problem-solving support, and differentiated instructional depth without reliance on cloud services. A pilot deployment in secondary and tertiary institutions operating under limited-connectivity conditions evaluated the system across four dimensions: technical performance, usability, perceived answer quality, and educational impact. Results indicate stable operation on legacy hardware, acceptable response times for standard instructional queries, and positive learner and teacher perceptions regarding self-directed learning support. Rather than replacing cloud-based AI systems, this work proposes a complementary deployment paradigm for infrastructure-constrained education systems. The study contributes a hardware-aware architectural framework for decentralised AI tutoring and highlights the role of offline-first design in advancing digital inclusion and infrastructure-resilient educational technology.
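摘要中的“硬件感知自动选型”可以概括为:在候选量化模型中选出能装入可用内存、且能力最强的那个。以下为一个假设性示意(模型名、内存需求与能力分值均为虚构,并非 Arapai 的真实配置):

```python
# 硬件感知的模型自动选型示意:候选表与数值均为假设,非 Arapai 实际配置
CANDIDATES = [
    # (模型标识, 所需内存 GB, 能力分值)
    ('llm-7b-q4', 6.0, 3),
    ('llm-3b-q4', 3.0, 2),
    ('llm-1b-q8', 1.5, 1),
]

def select_model(available_ram_gb, headroom=0.8):
    """预留 20% 内存余量,在装得下的候选中取能力分值最高者。"""
    usable = available_ram_gb * headroom
    fitting = [m for m in CANDIDATES if m[1] <= usable]
    if not fitting:
        raise RuntimeError('内存不足,无可用模型')
    return max(fitting, key=lambda m: m[2])[0]

print(select_model(8.0))  # → llm-7b-q4(可用 6.4 GB,装得下 7B 量化模型)
print(select_model(2.5))  # → llm-1b-q8(可用 2.0 GB,仅 1B 模型可装入)
```

选定后将模型常驻内存即可避免重复加载,这对应摘要中“models resident in memory for performance optimisation”的设计。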

[HC-19] Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

【速读】:该论文旨在解决从非侵入式脑电图(EEG)信号中解码自然语言时存在的三大核心挑战:语义偏倚(模式坍缩至通用模板)、信号忽视(基于语言先验而非神经输入产生幻觉)以及BLEU指标陷阱(高频停用词导致评估指标虚高,掩盖真实语义保真度不足)。其解决方案的关键在于提出一种名为SemKey的多阶段框架,通过四个解耦的语义目标(情感、主题、长度和意外性)强制生成过程与神经信号对齐;具体而言,重构神经编码器与大语言模型(LLM)之间的交互机制,将语义提示作为查询(Query),EEG嵌入作为键值对(Key-Value),严格引导模型关注神经输入,从而实现信号驱动的生成。此外,采用N-way检索准确率和弗雷歇距离等更严格的评估指标,有效验证了方法在多样性和语义一致性上的显著提升。

链接: https://arxiv.org/abs/2603.03312
作者: Yuchen Wang,Haonan Wang,Yu Guo,Honglong Yang,Xiaomeng Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high-frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key-Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N-way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at this https URL.
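SemKey 将语义提示作为 Query、EEG 嵌入作为 Key/Value 的交叉注意力,可用标准缩放点积注意力示意(纯 Python、极小维度,仅演示“语义提示查询神经信号”的数据流向,非论文实现):

```python
import math

def cross_attention(query, keys, values):
    """单个 Query 对一组 Key/Value 的缩放点积注意力(极简实现)。"""
    d = len(query)
    # 打分:Query 与每个 Key 的点积 / sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # softmax 归一化为注意力权重(减去最大值保证数值稳定)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 输出 = Value 的加权和:语义提示“查询”到与之最相关的 EEG 片段
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

semantic_prompt = [1.0, 0.0]             # Query:语义提示(如情感/主题)的嵌入
eeg_keys   = [[1.0, 0.0], [0.0, 1.0]]    # Key:两个 EEG 片段的嵌入
eeg_values = [[5.0, 0.0], [0.0, 5.0]]    # Value:与 Key 一一对应的 EEG 表征
out = cross_attention(semantic_prompt, eeg_keys, eeg_values)
# 与 Query 方向一致的第一个 EEG 片段获得更高权重,主导输出
```

这一角色分配强制输出依赖 EEG 输入(Key/Value),语义目标仅作为检索线索(Query),对应论文缓解“信号忽视”(Signal Neglect)的动机。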

[HC-20] Deep Sketch-Based 3D Modeling: A Survey

【速读】:该论文旨在解决传统草图驱动的三维建模(Sketch-Based 3D Modeling, SB3DM)中长期存在的草图抽象性与歧义性问题,这些问题限制了用户在交互过程中的灵活性、可用性与表达准确性。其解决方案的关键在于提出了一种基于新设计空间MORPHEUS的深度草图驱动三维建模(Deep Sketch-Based 3D Modeling, DS-3DM)系统框架,该框架以输入-模型-输出(Input-Model-Output, IMO)为基础,将模型输出选项(如3D表示形式和部件)与人类输入(数量与模态多样性)及用户视角和风格评估相结合,从而提升生成结果的可控性和信息丰富度,推动计算机视觉、计算机图形学与人机交互领域的跨学科研究,使设计流程更贴合用户意图。

链接: https://arxiv.org/abs/2603.03287
作者: Alberto Tono,Jiajun Wu,Gordon Wetzstein,Iro Armeni,Hariharan Subramonyam,James Landay,Martin Fischer
机构: Stanford University (斯坦福大学); Computational Design Institute (计算设计研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In the past decade, advances in artificial intelligence have revolutionized sketch-based 3D modeling, leading to a new paradigm known as Deep Sketch-Based 3D Modeling (DS-3DM). DS-3DM offers data-driven methods that address the long-standing challenges of sketch abstraction and ambiguity. DS-3DM keeps humans at the center of the creative process by enhancing the flexibility, usability, faithfulness, and adaptability of sketch-based 3D modeling interfaces. This paper contributes a comprehensive survey of the latest DS-3DM within a novel design space: MORPHEUS. Built upon the Input-Model-Output (IMO) framework, MORPHEUS categorizes Models outputting Options of 3D Representations and Parts, derived from Human inputs (varying in quantity and modality), and Evaluated across diverse User-views and Styles. Throughout MORPHEUS we highlight limitations and identify opportunities for interdisciplinary research in Computer Vision, Computer Graphics, and Human-Computer Interaction, revealing a need for controllability and information-rich outputs. These opportunities align design processes more closely with users’ intent, responding to the growing importance of user-centered approaches.

计算机视觉

[CV-0] SimpliHuMoN: Simplifying Human Motion Prediction

【速读】:该论文旨在解决人类运动预测(human motion prediction)中轨迹预测(trajectory forecasting)与人体姿态预测(pose prediction)任务难以协同建模的问题。现有方法通常针对单一任务设计专用模型,而将两者结合时性能受限,难以在基准测试中取得理想结果。解决方案的关键在于提出一种基于Transformer的统一端到端模型,通过堆叠自注意力模块(self-attention modules)有效捕捉单帧内姿态的空间依赖关系和跨时间步的运动序列时序关系,从而无需任务特定调整即可同时处理姿态仅预测、轨迹仅预测及联合预测任务,并在多个主流数据集(如Human3.6M、AMASS、ETH-UCY和3DPW)上实现最优性能。

链接: https://arxiv.org/abs/2603.04399
作者: Aadya Agrawal,Alexander Schwing
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 7 figures. Preprint

点击查看摘要

Abstract:Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been developed. Combining these models for holistic human motion prediction is non-trivial, and recent methods have struggled to compete on established benchmarks for individual tasks. To address this, we propose a simple yet effective transformer-based model for human motion prediction. The model employs a stack of self-attention modules to effectively capture both spatial dependencies within a pose and temporal relationships across a motion sequence. This simple, streamlined, end-to-end model is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. We demonstrate that this approach achieves state-of-the-art results across all tasks through extensive experiments on a wide range of benchmark datasets, including Human3.6M, AMASS, ETH-UCY, and 3DPW.

[CV-1] ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training

【速读】:该论文旨在解决当前基于前向传播(feed-forward)的Transformer模型在3D视觉任务中计算复杂度高、难以扩展至大规模图像集合的问题。具体而言,现有方法如VGGT和π³的计算成本随输入图像数量呈二次增长,导致在处理大量图像时效率低下;而顺序重建方法虽降低了计算开销,却牺牲了重建质量。论文提出的解决方案是引入ZipMap——一种具有状态记忆能力的前向模型,其核心创新在于通过测试时训练(test-time training)层将整个图像集合压缩为一个紧凑的隐式场景状态(hidden scene state),从而在单次前向传播中实现线性时间、双向的3D重建。该机制不仅显著提升了速度(在单张H100 GPU上可完成超过700帧的重建且耗时少于10秒,比最优方法快20倍以上),还保留甚至超越了二次复杂度方法的精度,并支持实时场景状态查询与流式序列重建。

链接: https://arxiv.org/abs/2603.04385
作者: Haian Jin,Rundi Wu,Tianyuan Zhang,Ruiqi Gao,Jonathan T. Barron,Noah Snavely,Aleksander Holynski
机构: Google DeepMind; Cornell University; Massachusetts Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and π³ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than 20× faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
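ZipMap所用test-time training层的具体结构摘要中并未公开;下面以TTT风格的"快权重"线性状态作一极简示意(维度、学习率与重构损失形式均为假设),说明如何在一次线性时间遍历中把帧流压缩进固定大小的隐藏状态:

```python
import numpy as np

def ttt_step(state, frame, lr=0.05):
    # one test-time-training update: nudge the hidden state W toward
    # reconstructing the current frame feature (loss = ||f - W f||^2)
    err = frame - state @ frame
    return state + lr * np.outer(err, frame)

rng = np.random.default_rng(0)
d = 8                                    # hypothetical feature dimension
state = np.zeros((d, d))                 # compact hidden scene state
for frame in rng.normal(size=(200, d)):  # one linear-time pass over the stream
    state = ttt_step(state, frame)
print(state.shape)  # → (8, 8)
```

无论输入多少帧,状态大小固定、每帧代价恒定,这就是摘要中"线性时间"与"有状态表征"两个性质的直观来源。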

[CV-2] Helios: Real Real-Time Long Video Generation Model

【速读】:该论文旨在解决长视频生成中普遍存在的时间漂移(drifting)问题,以及现有模型在实时生成效率和训练可扩展性方面的局限性。其核心挑战在于:如何在不依赖传统抗漂移策略(如自强制、误差库或关键帧采样)的前提下实现稳定、高质量的分钟级视频生成;同时,在不使用KV缓存、稀疏注意力等加速技术的情况下实现近实时推理,并在单卡环境下完成大规模模型训练。解决方案的关键在于三个方面:一是通过显式模拟漂移过程并优化训练策略来从源头抑制重复运动与漂移;二是采用高效的历史上下文压缩与采样步数减少机制,使计算成本低于甚至等于1.3B规模模型;三是引入基础设施级优化以降低内存占用并提升训练与推理速度。最终,Helios作为首个可在单张NVIDIA H100 GPU上以19.5 FPS运行的14B参数自回归扩散视频生成模型,实现了高保真度、低延迟和易部署的统一视频生成能力(支持文本到视频、图像到视频及视频到视频任务)。

链接: https://arxiv.org/abs/2603.04379
作者: Shenghai Yuan,Yuanyang Yin,Zongjian Li,Xinwei Huang,Xiao Yang,Li Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Page: this http URL

点击查看摘要

Abstract:We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to – or lower than – those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.

[CV-3] FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理长时程第一人称视频理解任务时面临的两个核心问题:一是随着输入帧数增加,模型响应质量下降;二是推理时间显著增长。为应对上述挑战,论文提出FocusGraph框架,其关键在于引入一个轻量级可训练的Scene-Caption LLM Selector,该模块基于图结构化的场景描述文本而非原始低分辨率帧序列来选择与查询相关的片段,从而实现高效的语义感知片段筛选;随后采用无需训练的Patch-wise Sparse-Flow Retention(PSFR)方法从选定片段中进一步提取关键帧,最终将这些关键帧输入MLLM以生成答案。此双阶段设计在保持高准确率的同时大幅降低计算开销,实现在FindingDory和HourVideo等基准上的SOTA性能。

链接: https://arxiv.org/abs/2603.04349
作者: Tatiana Zemskova,Solomon Andryushenko,Ilya Obrubov,Viktoriia Khoruzhaia,Ekaterina Eroshenko,Ekaterina Derevyanka,Dmitry Yudin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
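PSFR的具体实现摘要未展开;以下以帧间差分作为稀疏光流强度的粗略替代,示意这类训练无关的关键帧挑选思路(帧尺寸与打分方式均为假设):

```python
import numpy as np

def select_keyframes(frames, k):
    # rank frames by mean inter-frame change, a crude stand-in for the
    # patch-wise flow-magnitude signal used by sparse-flow retention
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # (T-1,)
    picked = np.argsort(diffs)[::-1][:k] + 1  # diff i measures change into frame i+1
    return np.sort(picked)

frames = np.zeros((10, 4, 4))
frames[5:] = 1.0                                # one abrupt scene change at frame 5
print(select_keyframes(frames, k=1).tolist())   # → [5]
```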

[CV-4] RANGER: Sparsely-Gated Mixture-of-Experts with Adaptive Retrieval Re-ranking for Pathology Report Generation

【速读】:该论文旨在解决病理报告生成任务中因全切片图像(Whole Slide Images, WSIs)的吉字节级规模和复杂的形态异质性所带来的挑战,以及现有基于Transformer架构的生成框架在解码器结构同质化和静态知识检索整合方面存在的局限性,这些问题限制了生成的专业特化能力并可能引入噪声干扰。解决方案的关键在于提出RANGER框架,其核心创新包括:1)在解码器中集成稀疏门控专家混合(sparsely-gated Mixture-of-Experts, MoE)机制,结合噪声敏感的top-k路由策略与负载平衡正则化,实现针对不同诊断模式的动态专家专业化;2)设计自适应检索重排序模块,在知识库记忆整合前基于视觉特征表示选择性地优化检索内容,降低噪声并提升语义对齐度,从而增强病理报告的语义基础性和准确性。

链接: https://arxiv.org/abs/2603.04348
作者: Yixin Chen,Ziyu Su,Hikmat Khan,Muhammad Khalid Khan Niazi
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pathology report generation remains a relatively under-explored downstream task, primarily due to the gigapixel scale and complex morphological heterogeneity of Whole Slide Images (WSIs). Existing pathology report generation frameworks typically employ transformer architectures, relying on a homogeneous decoder architecture and static knowledge retrieval integration. Such architectures limit generative specialization and may introduce noisy external guidance during the report generation process. To address these limitations, we propose RANGER, a sparsely-gated Mixture-of-Experts (MoE) framework with adaptive retrieval re-ranking for pathology report generation. Specifically, we integrate a sparsely gated MoE into the decoder, along with noisy top-k routing and load-balancing regularization, to enable dynamic expert specialization across various diagnostic patterns. Additionally, we introduce an adaptive retrieval re-ranking module that selectively refines retrieved memory from a knowledge base before integration, reducing noise and improving semantic alignment based on visual feature representations. We perform extensive experiments on the PathText-BRCA dataset and demonstrate consistent improvements over existing approaches across standard natural language generation metrics. Our full RANGER model achieves optimal performance on PathText dataset, reaching BLEU-1 to BLEU-4 scores of 0.4598, 0.3044, 0.2036, and 0.1435, respectively, with METEOR of 0.1883, and ROUGE-L of 0.3038, validating the effectiveness of dynamic expert routing and adaptive knowledge refinement for semantically grounded pathology report generation.
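摘要中的噪声top-k路由通常指Shazeer式稀疏门控:门控logits加上输入相关的高斯噪声后,仅top-k个专家获得非零、经softmax归一化的权重。以下为numpy示意(权重随机初始化、维度为假设,负载平衡正则项此处省略):

```python
import numpy as np

def noisy_top_k_gate(x, w_gate, w_noise, k, rng):
    # noisy top-k gating: perturb logits with input-dependent Gaussian noise
    # (std = softplus(x @ w_noise)), then keep only the top-k experts with
    # softmax-renormalized weights; all other experts are skipped entirely
    std = np.log1p(np.exp(x @ w_noise))             # softplus noise scale
    logits = x @ w_gate + rng.standard_normal(w_gate.shape[1]) * std
    top = np.argsort(logits)[-k:]
    gates = np.zeros_like(logits)
    z = np.exp(logits[top] - logits[top].max())
    gates[top] = z / z.sum()
    return gates

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=d)
gates = noisy_top_k_gate(x, rng.normal(size=(d, n_experts)),
                         rng.normal(size=(d, n_experts)), k=2, rng=rng)
print(int((gates > 0).sum()))  # → 2
```

稀疏性意味着每个token的前向计算只触发k个专家,这是MoE解码器在扩大容量的同时控制开销的关键。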

[CV-5] Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe

【速读】:该论文旨在解决大规模视觉-语言基础模型(Vision-Language Foundation Models, VLFMs)在新兴、专业或代表性不足领域中零样本性能难以评估的问题,尤其是在缺乏标注测试集的情况下(如全球南方地区的数据)。传统评估方法依赖于大量标注数据,而这些数据在许多实际场景中不可获得。解决方案的关键在于提出一种高数据效率的预测方法:仅需每类一个标注图像,利用大语言模型(Large Language Model, LLM)生成该图像的合理反事实描述作为难负样本,通过测量VLFM在共享嵌入空间中区分正确描述与这些难负样本的能力,提取反映其判别力的特征;随后使用线性回归器基于这些相似度得分预测VLFM在目标域上的零样本准确率,实验证明该方法在多个数据集上达到Pearson相关系数0.96,显著提升了对VLFM性能的可测性和可解释性。

链接: https://arxiv.org/abs/2603.04346
作者: Chris Vorster,Mayug Maniparambil,Noel E. O’Connor,Noel Murphy,Derek Molloy
机构: ML-Labs, Dublin City University, Dublin, Ireland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM’s zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM’s ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM’s discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM’s zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method’s performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: this https URL.
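该方法的核心是从"真描述 vs 反事实难负样本"的相似度中构造特征,再用线性回归器预测零样本准确率。以下示意特征构造与最小二乘拟合(全部嵌入与准确率均为随机合成数据,仅演示流程):

```python
import numpy as np

def probe_features(img, true_cap, counterfactuals):
    # similarity features: how cleanly the VLM separates the true caption
    # from LLM-generated counterfactual hard negatives in embedding space
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_true = cos(img, true_cap)
    s_neg = np.array([cos(img, c) for c in counterfactuals])
    return np.array([s_true, s_neg.max(), s_true - s_neg.max(), s_neg.mean()])

rng = np.random.default_rng(0)
feats = np.stack([probe_features(rng.normal(size=64),       # synthetic stand-ins
                                 rng.normal(size=64),       # for VLM embeddings
                                 rng.normal(size=(5, 64))) for _ in range(20)])
acc = rng.uniform(0.3, 0.9, size=20)        # stand-in zero-shot accuracies
X = np.hstack([feats, np.ones((20, 1))])    # linear regressor with bias term
w, *_ = np.linalg.lstsq(X, acc, rcond=None)
print(w.shape)  # → (5,)
```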

[CV-6] Enhancing Authorship Attribution with Synthetic Paintings ICML

【速读】:该论文旨在解决艺术作品作者归属(authorship attribution)任务中因真实艺术品数据稀缺而导致的计算模型训练困难问题。其解决方案的关键在于提出一种混合方法,通过DreamBooth微调生成的合成图像与真实绘画数据相结合,从而提升分类模型在相似艺术风格下的准确性和泛化能力。实验表明,引入合成图像可显著提高ROC-AUC和准确率,验证了生成式AI(Generative AI)与判别式方法融合在数据稀缺场景下对艺术作品认证的有效性。

链接: https://arxiv.org/abs/2603.04343
作者: Clarissa Loures,Caio Hosken,Luan Oliveira,Gianlucca Zuin,Adriano Veloso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at the 24th IEEE International Conference on Machine Learning and Applications (ICMLA 2025)

点击查看摘要

Abstract:Attributing authorship to paintings is a historically complex task, and one of its main challenges is the limited availability of real artworks for training computational models. This study investigates whether synthetic images, generated through DreamBooth fine-tuning of Stable Diffusion, can improve the performance of classification models in this context. We propose a hybrid approach that combines real and synthetic data to enhance model accuracy and generalization across similar artistic styles. Experimental results show that adding synthetic images leads to higher ROC-AUC and accuracy compared to using only real paintings. By integrating generative and discriminative methods, this work contributes to the development of computer vision techniques for artwork authentication in data-scarce scenarios.

[CV-7] Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters

【速读】:该论文旨在解决现有CLIP(Contrastive Language–Image Pretraining)适配方法中依赖验证集选择混合比例(blending ratio)的问题,这类方法在实际应用中无法严格满足少样本(few-shot)场景的要求。其关键解决方案是提出一种无需验证集的“Hold-One-Shot-Out”(HOSO)机制:在训练过程中,从原始支持样本中保留一个样本作为“hold-out”集用于学习最优混合比例,其余样本用于训练适配器(adapter),从而实现真正的验证-free少样本适配。实验表明,HOSO-Adapter在11个标准少样本数据集上平均性能提升超过4个百分点,且在8-shot和16-shot设置下优于使用测试集优化混合比例的基线模型。

链接: https://arxiv.org/abs/2603.04341
作者: Chris Vorster,Mayug Maniparambil,Noel E. O’Connor,Noel Murphy,Derek Molloy
机构: ML-Labs, Dublin City University, Dublin, Ireland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In many CLIP adaptation methods, a blending ratio hyperparameter controls the trade-off between general pretrained CLIP knowledge and the limited, dataset-specific supervision from the few-shot cases. Most few-shot CLIP adaptation techniques report results by ablation of the blending ratio on the test set or require additional validation sets to select the blending ratio per dataset, and thus are not strictly few-shot. We present a simple, validation-free method for learning the blending ratio in CLIP adaptation. Hold-One-Shot-Out (HOSO) presents a novel approach for CLIP-Adapter-style methods to compete in the newly established validation-free setting. CLIP-Adapter with HOSO (HOSO-Adapter) learns the blending ratio using a one-shot, hold-out set, while the adapter trains on the remaining few-shot support examples. Under the validation-free few-shot protocol, HOSO-Adapter outperforms the CLIP-Adapter baseline by more than 4 percentage points on average across 11 standard few-shot datasets. Interestingly, in the 8- and 16-shot settings, HOSO-Adapter outperforms CLIP-Adapter even with the optimal blending ratio selected on the test set. Ablation studies validate the use of a one-shot hold-out mechanism, decoupled training, and improvements over the naively learnt blending ratio baseline. Code is released here: this https URL
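HOSO学习混合比例的流程可以概括为:在保留的one-shot集合上对α做网格搜索,其余样本训练适配器。以下为示意(分数矩阵为人工构造的玩具数据,网格与打分方式为假设):

```python
import numpy as np

def choose_alpha(clip_scores, adapter_scores, holdout_labels, alphas):
    # sweep the blending ratio and keep the one that best classifies the
    # held-out one-shot examples; the adapter itself trains on the rest
    best_alpha, best_acc = alphas[0], -1.0
    for a in alphas:
        blended = a * adapter_scores + (1 - a) * clip_scores
        acc = float((blended.argmax(axis=1) == holdout_labels).mean())
        if acc > best_acc:
            best_alpha, best_acc = a, acc
    return best_alpha, best_acc

# toy case: zero-shot CLIP favours the wrong class, the adapter the right one
clip_scores = np.array([[0.9, 0.1], [0.8, 0.2]])
adapter_scores = np.array([[0.2, 0.8], [0.1, 0.9]])
labels = np.array([1, 1])
alpha, acc = choose_alpha(clip_scores, adapter_scores, labels,
                          [0.0, 0.25, 0.5, 0.75, 1.0])
print(alpha, acc)  # → 0.75 1.0
```

由于α的选择只消耗支持集内部的一个样本,整个流程无需额外验证集,符合严格的few-shot设定。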

[CV-8] Balancing Fidelity Utility and Privacy in Synthetic Cardiac MRI Generation: A Comparative Study

【速读】:该论文旨在解决心脏磁共振成像(Cardiac MRI, CMR)中深度学习模型因数据稀缺和隐私法规限制而难以训练的问题。其解决方案的关键在于采用一种两阶段生成式建模框架,利用解剖掩码(anatomical masks)作为条件来引导图像合成,并系统性地评估三种生成架构——去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM)、潜在扩散模型(Latent Diffusion Models, LDM)和流匹配(Flow Matching, FM)——在保真度、下游任务实用性及隐私保护三个维度上的表现。研究发现,DDPM在有限数据条件下最有效地平衡了分割任务性能、图像质量与隐私保护,而FM则展现出更强的隐私特性但任务性能略低,从而为医学影像合成数据增强提供了可量化的权衡框架。

链接: https://arxiv.org/abs/2603.04340
作者: Madhura Edirisooriya,Dasuni Kawya,Ishan Kumarasinghe,Isuri Devindi,Mary M. Maleckar,Roshan Ragel,Isuru Nawinne,Vajira Thambawita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 4 figures, Preprint

点击查看摘要

Abstract:Deep learning in cardiac MRI (CMR) is fundamentally constrained by both data scarcity and privacy regulations. This study systematically benchmarks three generative architectures: Denoising Diffusion Probabilistic Models (DDPM), Latent Diffusion Models (LDM), and Flow Matching (FM) for synthetic CMR generation. Utilizing a two-stage pipeline where anatomical masks condition image synthesis, we evaluate generated data across three critical axes: fidelity, utility, and privacy. Our results show that diffusion-based models, particularly DDPM, provide the most effective balance between downstream segmentation utility, image fidelity, and privacy preservation under limited-data conditions, while FM demonstrates promising privacy characteristics with slightly lower task-level performance. These findings quantify the trade-offs between cross-domain generalization and patient confidentiality, establishing a framework for safe and effective synthetic data augmentation in medical imaging.
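论文对比的DDPM前向加噪存在闭式形式 x_t = √ᾱ_t·x_0 + √(1-ᾱ_t)·ε。以下在标准线性β调度下示意该计算(图像用随机矩阵代替真实CMR切片,掩码条件部分省略):

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    # closed-form DDPM forward noising:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)
    abar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # standard linear schedule
x0 = rng.normal(size=(64, 64))           # stand-in for a mask-conditioned CMR slice
x_t, eps = ddpm_forward(x0, 999, betas, rng)
print(x_t.shape)  # → (64, 64)
```

在t接近末端时ᾱ_t趋近于0,x_t几乎是纯噪声;训练目标即从x_t与条件掩码预测ε。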

[CV-9] ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

【速读】:该论文旨在解决无3D/4D监督条件下生成物理合理且具关节结构的人体-物体交互(articulated human-object interaction, HOI)这一核心挑战。现有零样本方法虽利用视频扩散模型合成HOI,但主要局限于刚性物体操作,缺乏显式的4D几何推理能力。其解决方案的关键在于将关节式HOI合成建模为从单目视频先验出发的4D重建问题:通过将扩散模型生成的2D视频视为逆渲染任务的监督信号,恢复出几何一致且物理合理的4D场景。具体创新包括:1)基于光流的部件分割,利用光学流作为几何线索分离视频中的动态与静态区域;2)解耦重建流程——由于单目模糊性导致人体运动与物体关节联合优化不稳定,故先重建物体关节状态,再条件化地合成人体运动。此方法实现了视频生成与几何感知重建的融合,显著提升了接触准确性、穿透减少和关节保真度,在多种复杂场景(如打开冰箱、橱柜、微波炉)中优于现有方法,推动了零样本交互合成从刚性操作向关节式交互的扩展。

链接: https://arxiv.org/abs/2603.04338
作者: Zihao Huang,Tianqi Liu,Zhaoxi Chen,Shaocong Xu,Saining Zhang,Lixing Xiao,Zhiguo Cao,Wei Li,Hao Zhao,Ziwei Liu
机构: Huazhong University of Science and Technology (华中科技大学); Nanyang Technological University (南洋理工大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Zhejiang University (浙江大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.

[CV-10] Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images

【速读】:该论文旨在解决生成式 AI (Generative AI) 在自动驾驶等安全关键场景中用于合成罕见或极端环境条件(如雾、雨、雪和夜间)图像时,其生成图像的真实性难以量化评估的问题。解决方案的关键在于提出一个可扩展的评估框架,结合两种互补的自动化指标:一是基于视觉-语言模型(VLM)的裁判系统以评估感知真实性,二是基于嵌入的空间分布分析来衡量合成图像与真实恶劣天气图像的相似性。实验表明,生成式 AI 方法显著优于传统规则增强方法,在多数条件下达到甚至超过真实图像的接受率,验证了现代生成模型在构建高保真合成数据方面的潜力。

链接: https://arxiv.org/abs/2603.04325
作者: Damian J. Ruck,Paul Vautravers,Oliver Chalkley,Jake Thomas
机构: Advai Ltd(Advai有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in operational data. Generative AI offers a promising approach for producing such data through controllable image editing, but its usefulness depends on whether the resulting images are sufficiently realistic to support meaningful evaluation. We present a scalable framework for assessing the realism of synthetic image-editing methods and apply it to the task of adding environmental conditions-fog, rain, snow, and nighttime-to car-mounted camera images. Using 40 clear-day images, we compare rule-based augmentation libraries with generative AI image-editing models. Realism is evaluated using two complementary automated metrics: a vision-language model (VLM) jury for perceptual realism assessment, and embedding-based distributional analysis to measure similarity to genuine adverse-condition imagery. Generative AI methods substantially outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Performance varies across conditions: fog proves easiest to simulate, while nighttime transformations remain challenging. Notably, the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, establishing practical ceilings against which synthetic methods can be judged. By this standard, leading generative methods match or exceed real-image performance for most conditions. These results suggest that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines. Our framework therefore provides a practical approach for scalable realism evaluation, though validation against human studies remains an important direction for future work. 
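文中基于嵌入的分布分析通常采用高斯拟合下的Fréchet距离(即FID的核心计算;论文是否使用此精确形式属假设)。numpy实现示意如下:

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    # squared Frechet distance between two Gaussians fitted to embeddings:
    # d^2 = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * (C1^{1/2} C2 C1^{1/2})^{1/2})
    def sqrtm_psd(m):
        vals, vecs = np.linalg.eigh(m)           # PSD square root via eigh
        return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    s1 = sqrtm_psd(cov1)
    covmean = sqrtm_psd(s1 @ cov2 @ s1)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

mu, cov = np.zeros(4), np.eye(4)
d2 = frechet_distance(mu, cov, mu + np.array([2.0, 0.0, 0.0, 0.0]), cov)
print(round(d2, 6))  # → 4.0
```

距离越小,合成恶劣天气图像的嵌入分布与真实图像越接近。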

[CV-11] SPRINT: Semi-supervised Prototypical Representation for Few-Shot Class-Incremental Tabular Learning

【速读】:该论文旨在解决**少样本类增量学习(Few-Shot Class-Incremental Learning, FSCIL)**在表格数据(tabular data)领域中缺乏有效方法的问题。传统FSCIL方法主要面向图像领域,依赖固定缓冲区存储历史数据,而表格数据通常具有大量未标注数据、专家标注稀缺且存储成本极低的特点,现有方法未能充分利用这些特性。解决方案的关键在于提出名为SPRINT的首个专为表格分布设计的FSCIL框架,其核心创新包括:(1)采用基于置信度的伪标签策略进行混合 episodic 训练,以增强新类别的表征能力;(2)利用表格数据的低存储成本保留基础类别历史信息,从而缓解灾难性遗忘问题。实验证明,SPRINT在六个跨领域的基准上实现了77.37%的平均准确率(5-shot),显著优于现有最优增量学习基线。

链接: https://arxiv.org/abs/2603.04321
作者: Umid Suleymanov,Murat Kantarcioglu,Kevin S Chan,Michael De Lucia,Kevin Hamlen,Latifur Khan,Sharad Mehrotra,Ananthram Swami,Bhavani Thuraisingham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Real-world systems must continuously adapt to novel concepts from limited data without forgetting previously acquired knowledge. While Few-Shot Class-Incremental Learning (FSCIL) is established in computer vision, its application to tabular domains remains largely unexplored. Unlike images, tabular streams (e.g., logs, sensors) offer abundant unlabeled data, a scarcity of expert annotations and negligible storage costs, features ignored by existing vision-based methods that rely on restrictive buffers. We introduce SPRINT, the first FSCIL framework tailored for tabular distributions. SPRINT introduces a mixed episodic training strategy that leverages confidence-based pseudo-labeling to enrich novel class representations and exploits low storage costs to retain base class history. Extensive evaluation across six diverse benchmarks spanning cybersecurity, healthcare, and ecological domains, demonstrates SPRINT’s cross-domain robustness. It achieves a state-of-the-art average accuracy of 77.37% (5-shot), outperforming the strongest incremental baseline by 4.45%.
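SPRINT所用的基于置信度的伪标签策略可示意如下(阈值0.9为假设值,并非论文设定):

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    # keep only unlabeled samples whose top-class confidence clears the bar;
    # ambiguous samples are discarded rather than mislabeled
    conf = probs.max(axis=1)
    keep = conf >= threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

probs = np.array([[0.95, 0.05],    # confident -> pseudo-labeled as class 0
                  [0.55, 0.45],    # ambiguous -> discarded
                  [0.10, 0.90]])   # confident -> pseudo-labeled as class 1
idx, labels = pseudo_label(probs)
print(idx.tolist(), labels.tolist())  # → [0, 2] [0, 1]
```

这类过滤让模型只从高置信的未标注样本中吸收新类别信息,降低噪声标签对增量学习的污染。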

[CV-12] MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification

【速读】:该论文旨在解决动物重识别(Animal Re-identification, Animal ReID)在空地协同视角(Aerial-Ground ReID, AG-ReID)场景下因视角变化带来的关键挑战,尤其是模型在极端高程差异下的泛化能力不足问题。现有数据集缺乏精确的视角角度标注,难以系统分析几何变化对识别性能的影响。其解决方案的关键在于构建一个大规模、可控的合成数据集——Multi-view Oriented Observation (MOO) 数据集,包含1000头牛从128个均匀采样视角拍摄的共12.8万张标注图像,并通过该数据集量化了高程变化对模型性能的影响,发现存在一个关键高程阈值,超过该阈值后模型在未见视角上的泛化能力显著提升。此外,研究验证了合成几何先验在零样本和监督设置下向真实世界场景迁移的有效性,在四个真实世界牛只数据集上均取得性能提升,证明了合成数据可有效弥合域间差距,为跨视角动物ReID模型的发展奠定基础。

链接: https://arxiv.org/abs/2603.04314
作者: William Grolleau,Achraf Chaouch,Astrid Sabourin,Guillaume Lapouge,Catherine Achard
机构: Universite Paris-Saclay, CEA, List, F-91120 Palaiseau, France; Sorbonne University, CNRS, ISIR, 4 Place Jussieu, 75005 Paris, France
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi-view Oriented Observation (MOO) dataset, a large-scale synthetic AG-ReID dataset of 1,000 cattle individuals captured from 128 uniformly sampled viewpoints (128,000 annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real-world applications in both zero-shot and supervised settings, demonstrating performance gains across four real-world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross-view animal ReID. MOO is publicly available at this https URL.
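摘要提到128个在球面上均匀采样的视角;球面近似均匀采样的一种常见做法是Fibonacci(黄金角螺旋)格点。论文实际采样方式未公开,此处仅作示意:

```python
import numpy as np

def fibonacci_sphere(n):
    # near-uniform viewpoints on the unit sphere via the golden-angle spiral
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    theta = 2 * np.pi * i / golden
    z = 1 - (2 * i + 1) / n          # elevations from near +1 down to near -1
    r = np.sqrt(1 - z ** 2)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

views = fibonacci_sphere(128)
print(views.shape)  # → (128, 3)
```

这种格点的仰角覆盖从近乎正上方到近乎正下方,恰好便于系统分析空地视角差异对ReID的影响。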

[CV-13] CRESTomics: Analyzing Carotid Plaques in the CREST-2 Trial with a New Additive Classification Model

【速读】:该论文旨在解决颈动脉斑块(carotid plaque)的精准表征问题,以提升对颈动脉狭窄患者卒中预防的临床决策能力。其核心挑战在于从B-mode超声图像中提取可解释且高预测性能的影像组学(radiomics)特征,并将其与高风险临床结局相关联。解决方案的关键在于提出一种基于核函数的加性模型(kernel-based additive model),融合相干损失(coherence loss)与组稀疏正则化(group-sparse regularization),从而实现非线性分类的同时保留特征组的结构信息;并通过部分依赖图(partial dependence plots)可视化各特征组的加性效应,增强了模型的可解释性,最终揭示了斑块纹理与临床风险之间的强关联。

链接: https://arxiv.org/abs/2603.04309
作者: Pranav Kulkarni,Brajesh K. Lal,Georges Jreij,Sai Vallamchetla,Langford Green,Jenifer Voeks,John Huston,Lloyd Edwards,George Howard,Bradley A. Maron,Thomas G. Brott,James F. Meschia,Florence X. Doo,Heng Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 3 figures, 1 table, accepted to ISBI 2026

点击查看摘要

Abstract:Accurate characterization of carotid plaques is critical for stroke prevention in patients with carotid stenosis. We analyze 500 plaques from CREST-2, a multi-center clinical trial, to identify radiomics-based markers from B-mode ultrasound images linked with high-risk. We propose a new kernel-based additive model, combining coherence loss with group-sparse regularization for nonlinear classification. Group-wise additive effects of each feature group are visualized using partial dependence plots. Results indicate our method accurately and interpretably assesses plaques, revealing a strong association between plaque texture and clinical risk.
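论文中组稀疏正则的标准近端算子是分组软阈值:按特征组整体收缩系数块,弱组被整体压为零,从而获得可解释的组级特征选择。以下为示意(分组与λ均为假设值,核部分省略):

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    # proximal step of the group-lasso penalty: shrink each feature group's
    # coefficient block by its l2 norm, zeroing weak groups entirely
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= lam else w[g] * (1 - lam / norm)
    return out

w = np.array([3.0, 4.0, 0.3, 0.4])
result = group_soft_threshold(w, [[0, 1], [2, 3]], lam=1.0)
print(result)
```

第一组范数为5,仅被收缩;第二组范数0.5低于λ,被整体置零,对应的影像组学特征组即被剔除。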

[CV-14] Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation

【速读】:该论文旨在解决当前生成式3D虚拟形象(3D avatar)方法中存在的两大核心问题:一是文本驱动方法依赖迭代Score Distillation Sampling (SDS) 或 CLIP 优化,导致细粒度语义控制能力弱且推理速度极慢;二是图像驱动方法受限于高质量3D人脸扫描数据稀缺与获取成本高,制约了模型的泛化能力。解决方案的关键在于构建一个包含超过10万对多模态数据(细粒度文本描述、野外人脸图像、高质量光照归一化纹理UV图及3D几何形状)的大规模数据集,并提出PromptAvatar框架,该框架采用双扩散模型结构——纹理扩散模型(Texture Diffusion Model, TDM)支持文本和/或图像条件引导,几何扩散模型(Geometry Diffusion Model, GDM)由文本提示引导,通过学习从多模态提示到3D表示的直接映射关系,彻底摒弃了传统迭代优化步骤,在10秒内即可生成无阴影、高保真度的3D虚拟形象,显著提升了生成质量、细节一致性与计算效率。

链接: https://arxiv.org/abs/2603.04307
作者: Hong Li,Yutang Feng,Minqi Meng,Yichen Yang,Xuhui Liu,Baochang Zhang
机构: Beihang University (北京航空航天大学); KAUST (沙特阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 10 figures

点击查看摘要

Abstract:Generating high-fidelity 3D avatars from text or image prompts is highly sought after in virtual reality and human-computer interaction. However, existing text-driven methods often rely on iterative Score Distillation Sampling (SDS) or CLIP optimization, which struggle with fine-grained semantic control and suffer from excessively slow inference. Meanwhile, image-driven approaches are severely bottlenecked by the scarcity and high acquisition cost of high-quality 3D facial scans, limiting model generalization. To address these challenges, we first construct a novel, large-scale dataset comprising over 100,000 pairs across four modalities: fine-grained textual descriptions, in-the-wild face images, high-quality light-normalized texture UV maps, and 3D geometric shapes. Leveraging this comprehensive dataset, we propose PromptAvatar, a framework featuring dual diffusion models. Specifically, it integrates a Texture Diffusion Model (TDM) that supports flexible multi-condition guidance from text and/or image prompts, alongside a Geometry Diffusion Model (GDM) guided by text prompts. By learning the direct mapping from multi-modal prompts to 3D representations, PromptAvatar eliminates the need for time-consuming iterative optimization, successfully generating high-fidelity, shading-free 3D avatars in under 10 seconds. Extensive quantitative and qualitative experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in generation quality, fine-grained detail alignment, and computational efficiency.
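
论文摘要未公开 TDM 的多条件采样细节;若按社区常见的 classifier-free guidance 推广来理解“文本和/或图像条件引导”,其组合方式大致如下(函数名与权重均为假设,并非 PromptAvatar 的官方公式,仅作概念示意):

```python
import numpy as np

def combine_guidance(eps_uncond, eps_text, eps_image, w_text=3.0, w_image=1.5):
    """在无条件噪声预测上叠加各条件的修正方向(常见的多条件 CFG 推广)。"""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_image * (eps_image - eps_uncond))

# 玩具数值:无条件预测为 0,文本/图像条件预测分别为 1 和 2
eps_u = np.zeros(4)
eps_t = np.ones(4)
eps_i = np.full(4, 2.0)
eps = combine_guidance(eps_u, eps_t, eps_i)
```

当某一条件缺失时,可将对应权重置零,即退化为单条件引导,这与“支持文本和/或图像提示”的灵活性相吻合。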

[CV-15] Motion Manipulation via Unsupervised Keypoint Positioning in Face Animation

【速读】:该论文旨在解决当前基于无监督关键点定位的面部动画方法在可控性方面的局限性,即现有关键点分解流程无法充分解耦身份语义与交织的运动信息(如旋转、平移和表情)。其解决方案的关键在于提出一种名为MMFA(Motion Manipulation via unsupervised keypoint positioning in Face Animation)的新方法:首先引入自监督表示学习,在潜在特征空间中编码并解码表情信息,从而将其与其他运动信息分离;其次设计了一种新的关键点计算方式以实现任意运动控制;最后通过变分自编码器将表情特征映射到连续的高斯分布,首次在无监督框架下实现了面部表情的插值。

链接: https://arxiv.org/abs/2603.04302
作者: Hong Li,Boyu Liu,Xuhui Liu,Baochang Zhang
机构: BUAA (北京航空航天大学); KAUST (沙特阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 15 figures

点击查看摘要

Abstract:Face animation deals with controlling and generating facial features with a wide range of applications. The methods based on unsupervised keypoint positioning can produce realistic and detailed virtual portraits. However, they cannot achieve controllable face generation since the existing keypoint decomposition pipelines fail to fully decouple identity semantics and intertwined motion information (e.g., rotation, translation, and expression). To address these issues, we present a new method, Motion Manipulation via unsupervised keypoint positioning in Face Animation (MMFA). We first introduce self-supervised representation learning to encode and decode expressions in the latent feature space and decouple them from other motion information. Secondly, we propose a new way to compute keypoints aiming to achieve arbitrary motion control. Moreover, we design a variational autoencoder to map expression features to a continuous Gaussian distribution, allowing us for the first time to interpolate facial expressions in an unsupervised framework. We have conducted extensive experiments on publicly available datasets to validate the effectiveness of MMFA, which show that MMFA offers pronounced advantages over prior arts in creating realistic animation and manipulating face motion.
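
MMFA 将表情特征映射到连续高斯分布后即可做插值;下面用随机线性层充当“编码器”,只演示重参数化与潜空间线性插值这两个通用步骤(网络结构与维度均为示意,并非论文的具体模型):

```python
import numpy as np

rng = np.random.default_rng(0)
W_mu, W_logvar = rng.normal(size=(8, 16)), rng.normal(size=(8, 16)) * 0.1

def encode(x):
    # 玩具“编码器”:输出高斯分布的均值与对数方差(真实方法中为神经网络)
    return x @ W_mu.T, x @ W_logvar.T

def reparameterize(mu, logvar, rng):
    # 重参数化采样:z = mu + sigma * eps,使采样过程可反向传播
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

x_a, x_b = rng.normal(size=16), rng.normal(size=16)
mu_a, _ = encode(x_a)
mu_b, _ = encode(x_b)

# 在连续潜空间中对两种“表情”做线性插值,得到过渡路径
alphas = np.linspace(0, 1, 5)
path = np.stack([(1 - a) * mu_a + a * mu_b for a in alphas])
```

正因为潜空间服从连续高斯分布,路径上的每个点都对应一个合法的表情编码,这是无监督插值得以成立的前提。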

[CV-16] CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video CVPR2026

【速读】:该论文旨在解决从视角输入生成高质量360°全景视频的难题,尤其针对虚拟现实(VR)应用中对高分辨率视频的需求。现有方法受限于普通扩散模型的计算能力,仅支持≤1K分辨率的原生生成,并依赖次优的后处理超分辨率技术提升分辨率。解决方案的关键在于提出CubeComposer,一种新颖的时空自回归扩散模型,能够原生生成4K分辨率的360°视频;其核心创新包括:(1) 通过六面立方图(cubemap)分解视频并设计时空自回归策略,实现跨立方面与时间窗口的连贯合成;(2) 设计配备稀疏上下文注意力的立方面上下文管理机制,以提升效率;(3) 采用面向连续性的技术(如立方体感知的位置编码、填充和融合)消除边界接缝,从而在保证视觉质量的同时显著降低内存消耗,支持实际VR应用场景。

链接: https://arxiv.org/abs/2603.04291
作者: Lingen Li,Guangzhi Wang,Xiaoyu Li,Zhaoyang Zhang,Qi Dou,Jinwei Gu,Tianfan Xue,Ying Shan
机构: The Chinese University of Hong Kong; ARC Lab, Tencent PCG
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting \leq 1K resolution native generation and relying on suboptimal post super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: this https URL
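
摘要中“按良好规划的时空顺序自回归合成”可以用一个玩具排程器示意:逐时间窗口、逐立方面生成,并为每一步记录可用上下文(面的排列顺序与上下文规则均为示意性假设,并非 CubeComposer 的官方排程):

```python
def plan_schedule(num_windows, faces=("front", "right", "back", "left", "up", "down")):
    """示意性的时空自回归排程:每一步的上下文 =
    同一时间窗口内已生成的面(空间)+ 上一窗口的同名面(时间)。"""
    schedule = []
    for t in range(num_windows):
        for i, face in enumerate(faces):
            context = [(t, f) for f in faces[:i]]     # 空间上下文
            if t > 0:
                context.append((t - 1, face))         # 时间上下文
            schedule.append(((t, face), context))
    return schedule

sched = plan_schedule(3)   # 3 个时间窗口 × 6 个立方面,共 18 步
```

这种“一次只合成一个面、一个窗口”的拆分正是摘要所述降低显存需求、支撑 4K 原生生成的出发点。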

[CV-17] Gaussian Wardrobe: Compositional 3D Gaussian Avatars for Free-Form Virtual Try-On

【速读】:该论文旨在解决现有3D神经化身(Neural Avatar)方法将人体与衣物视为不可分割整体所带来的局限性,即难以捕捉复杂自由形态服装的动力学特性,并限制了服装在不同个体间的复用。其解决方案的关键在于提出了一种分层的、可组合的3D高斯表示框架——Gaussian Wardrobe,通过将神经化身解耦为身体和多层形状无关的神经服装(shape-agnostic neural garments),并从多视角视频中学习每层服装的独立表示,将其归一化到与形状无关的空间,从而实现高保真度的动态建模与灵活的虚拟试穿应用。

链接: https://arxiv.org/abs/2603.04290
作者: Zhiyi Chen,Hsuan-I Ho,Tianjian Jiang,Jie Song,Manuel Kaufmann,Chen Guo
机构: ETH Zürich (苏黎世联邦理工学院); HKUST(GZ) (香港科技大学广州校区); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 3DV 2026, 16 pages, 12 figures

点击查看摘要

Abstract:We introduce Gaussian Wardrobe, a novel framework to digitalize compositional 3D neural avatars from multi-view videos. Existing methods for 3D neural avatars typically treat the human body and clothing as an inseparable entity. However, this paradigm fails to capture the dynamics of complex free-form garments and limits the reuse of clothing across different individuals. To overcome these problems, we develop a novel, compositional 3D Gaussian representation to build avatars from multiple layers of free-form garments. The core of our method is decomposing neural avatars into bodies and layers of shape-agnostic neural garments. To achieve this, our framework learns to disentangle each garment layer from multi-view videos and canonicalizes it into a shape-independent space. In experiments, our method models photorealistic avatars with high-fidelity dynamics, achieving new state-of-the-art performance on novel pose synthesis benchmarks. In addition, we demonstrate that the learned compositional garments contribute to a versatile digital wardrobe, enabling a practical virtual try-on application where clothing can be freely transferred to new subjects. Project page: this https URL

[CV-18] A multi-center analysis of deep learning methods for video polyp detection and segmentation

【速读】:该论文旨在解决结肠息肉(colonic polyps)在结肠镜检查中因形态、位置和大小差异导致的检测与切除困难问题,从而降低漏诊率和不完全切除率,提升结直肠癌(colorectal cancer, CRC)的预防效果。其解决方案的关键在于利用深度学习技术结合序列数据和时间信息,通过捕捉息肉生长的动态变化及帧间时序关系,显著提高自动化检测与分割方法的精度,进而增强实时临床应用中的诊断准确性。

链接: https://arxiv.org/abs/2603.04288
作者: Noha Ghatwary,Pedro Chavarias Solano,Mohamed Ramzy Ibrahim,Adrian Krenzer,Frank Puppe,Stefano Realdon,Renato Cannizzaro,Jiacheng Wang,Liansheng Wang,Thuy Nuong Tran,Lena Maier-Hein,Amine Yamlahi,Patrick Godau,Quan He,Qiming Wan,Mariia Kokshaikyna,Mariia Dobko,Haili Ye,Heng Li,Ragu B,Antony Raj,Hanaa Nagdy,Osama E Salem,James E. East,Dominique Lamarque,Thomas de Lange,Sharib Ali
机构: Arab Academy for Science and Technology (阿拉伯科技工程大学); University of Leeds (利兹大学); University of Würzburg (维尔茨堡大学); CRO Centro Riferimento Oncologico IRCCS (意大利肿瘤研究中心); Xiamen University (厦门大学); German Cancer Research Center (DKFZ) (德国癌症研究中心); Hangzhou Hikvision Digital Technology Co.,ltd (杭州海康威视数字技术有限公司); Ukrainian Catholic University (乌克兰天主教大学); Southern University of Science and Technology (南方科技大学); Healthcare Technology Innovation Centre (医疗技术创新中心); Arab Academy for Science and Technology (阿拉伯科技工程大学); University of Alexandria (亚历山大大学); University of Oxford (牛津大学); Université de Versailles St-Quentin en Yvelines (凡尔赛-圣康坦-昂伊夫林大学); Sahlgrenska Academy, University of Gothenburg (哥德堡大学萨尔格伦学院); Sahlgrenska University Hospital-Mölndal (哥德堡大学医院莫尔达尔分院); University of Gothenburg (哥德堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Colonic polyps are well-recognized precursors to colorectal cancer (CRC), typically detected during colonoscopy. However, the variability in appearance, location, and size of these polyps complicates their detection and removal, leading to challenges in effective surveillance, intervention, and subsequently CRC prevention. The processes of colonoscopy surveillance and polyp removal are highly reliant on the expertise of gastroenterologists and occur within the complexities of the colonic structure. As a result, there is a high rate of missed detections and incomplete removal of colonic polyps, which can adversely impact patient outcomes. Recently, automated methods that use machine learning have been developed to enhance polyp detection and segmentation, thus supporting clinical processes and reducing miss rates. These advancements highlight the potential for improving diagnostic accuracy in real-time applications, which ultimately facilitates more effective patient management. Furthermore, integrating sequence data and temporal information could significantly enhance the precision of these methods by capturing the dynamic nature of polyp growth and the changes that occur over time. To rigorously investigate these challenges, data scientists and expert gastroenterologists collaborated to compile a comprehensive dataset that spans multiple centers and diverse populations. This initiative aims to underscore the critical importance of incorporating sequence data and temporal information in the development of robust automated detection and segmentation methods. This study evaluates the applicability of deep learning techniques developed in real-time clinical colonoscopy tasks using sequence data, highlighting the critical role of temporal relationships between frames in improving diagnostic precision.

[CV-19] SSR: A Generic Framework for Text-Aided Map Compression for Localization

【速读】:该论文旨在解决机器人在复杂环境中依赖大规模地图进行定位与决策时所面临的存储和传输成本过高的问题,尤其是在冷存储、网络传输及云端定位查询场景下,内存和带宽开销难以承受。解决方案的关键在于提出一种文本增强的压缩框架,将文本作为替代模态进行无损压缩,并结合轻量级文本描述与极小的图像特征向量,构建出具有“互补信息”的紧凑表示形式;其中,核心创新是提出的相似性空间复制(Similarity Space Replication, SSR)技术,能够在单次训练中学习到仅包含与文本描述互补信息的自适应图像嵌入,从而实现更高效的压缩比,实验表明其在多个下游定位任务中相比现有基线方法压缩效率提升2倍。

链接: https://arxiv.org/abs/2603.04272
作者: Mohammad Omama,Po-han Li,Harsh Goel,Minkyu Choi,Behdad Chalaki,Vaishnav Tadiparthi,Hossein Nourkhiz Mahjoub,Ehsan Moradi Pari,Sandeep P. Chinchali
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校); Honda Research Institute (本田研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mapping is crucial in robotics for localization and downstream decision-making. As robots are deployed in ever-broader settings, the maps they rely on continue to increase in size. However, storing these maps indefinitely (cold storage), transferring them across networks, or sending localization queries to cloud-hosted maps imposes prohibitive memory and bandwidth costs. We propose a text-enhanced compression framework that reduces both memory and bandwidth footprints while retaining high-fidelity localization. The key idea is to treat text as an alternative modality: one that can be losslessly compressed with large language models. We propose leveraging lightweight text descriptions combined with very small image feature vectors, which capture “complementary information” as a compact representation for the mapping task. Building on this, our novel technique, Similarity Space Replication (SSR), learns an adaptive image embedding in one shot that captures only the information “complementary” to the text descriptions. We validate our compression framework on multiple downstream localization tasks, including Visual Place Recognition as well as object-centric Monte Carlo localization in both indoor and outdoor settings. SSR achieves 2 times better compression than competing baselines on state-of-the-art datasets, including TokyoVal, Pittsburgh30k, Replica, and KITTI.
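
SSR 学习的是“仅与文本互补”的小型图像嵌入;一个粗略的几何化示意是:去掉图像嵌入中可由文本方向解释的分量,只保留正交残差并截断为少量维度(这只是对“互补信息”思想的玩具化诠释,并非论文的一次性训练过程):

```python
import numpy as np

rng = np.random.default_rng(0)
img_emb = rng.normal(size=64)   # 假想的图像特征向量
txt_emb = rng.normal(size=64)   # 假想的(已投影到同一空间的)文本嵌入

t = txt_emb / np.linalg.norm(txt_emb)
residual = img_emb - (img_emb @ t) * t      # 与文本方向正交的“互补”分量
top = np.argsort(-np.abs(residual))[:8]     # 只保留 8 个最大分量,大幅压缩
compact = residual[top]
```

文本部分本身还可以交给大语言模型做无损压缩,这正是摘要把文本视作“可无损压缩的替代模态”的动机。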

[CV-20] ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos CVPR2026

【速读】:该论文旨在解决程序性规划(procedural planning)中现有方法依赖大规模模型隐式学习程序结构所导致的样本效率低和计算成本高的问题。其核心解决方案是提出ViterbiPlanNet框架,关键创新在于引入可微分维特比层(Differentiable Viterbi Layer, DVL),将程序知识图谱(Procedural Knowledge Graph, PKG)与维特比解码算法显式结合,通过平滑松弛替代非可微操作,实现端到端优化。该设计使模型能够基于图结构进行解码学习,显著提升样本效率和对未见短时 horizon 的鲁棒性,同时在多个基准任务上以更少参数达到最优性能。

链接: https://arxiv.org/abs/2603.04265
作者: Luigi Seminara,Davide Moltisanti,Antonino Furnari
机构: University of Catania (卡塔尼亚大学); University of Bath (巴斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample-efficiency and high computational cost. In this work we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly with the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.
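
DVL 的关键是把 Viterbi 递推中不可微的 max 换成平滑松弛;一种常见做法是温度化 log-sum-exp,温度 τ→0 时退化为硬 Viterbi(下面的 numpy 实现只示意这一松弛思想,未必与论文采用的具体松弛形式一致):

```python
import numpy as np

def smooth_max(x, tau, axis=0):
    # tau * logsumexp(x / tau):max 的可微上界,tau -> 0 时逼近 max
    m = np.max(x, axis=axis, keepdims=True)
    return (m + tau * np.log(np.sum(np.exp((x - m) / tau),
                                    axis=axis, keepdims=True))).squeeze(axis)

def viterbi_score(log_emit, log_trans, tau=0.0):
    """前向递推求最优路径得分;tau > 0 时使用平滑松弛。"""
    T, S = log_emit.shape
    alpha = log_emit[0].copy()
    for t in range(1, T):
        scores = alpha[:, None] + log_trans      # (上一状态, 当前状态)
        if tau > 0:
            alpha = smooth_max(scores, tau) + log_emit[t]
        else:
            alpha = scores.max(axis=0) + log_emit[t]
    return smooth_max(alpha, tau) if tau > 0 else alpha.max()

rng = np.random.default_rng(0)
log_emit = rng.normal(size=(6, 4))    # 6 步规划、4 个候选动作的打分
log_trans = rng.normal(size=(4, 4))   # 可由程序知识图谱(PKG)给出的转移打分
hard = viterbi_score(log_emit, log_trans, tau=0.0)
soft = viterbi_score(log_emit, log_trans, tau=0.05)
```

平滑得分始终是硬得分的上界,且随 τ 减小逐渐收紧,这使得图结构解码可以嵌入端到端训练。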

[CV-21] A Hypertoroidal Covering for Perfect Color Equivariance

【速读】:该论文旨在解决传统神经网络在输入图像颜色分布发生变化时性能显著下降的问题,尤其是在面对色相(hue)、饱和度(saturation)和亮度(luminance)等颜色属性变化时鲁棒性不足的缺陷。现有方法虽尝试引入颜色几何先验知识构建颜色等变架构(color equivariant architectures),但将饱和度与亮度这类区间值量近似为一维平移(1D translations)会引入可观测的伪影(artifacts)。其解决方案的关键在于:不再将区间值映射到实数线,而是将其提升(lifting)至圆周上(通过双重覆盖,即double-cover),并在该流形空间中构建真正的等变表示。这一方法从根本上消除了先前近似带来的误差,提升了模型的可解释性和泛化能力,并在细粒度分类与医学影像任务中优于传统及现有等变基线模型。此外,该提升策略还可扩展至尺度等几何变换,展现出更广泛的适用性。

链接: https://arxiv.org/abs/2603.04256
作者: Yulong Yang,Zhikun Xu,Yaojun Li,Christine Allen-Blanchette
机构: Princeton University (普林斯顿大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:When the color distribution of input images changes at inference, the performance of conventional neural network architectures drops considerably. A few researchers have begun to incorporate prior knowledge of color geometry in neural network design. These color equivariant architectures have modeled hue variation with 2D rotations, and saturation and luminance transformations as 1D translations. While this approach improves neural network robustness to color variations in a number of contexts, we find that approximating saturation and luminance (interval valued quantities) as 1D translations introduces appreciable artifacts. In this paper, we introduce a color equivariant architecture that is truly equivariant. Instead of approximating the interval with the real line, we lift values on the interval to values on the circle (a double-cover) and build equivariant representations there. Our approach resolves the approximation artifacts of previous methods, improves interpretability and generalizability, and achieves better predictive performance than conventional and equivariant baselines on tasks such as fine-grained classification and medical imaging tasks. Going beyond the context of color, we show that our proposed lifting can also extend to geometric transformations such as scale.
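
“把区间值提升到圆周(双重覆盖)”可以用一对映射具体化:lift 把区间值送上圆周角度,fold 把圆周 2 对 1 地折回区间;在圆周上平移后再折回,结果始终落在区间内且在端点处连续(下面的参数化方式为示意性选择,论文中的具体提升未必相同):

```python
import numpy as np

def lift(s):
    # 将区间值 s ∈ [0,1] 提升到圆周角 θ ∈ [0,π](双重覆盖的一支)
    return np.arccos(1 - 2 * s)

def fold(theta):
    # 圆周 → 区间的 2 对 1 折叠映射;θ 与 -θ 折回同一区间值
    return (1 - np.cos(theta)) / 2

s = np.linspace(0, 1, 11)
roundtrip = fold(lift(s))          # 往返一致:折回后恢复原值
shifted = fold(lift(s) + 0.7)      # 圆周上的平移 ≙ 区间上的等变变换,永不越界
```

相比把饱和度、亮度近似为实数轴上的 1D 平移,这种圆周提升天然避免了端点处的截断伪影。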

[CV-22] EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding CVPR2026

【速读】:该论文旨在解决开放词汇场景理解(open-vocabulary scene understanding)中,如何在在线、近实时条件下实现3D场景重建与语义理解的联合优化问题。传统方法通常受限于离线或单场景优化设置,难以适应动态环境下的持续感知需求。其解决方案的关键在于提出EmbodiedSplat,一种基于3D高斯表示(3D Gaussian Splatting, 3DGS)的在线前馈架构:通过引入在线稀疏系数场(Online Sparse Coefficients Field)CLIP全局码本(CLIP Global Codebook),将2D CLIP嵌入绑定至每个3D高斯点,在最小化内存消耗的同时保留CLIP的全语义泛化能力;同时,利用3D U-Net聚合部分点云生成几何感知的CLIP特征,弥补纯2D语言嵌入对3D几何先验的缺失,从而实现高效且通用的在线3D语义重建。

链接: https://arxiv.org/abs/2603.04254
作者: Seungjun Lee,Zihan Wang,Yunsong Wang,Gim Hee Lee
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project Page: this https URL

点击查看摘要

Abstract:Understanding a 3D scene immediately with its exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D scene in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from the streaming images. Unlike existing open-vocabulary 3DGS methods which are typically restricted to either offline or per-scene optimization setting, our objectives are two-fold: 1) Reconstructs the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner. 2) Highly generalizable to novel scenes with feed-forward design and supports nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook where it binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometric-aware CLIP features by aggregating the partial point cloud of 3DGS through 3D U-Net to compensate the 3D geometric prior to 2D-oriented language embeddings. Extensive experiments on diverse indoor datasets, including ScanNet, ScanNet++, and Replica, demonstrate both the effectiveness and efficiency of our method. Check out our project page in this https URL.
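
“CLIP 全局码本 + 稀疏系数场”的存储思路可以这样示意:每个 3D 高斯不保存整条 CLIP 向量,只保存少量码本索引与系数,查询时再线性组合还原(码本大小与稀疏度均为示意性数字,并非论文配置):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, k = 512, 64, 4                    # CLIP 维度、码本大小、每点非零系数个数
codebook = rng.normal(size=(K, D))      # 全场景共享的 CLIP 全局码本

# 每个高斯点只存 k 个 (索引, 系数) 对,而不是 512 维稠密向量
idx = rng.integers(0, K, size=k)
coef = rng.normal(size=k)

recon = coef @ codebook[idx]            # 查询时线性组合还原该点的语义嵌入

dense_floats = D                        # 稠密存储:每点 512 个数
sparse_floats = 2 * k                   # 稀疏存储:粗略按索引+系数各计一个数
```

由于码本本身保留的是完整的 CLIP 嵌入,这种压缩不牺牲开放词汇查询所需的语义泛化能力。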

[CV-23] A Unified Framework for Joint Detection of Lacunes and Enlarged Perivascular Spaces

【速读】:该论文旨在解决脑小血管病(Cerebral Small Vessel Disease, CSVD)影像分析中,扩大的周围血管间隙(Enlarged Perivascular Spaces, EPVS)与腔隙性梗死(Lacunae)在放射学表现上高度相似所带来的挑战,尤其是标准分割网络在处理二者时面临的特征干扰和极端类别不平衡问题。其解决方案的关键在于提出一种形态解耦框架(morphology-decoupled framework),通过零初始化门控跨任务注意力机制(Zero-Initialized Gated Cross-Task Attention)利用密集的EPVS上下文信息引导稀疏的腔隙检测;同时引入混合监督策略,结合互斥约束(Mutual Exclusion)与中心线Dice损失(Centerline Dice loss)以确保生物和拓扑一致性;并设计解剖信息引导的推理校准机制(Anatomically-Informed Inference Calibration),基于组织语义动态抑制假阳性结果。

链接: https://arxiv.org/abs/2603.04243
作者: Lucas He,Krinos Li,Hanyuan Zhang,Runlong He,Silvia Ingala,Luigi Lorenzini,Marleen de Bruijne,Frederik Barkhof,Rhodri Davies,Carole Sudre
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cerebral small vessel disease (CSVD) markers, specifically enlarged perivascular spaces (EPVS) and lacunae, present a unique challenge in medical image analysis due to their radiological mimicry. Standard segmentation networks struggle with feature interference and extreme class imbalance when handling these divergent targets simultaneously. To address these issues, we propose a morphology-decoupled framework where Zero-Initialized Gated Cross-Task Attention exploits dense EPVS context to guide sparse lacune detection. Furthermore, biological and topological consistency are enforced via a mixed-supervision strategy integrating Mutual Exclusion and Centerline Dice losses. Finally, we introduce an Anatomically-Informed Inference Calibration mechanism to dynamically suppress false positives based on tissue semantics. Extensive 5-fold cross-validation on the VALDO 2021 dataset (N=40) demonstrates state-of-the-art performance, notably surpassing task winners in lacunae detection precision (71.1%, p=0.01) and F1-score (62.6%, p=0.03). Furthermore, evaluation on the external EPAD cohort (N=1762) confirms the model’s robustness for large-scale population studies. Code will be released upon acceptance.
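
摘要中的“互斥约束(Mutual Exclusion)”可以朴素地实现为惩罚两类概率在同一体素上同时偏高,例如取逐体素概率乘积的均值(是否与论文的具体损失形式一致未经核实,此处仅示意思想):

```python
import numpy as np

def mutual_exclusion_loss(p_epvs, p_lacune):
    # 两类在同一体素概率都高时乘积大、损失大;空间上互斥时损失趋于 0
    return float(np.mean(p_epvs * p_lacune))

p_a = np.array([0.9, 0.0, 0.8, 0.1])   # 假想的 EPVS 概率图(已展平)
p_b = np.array([0.0, 0.9, 0.1, 0.0])   # 假想的腔隙概率图,与 p_a 基本不重叠
disjoint = mutual_exclusion_loss(p_a, p_b)
overlap = mutual_exclusion_loss(p_a, p_a)   # 完全重叠作为对照
```

这种逐体素的软约束可与中心线 Dice 等拓扑损失直接相加,纳入混合监督目标。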

[CV-24] DeNuC: Decoupling Nuclei Detection and Classification in Histopathology

【速读】:该论文旨在解决病理基础模型(Pathology Foundation Models, FMs)在细胞核检测与分类(Nuclei Detection and Classification, NDC)任务中表现不佳的问题,特别是发现联合优化检测与分类会导致表示能力退化,且因两任务难度差异显著而使检测阶段计算负担过重。解决方案的关键在于提出DeNuC方法,通过解耦(Decoupling)细胞核检测与分类流程:首先使用轻量级模型实现高精度细胞核定位,再利用病理基础模型对检测坐标处的区域进行特征编码并查询特定细胞核特征用于分类,从而有效释放FMs在NDC任务中的表征潜力。实验表明,DeNuC在多个基准数据集上显著优于现有方法,同时参数量仅为其他方法的16%或更低。

链接: https://arxiv.org/abs/2603.04240
作者: Zijiang Yang,Chen Kuang,Dongmei Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Pathology Foundation Models (FMs) have shown strong performance across a wide range of pathology image representation and diagnostic tasks. However, FMs do not exhibit the expected performance advantage over traditional specialized models in Nuclei Detection and Classification (NDC). In this work, we reveal that jointly optimizing nuclei detection and classification leads to severe representation degradation in FMs. Moreover, we identify that the substantial intrinsic disparity in task difficulty between nuclei detection and nuclei classification renders joint NDC optimization unnecessarily computationally burdensome for the detection stage. To address these challenges, we propose DeNuC, a simple yet effective method designed to break through existing bottlenecks by Decoupling Nuclei detection and Classification. DeNuC employs a lightweight model for accurate nuclei localization, subsequently leveraging a pathology FM to encode input images and query nucleus-specific features based on the detected coordinates for classification. Extensive experiments on three widely used benchmarks demonstrate that DeNuC effectively unlocks the representational potential of FMs for NDC and significantly outperforms state-of-the-art methods. Notably, DeNuC improves F1 scores by 4.2% and 3.6% (or higher) on the BRCAM2C and PUMA datasets, respectively, while using only 16% (or fewer) trainable parameters compared to other methods. Code is available at this https URL.
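
DeNuC 第二阶段“按检测坐标查询核特定特征”可以用最近邻取样示意:把检测到的细胞核坐标映射到基础模型特征图网格上,取出对应的特征向量供分类(实际实现可能采用双线性插值等,下采样倍率等细节均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 16, 128))   # 基础模型输出的 (H, W, C) 特征图
img_size = 256                          # 原图分辨率,对应 16 倍下采样

def query_features(coords, feat, img_size):
    """按像素坐标从特征图最近邻取样,返回每个细胞核的特征向量。"""
    h, w, _ = feat.shape
    ys = np.clip((coords[:, 1] * h / img_size).astype(int), 0, h - 1)
    xs = np.clip((coords[:, 0] * w / img_size).astype(int), 0, w - 1)
    return feat[ys, xs]

coords = np.array([[8.0, 8.0], [250.0, 10.0], [128.0, 200.0]])  # 检测出的 (x, y)
nuc_feats = query_features(coords, feat, img_size)
```

定位由轻量模型完成、重编码只跑一次并按坐标索引,这正是解耦方案节省可训练参数的来源。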

[CV-25] DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers CVPR2026

【速读】:该论文旨在解决扩散Transformer(Diffusion Transformers, DiTs)在表示学习机制不明确的问题,特别是如何提升其内部表征的多样性以增强模型性能。现有方法如REPA通过引入预训练编码器进行表征对齐,但缺乏对DiTs内部表征动态演化规律的深入理解。为此,作者系统性地分析了不同设置下DiTs内部表征的演变与影响,发现块间表征多样性是有效学习的关键因素。解决方案的核心在于提出DiverseDiT框架,通过引入长残差连接以多样化各层输入表征,并设计表征多样性损失函数促使各层学习差异化特征,从而显著提升生成质量与收敛速度,在ImageNet 256x256和512x512数据集上均验证了其有效性及对多种骨干网络的普适性。

链接: https://arxiv.org/abs/2603.04239
作者: Mengping Yang,Zhiyu Tan,Binglei Li,Xiaomeng Yang,Hesen Chen,Hao Li
机构: Shanghai Academy of AI for Science (上海人工智能科学研究院); Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in CVPR 2026, GitHub Code: this https URL , Project Page: this https URL

点击查看摘要

Abstract:Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs’ capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256x256 and 512x512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance.
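
“表示多样性损失”的一种直接写法是惩罚各 block 池化表示之间的平均两两余弦相似度:表示越趋同,损失越大(是否为论文采用的具体形式未经核实,仅示意思想):

```python
import numpy as np

def diversity_loss(block_feats):
    """block_feats: (L, D),L 个 block 的池化表示;返回两两余弦相似度的均值。"""
    f = block_feats / np.linalg.norm(block_feats, axis=1, keepdims=True)
    sim = f @ f.T
    L = len(f)
    return float(np.mean(sim[~np.eye(L, dtype=bool)]))   # 仅取非对角项

rng = np.random.default_rng(0)
identical = np.tile(rng.normal(size=64), (4, 1))   # 4 个 block 输出完全相同
diverse = rng.normal(size=(4, 64))                 # 各 block 输出近似互不相关
loss_same = diversity_loss(identical)
loss_diverse = diversity_loss(diverse)
```

最小化该项即鼓励不同 block 学到彼此区分的特征,与摘要中“块间表示多样性是有效学习的关键”相呼应。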

[CV-26] Nearest-Neighbor Density Estimation for Dependency Suppression

【速读】:该论文旨在解决数据中非敏感变量与敏感变量之间存在 unwanted dependencies(非期望依赖关系)的问题,这在公平性(fairness)、鲁棒学习(robust learning)和隐私保护等领域至关重要。解决方案的关键在于提出一种基于编码器的方法,通过显式估计并调整数据分布来消除统计依赖关系,而非依赖传统的去相关(decorrelation)或对抗学习(adversarial learning)策略。具体而言,该方法结合了专用的变分自编码器(variational autoencoder)与一种由非参数最近邻密度估计驱动的新颖损失函数,从而实现对独立性的直接优化,在多个数据集上验证表明其能有效平衡信息去除与保留,性能优于现有无监督方法,甚至可媲美监督方法。

链接: https://arxiv.org/abs/2603.04224
作者: Kathleen Anderson,Thomas Martinetz
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ability to remove unwanted dependencies from data is crucial in various domains, including fairness, robust learning, and privacy protection. In this work, we propose an encoder-based approach that learns a representation independent of a sensitive variable but otherwise preserving essential data characteristics. Unlike existing methods that rely on decorrelation or adversarial learning, our approach explicitly estimates and modifies the data distribution to neutralize statistical dependencies. To achieve this, we combine a specialized variational autoencoder with a novel loss function driven by non-parametric nearest-neighbor density estimation, enabling direct optimization of independence. We evaluate our approach on multiple datasets, demonstrating that it can outperform existing unsupervised techniques and even rival supervised methods in balancing information removal and utility.
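
摘要中的“非参数最近邻密度估计”可按经典公式 p̂(x) ≈ k / (n · V_d · r_k(x)^d) 实现,其中 r_k(x) 是 x 到第 k 近邻的距离、V_d 是 d 维单位球体积(此处给出的是通用估计器的写法;它如何驱动独立性损失见论文):

```python
import math
import numpy as np

def knn_density(X, x, k=5):
    """经典 kNN 密度估计:p(x) ≈ k / (n * V_d * r_k^d)。"""
    n, d = X.shape
    r_k = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]   # 第 k 近邻距离
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # d 维单位球体积
    return k / (n * unit_ball * r_k ** d)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                  # 标准二维高斯样本
p_center = knn_density(X, np.zeros(2))         # 分布众数附近:密度高
p_tail = knn_density(X, np.array([4.0, 4.0]))  # 远离分布:密度低
```

由于估计完全由样本间距离给出、无需参数化密度模型,它可以直接嵌入损失函数来度量并抑制统计依赖。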

[CV-27] Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在数字文档基准测试中表现优异,但在真实物理场景下性能未知的问题,即“现实差距”(reality gap)的量化与诊断难题。其解决方案的关键在于构建首个全面、可控且高保真的物理重建基准——Real5-OmniDocBench,该基准对OmniDocBench v1.5的全部1,355张图像进行了五类典型物理干扰(扫描、形变、屏幕拍照、光照变化和倾斜)的一一对应重建,并通过完整的真值映射实现对性能退化因素的精细化归因分析,从而精准识别失败是由几何畸变、光学伪影还是模型局限性导致,为开发真正鲁棒的文档智能系统提供可解释的诊断工具。

链接: https://arxiv.org/abs/2603.04205
作者: Changda Zhou,Ziyue Gao,Xueqing Wang,Tingquan Gao,Cheng Cui,Jing Tang,Yi Liu
机构: PaddlePaddle Team, Baidu Inc.(百度飞桨团队); Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the ‘reality gap’ in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.

[CV-28] NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction ICLR2026

【速读】:该论文旨在解决现有像素对齐(pixel-aligned)3D重建方法中存在的两个关键问题:一是难以恢复不可见区域的点云,导致场景表示不完整;二是由于依赖每条射线的独立预测,容易在重叠区域产生冗余结构,破坏几何物理合理性。解决方案的关键在于提出NOVA3R框架,其核心创新是引入一种场景令牌(scene-token)机制,通过聚合未姿态对齐图像的信息来学习全局、视角无关的场景表示,并结合基于扩散模型(diffusion-based)的3D解码器,实现非像素对齐的完整点云重建,从而在保持高精度的同时提升几何合理性与完整性。

链接: https://arxiv.org/abs/2603.04179
作者: Weirong Chen,Chuanxia Zheng,Ganlin Zhang,Andrea Vedaldi,Daniel Cremers
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Oxford (牛津大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2026. Project Page: this https URL

点击查看摘要

Abstract:We present NOVA3R, an effective approach for non-pixel-aligned 3D reconstruction from a set of unposed images in a feed-forward manner. Unlike pixel-aligned methods that tie geometry to per-ray predictions, our formulation learns a global, view-agnostic scene representation that decouples reconstruction from pixel alignment. This addresses two key limitations in pixel-aligned 3D: (1) it recovers both visible and invisible points with a complete scene representation, and (2) it produces physically plausible geometry with fewer duplicated structures in overlapping regions. To achieve this, we introduce a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. Extensive experiments on both scene-level and object-level datasets demonstrate that NOVA3R outperforms state-of-the-art methods in terms of reconstruction accuracy and completeness.

[CV-29] PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters

【速读】:该论文旨在解决如何将已预训练的2D基础模型(foundation models)高效迁移到3D体素数据上,而无需重新训练、引入适配器(adapter)或改变网络结构的问题。传统方法在扩展至3D时往往需要大量计算资源和额外设计,限制了其灵活性与实用性。解决方案的关键在于提出PlaneCycle,一种无需训练、无参数增加的3D提升算子,通过在网络深度中循环地在HW(高度-宽度)、DW(深度-宽度)和DH(深度-高度)三个正交平面间分配空间聚合操作,实现渐进式3D融合,同时保留原始2D模型的归纳偏置(inductive bias)。这一机制使得任意2D网络均可直接用于3D任务,且在多个3D分类与分割基准上表现出接近甚至媲美全量微调3D模型的性能。

链接: https://arxiv.org/abs/2603.04165
作者: Yinghong Yu,Guangyuan Li,Jiancheng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code is available at this https URL

点击查看摘要

Abstract:Large-scale 2D foundation models exhibit strong transferable representations, yet extending them to 3D volumetric data typically requires retraining, adapters, or architectural redesign. We introduce PlaneCycle, a training-free, adapter-free operator for architecture-agnostic 2D-to-3D lifting of foundation models. PlaneCycle reuses the original pretrained 2D backbone by cyclically distributing spatial aggregation across orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. The method introduces no additional parameters and is applicable to arbitrary 2D networks. Using pretrained DINOv3 models, we evaluate PlaneCycle on six 3D classification and three 3D segmentation benchmarks. Without any training, the lifted models exhibit intrinsic 3D fusion capability and, under linear probing, outperform slice-wise 2D baselines and strong 3D counterparts, approaching the performance of fully trained models. With full fine-tuning, PlaneCycle matches standard 3D architectures, highlighting its potential as a seamless and practical 2D-to-3D lifting operator. These results demonstrate that 3D capability can be unlocked from pretrained 2D foundation models without structural modification or retraining. Code is available at this https URL.
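
PlaneCycle 的核心操作“在 HW、DW、DH 三个正交平面间循环分配 2D 空间聚合”可以用几行 numpy 示意:把 3D 体切成某一平面的 2D 切片批,逐片应用同一个 2D 算子,再还原回 3D(此处用占位算子代替预训练 2D 主干,仅演示轴重排的机制):

```python
import numpy as np

def plane_cycle(vol, op2d, cycle=("HW", "DW", "DH")):
    """依深度循环地把 3D 体 (D,H,W) 重排为某一平面的 2D 切片批,
    在每片上复用同一个 2D 算子后还原轴顺序,实现渐进式 3D 融合。"""
    axes = {"HW": (0, 1, 2), "DW": (1, 0, 2), "DH": (2, 0, 1)}
    for plane in cycle:
        perm = axes[plane]
        slices = np.transpose(vol, perm)              # (切片数, X, Y)
        out = np.stack([op2d(s) for s in slices])
        vol = np.transpose(out, np.argsort(perm))     # 逆置换还原原始轴顺序
    return vol

vol = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # 玩具体数据 (D, H, W)
identity = plane_cycle(vol, lambda s: s)                  # 恒等算子:体数据不变
blurred = plane_cycle(vol, lambda s: s * 0.5)             # 占位 2D 算子
```

由于每一步都只做轴重排,不引入任何新参数,这与摘要中“无需训练、无适配器”的定位一致。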

[CV-30] Degradation-based augmented training for robust individual animal re-identification

【速读】:该论文旨在解决野生动物重识别(Wildlife Re-identification)中因图像退化因素导致的个体识别性能下降问题。现有基于深度度量学习的模型在面对真实世界中多样化的图像质量退化(如模糊、低光照、遮挡等)时,特征判别能力显著减弱,从而限制了生态学研究的应用效果。解决方案的关键在于提出一种增强训练框架,通过在训练阶段引入人工但多样化的图像退化操作(augmented training with diverse degradations),使深度特征提取器具备更强的鲁棒性。实验证明,仅对部分个体应用此类增强训练即可提升整体重识别准确率,甚至对未参与训练的个体也有效,最高可带来8.5%的Rank-1准确率提升,且首次构建了基于人类专家标注的真实退化动物图像数据集,为后续研究提供了基准与工具支持。

链接: https://arxiv.org/abs/2603.04163
作者: Thanos Polychronou,Lukáš Adam,Viktor Penchev,Kostas Papafitsoros
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Wildlife re-identification aims to recognise individual animals by matching query images to a database of previously identified individuals, based on their fine-scale unique morphological characteristics. Current state-of-the-art models for multispecies re-identification are based on deep metric learning representing individual identities by feature vectors in an embedding space, the similarity of which forms the basis for a fast automated identity retrieval. Yet very often, the discriminative information of individual wild animals gets significantly reduced due to the presence of several degradation factors in images, leading to reduced retrieval performance and limiting the downstream ecological studies. Here, starting by showing that the extent of this performance reduction greatly varies depending on the animal species (18 wild animal datasets), we introduce an augmented training framework for deep feature extractors, where we apply artificial but diverse degradations in images in the training set. We show that applying this augmented training only to a subset of individuals, leads to an overall increased re-identification performance, under the same type of degradations, even for individuals not seen during training. The introduction of diverse degradations during training leads to a gain of up to 8.5% Rank-1 accuracy on a dataset of real-world degraded animal images, selected using human re-ID expert annotations provided here for the first time. Our work is the first to systematically study image degradation in wildlife re-identification, while introducing all the necessary benchmarks, publicly available code and data, enabling further research on this topic.
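摘要的核心操作是在训练时对图像施加「人工但多样化」的退化。下面给出一个示意性的 NumPy 增强管线草图(`box_blur`、`degrade` 等均为本文假设的演示函数,并非论文代码;真实实现通常覆盖更多退化类型):

```python
import numpy as np

def box_blur(img, k=3):
    # Naive k x k box blur with edge padding (a stand-in for real optical blur)
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + h, dx:dx + w]
    return out / (k * k)

def degrade(img, rng):
    # Sample one artificial degradation per training image, as in augmented training
    choice = rng.integers(3)
    if choice == 0:
        return box_blur(img)                                          # blur
    if choice == 1:
        return np.clip(img + rng.normal(0, 0.05, img.shape), 0, 1)    # sensor noise
    return img * 0.5                                                  # under-exposure
```

训练时对每个 batch 随机调用 `degrade`,即可让特征提取器在同类退化下学到更鲁棒的个体表征。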

[CV-31] LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis

【速读】:该论文旨在解决现有深度学习模型在局部特征建模与全局依赖关系捕捉方面的局限性,具体表现为卷积神经网络(CNN)受限于局部感受野,而Transformer在有效建模局部结构方面存在不足,且两者均面临模型复杂度高和可解释性差的问题。解决方案的关键在于提出一种基于可学习迭代收缩阈值算法(LISTA)的稀疏Transformer(LISTA-Transformer),该方法将LISTA稀疏编码机制与视觉Transformer深度融合,构建出具备自适应局部与全局特征协同机制的模型架构;同时利用连续小波变换将振动信号转换为时频图输入模型,从而提升特征提取的有效性。实验表明,该方法在CWRU数据集上故障识别率达98.5%,优于传统方法及现有基于Transformer的方法。

链接: https://arxiv.org/abs/2603.04146
作者: Shuang Liu,Lina Zhao,Tian Wang,Huaqing Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 14 figures, conference paper

点击查看摘要

Abstract:Driven by the continuous development of models such as Multi-Layer Perceptron, Convolutional Neural Network (CNN), and Transformer, deep learning has made breakthrough progress in fields such as computer vision and natural language processing, and has been successfully applied in practical scenarios such as image classification and industrial fault diagnosis. However, existing models still have certain limitations in local feature modeling and global dependency capture. Specifically, CNN is limited by local receptive fields, while Transformer has shortcomings in effectively modeling local structures, and both face challenges of high model complexity and insufficient interpretability. In response to the above issues, we propose the following innovative work: A sparse Transformer based on Learnable Iterative Shrinkage Threshold Algorithm (LISTA-Transformer) was designed, which deeply integrates LISTA sparse encoding with visual Transformer to construct a model architecture with adaptive local and global feature collaboration mechanism. This method utilizes continuous wavelet transform to convert vibration signals into time-frequency maps and inputs them into LISTA-Transformer for more effective feature extraction. On the CWRU dataset, the fault recognition rate of our method reached 98.5%, which is 3.3% higher than traditional methods and exhibits certain superiority over existing Transformer-based approaches.
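LISTA 的基本形式是把 ISTA 迭代展开为若干层、并让其中的矩阵与阈值变为可学习参数。下面是一个用固定 ISTA 初始化的 NumPy 草图(仅演示前向迭代结构,未包含论文中与 Transformer 结合的部分;`We`、`S`、`theta` 在 LISTA 中应为可训练参数):

```python
import numpy as np

def soft_threshold(x, theta):
    # The shrinkage nonlinearity at the heart of (L)ISTA
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lista_forward(y, We, S, theta, n_layers=5):
    # Unrolled iterations: z_{k+1} = soft_threshold(We @ y + S @ z_k, theta).
    # In LISTA, We, S and theta are learned; here they are fixed to ISTA values.
    z = soft_threshold(We @ y, theta)
    for _ in range(n_layers - 1):
        z = soft_threshold(We @ y + S @ z, theta)
    return z

# ISTA-derived initialization from a dictionary D and sparsity weight lam = 0.1
rng = np.random.default_rng(0)
D = rng.normal(size=(8, 16))
L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the data term
We = D.T / L
S = np.eye(16) - D.T @ D / L
theta = 0.1 / L
```

在这种初始化下,每一层等价于一步标准 ISTA,因而逐层单调降低 LASSO 目标;LISTA 的训练相当于在此基础上微调各层参数以加速收敛。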

[CV-32] HBRB-BoW: A Retrained Bag-of-Words Vocabulary for ORB-SLAM via Hierarchical BRB-KMeans

【速读】:该论文旨在解决ORB-SLAM框架中基于k-majority的二值化词袋(BoW)方法因固有精度损失而导致视觉词汇质量下降的问题,特别是在层次树结构中误差累积和传播所引发的视觉词退化现象。其解决方案的关键在于提出一种分层二值化-实值化-再二值化(HBRB-BoW)的训练算法:在层次聚类过程中引入全局实值流,使描述子信息在高层节点保持高保真度,仅在叶节点进行最终二值化,从而显著提升视觉词典的区分度与结构完整性,增强复杂环境下的场景表示能力。

链接: https://arxiv.org/abs/2603.04144
作者: Minjae Lee,Sang-Min Choi,Gun-Woo Kim,Suwon Lee
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In visual simultaneous localization and mapping (SLAM), the quality of the visual vocabulary is fundamental to the system’s ability to represent environments and recognize locations. While ORB-SLAM is a widely used framework, its binary vocabulary, trained through the k-majority-based bag-of-words (BoW) approach, suffers from inherent precision loss. The inability of conventional binary clustering to represent subtle feature distributions leads to the degradation of visual words, a problem that is compounded as errors accumulate and propagate through the hierarchical tree structure. To address these structural deficiencies, this paper proposes hierarchical binary-to-real-and-back (HBRB)-BoW, a refined hierarchical binary vocabulary training algorithm. By integrating a global real-valued flow within the hierarchical clustering process, our method preserves high-fidelity descriptor information until the final binarization at the leaf nodes. Experimental results demonstrate that the proposed approach yields a more discriminative and well-structured vocabulary than traditional methods, significantly enhancing the representational integrity of the visual dictionary in complex environments. Furthermore, replacing the default ORB-SLAM vocabulary file with our HBRB-BoW file is expected to improve performance in loop closing and relocalization tasks.
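摘要的关键是在层次聚类过程中保持实值质心,仅在叶节点做最终二值化。下面是该思想的一个简化 NumPy 草图(`hbrb_vocab` 为本文假设的演示函数,并非论文实现;真实词汇树通常更深、分支更多,且带有逆文档频率等权重):

```python
import numpy as np

def hbrb_vocab(desc, k=2, depth=3, iters=10, rng=None):
    # Simplified sketch: run hierarchical k-means on *real-valued* means of the
    # binary descriptors, and binarize only at the leaf nodes (the visual words).
    rng = rng or np.random.default_rng(0)
    words = []

    def split(data, level):
        if level == depth or len(data) <= k:
            words.append((data.mean(axis=0) >= 0.5).astype(np.uint8))
            return
        centers = data[rng.choice(len(data), k, replace=False)].astype(float)
        for _ in range(iters):  # Lloyd steps; centers stay real-valued here
            d2 = ((data[:, None, :] - centers[None]) ** 2).sum(axis=-1)
            assign = d2.argmin(axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = data[assign == j].mean(axis=0)
        for j in range(k):
            if np.any(assign == j):
                split(data[assign == j], level + 1)

    split(desc.astype(float), 0)
    return np.array(words)
```

与传统 k-majority 在每一层立即二值化相比,这里误差只在叶节点产生一次,不会沿树结构逐层累积。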

[CV-33] Mask-Guided Attention Regulation for Anatomically Consistent Counterfactual CXR Synthesis

【速读】:该论文旨在解决基于扩散模型(diffusion-based editing)生成胸部X光片(CXR)时存在的两个核心问题:一是结构漂移(structural drift),即由于注意力机制的全局传播导致非目标区域出现不稳定的解剖结构扭曲;二是病灶表达不稳定,因细微且局部的病变产生的条件信号较弱且噪声大,难以实现精准可控的病理编辑。解决方案的关键在于提出一种推理阶段(inference-time)注意力调控框架:首先通过解剖感知注意力正则化模块(anatomy-aware attention regularization module),利用器官掩码对自注意力和解剖token交叉注意力进行门控,将结构交互限制在解剖ROI内以减少意外畸变;其次引入病灶引导模块(pathology-guided module),在早期去噪阶段增强目标肺区内的病灶token交叉注意力,并基于注意力集中能量执行轻量级潜在空间修正,从而实现病灶定位与范围的可控性提升。

链接: https://arxiv.org/abs/2603.04130
作者: Zichun Zhang,Weizhi Nie,Honglin Guo,Yuting Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Counterfactual generation for chest X-rays (CXR) aims to simulate plausible pathological changes while preserving patient-specific anatomy. However, diffusion-based editing methods often suffer from structural drift, where stable anatomical semantics propagate globally through attention and distort non-target regions, and unstable pathology expression, since subtle and localized lesions induce weak and noisy conditioning signals. We present an inference-time attention regulation framework for reliable counterfactual CXR synthesis. An anatomy-aware attention regularization module gates self-attention and anatomy-token cross-attention with organ masks, confining structural interactions to anatomical ROIs and reducing unintended distortions. A pathology-guided module enhances pathology-token cross-attention within target lung regions during early denoising and performs lightweight latent corrections driven by an attention-concentration energy, enabling controllable lesion localization and extent. Extensive evaluations on CXR datasets show improved anatomical consistency and more precise, controllable pathological edits compared with standard diffusion editing, supporting localized counterfactual analysis and data augmentation for downstream tasks.

[CV-34] Crab: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

【速读】:该论文旨在解决多任务统一训练中因音频-视觉任务异质性(audio-visual task heterogeneity)导致的严重负迁移问题,即传统方法在联合训练时近55%的任务性能劣于单任务训练。其核心解决方案在于从数据和模型两个层面协同优化:一方面构建了AV-UIE v2数据集,包含约22.2万条样本、覆盖17个数据集与7类任务,并显式标注推理过程以增强跨任务关系建模;另一方面设计统一接口对齐异构任务形式,并提出交互感知LoRA(Interaction-aware LoRA, I-LoRA),通过动态路由机制显式建模任务间交互关系,从而缓解参数干扰。实验表明,该方案成功将负迁移转为正迁移,在近88%的任务上实现多任务优于单任务基准,显著提升了音频-视觉大语言模型(AV-LLM)的统一场景理解能力。

链接: https://arxiv.org/abs/2603.04128
作者: Dongnuan Cai,Henghui Du,Chang Zhou,Xi Chen,Dan Guo,Hongyuan Zhang,Xuelong Li,Di Hu
机构: 中国人民大学 (Renmin University of China)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning enables pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which lead to negative interference under joint training. To tackle this, we present Crab+, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning processes. It contains approximately 222K samples spanning 17 datasets and 7 tasks, enabling the model to capture cross-task relationships at different levels of granularity. On the model side, we design a unified interface to align heterogeneous task formulations, and propose Interaction-aware LoRA (I-LoRA), which explicitly models inter-task relationships via dynamic routing to coordinate distinct audio-visual interaction patterns, mitigating parameter interference. Extensive experiments show Crab+ covers broader tasks than existing unified models while outperforming specialized models on various benchmarks. We successfully reverse the negative transfer trend, achieving positive transfer where multi-task learning surpasses single-task baselines in nearly 88% of tasks. These results hold across diverse AV-LLM paradigms and are validated through in-depth visualization, positioning Crab+ as a robust step towards holistic audio-visual scene understanding.

[CV-35] A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

【速读】:该论文旨在解决少样本动作识别(Few-Shot Action Recognition, FS-AR)在现实世界开放场景中的局限性问题,即传统方法依赖封闭集假设,在面对未知类别时性能显著下降。为应对这一挑战,作者提出了一种基于特征残差判别器(Feature-Residual Discriminator, FR-Disc)的架构扩展方案,将先前适用于骨骼数据的开放集识别方法迁移至更复杂的时空视频数据域。其关键创新在于通过引入特征残差机制增强模型对未知类别的辨别能力,从而在不牺牲封闭集准确率的前提下,显著提升对未知动作的拒识性能,实现了少样本开放集动作识别(Few-Shot Open-Set Action Recognition, FSOS-AR)的新基准。

链接: https://arxiv.org/abs/2603.04125
作者: Stefano Berti,Giulia Pasquale,Lorenzo Natale
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-Shot Action Recognition (FS-AR) has shown promising results but is often limited by a closed-set assumption that fails in real-world open-set scenarios. While Few-Shot Open-Set (FSOS) recognition is well-established for images, its extension to spatio-temporal video data remains underexplored. To address this, we propose an architectural extension based on a Feature-Residual Discriminator (FR-Disc), adapting previous work on skeletal data to the more complex video domain. Extensive experiments on five datasets demonstrate that while common open-set techniques provide only marginal gains, our FR-Disc significantly enhances unknown rejection capabilities without compromising closed-set accuracy, setting a new state-of-the-art for FSOS-AR. The project website, code, and benchmark are available at: this https URL.

[CV-36] TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression

【速读】:该论文旨在解决超低比特率图像压缩中难以同时保持小字体场景文本可识别性与整体图像视觉质量的问题。传统基于感兴趣区域(Region-of-interest, ROI)的码率分配策略虽能优先保护文本区域,但常以牺牲全局图像保真度为代价,形成局部准确性和全局质量之间的权衡。解决方案的关键在于引入由光学字符识别(OCR)提取的辅助文本信息,并以极低开销传输至解码端,使解码器能够利用该语义引导进行重建。具体而言,提出的方法TextBoost通过三项核心设计实现:(i) 自适应过滤OCR结果并生成引导图;(ii) 通过注意力引导融合模块将引导信息与解码特征校准融合;(iii) 在文本区域施加一致性正则化损失,确保重建文本自然融入场景,从而在不干扰全局率失真优化的前提下显著提升文本可读性。

链接: https://arxiv.org/abs/2603.04115
作者: Bingxin Wang,Yuan Lan,Zhaoyi Sun,Yang Xiang,Jie Sun
机构: Huawei Hong Kong Research Center (华为香港研究中心); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-low bitrate image compression faces a critical challenge: preserving small-font scene text while maintaining overall visual quality. Region-of-interest (ROI) bit allocation can prioritize text but often degrades global fidelity, leading to a trade-off between local accuracy and overall image quality. Instead of relying on ROI coding, we incorporate auxiliary textual information extracted by OCR and transmitted with negligible overhead, enabling the decoder to leverage this semantic guidance. Our method, TextBoost, operationalizes this idea through three strategic designs: (i) adaptively filtering OCR outputs and rendering them into a guidance map; (ii) integrating this guidance with decoder features in a calibrated manner via an attention-guided fusion block; and (iii) enforcing guidance-consistent reconstruction in text regions with a regularizing loss that promotes natural blending with the scene. Extensive experiments on TextOCR and ICDAR 2015 demonstrate that TextBoost yields up to 60.6% higher text-recognition F1 at comparable Peak Signal-to-Noise Ratio (PSNR) and bits per pixel (bpp), producing sharper small-font text while preserving global image quality and effectively decoupling text enhancement from global rate-distortion optimization.
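摘要中第一步是「过滤 OCR 结果并渲染为引导图」。下面是这一步的一个极简 NumPy 草图(`guidance_map` 为本文假设的演示函数;真实方法会把渲染出的文字内容而非仅二值掩码送入解码器):

```python
import numpy as np

def guidance_map(shape, ocr_boxes, min_conf=0.5):
    # Rasterize filtered OCR detections into a binary guidance map.
    # ocr_boxes: (x0, y0, x1, y1, confidence) tuples; low-confidence hits are dropped.
    g = np.zeros(shape, dtype=np.float32)
    for x0, y0, x1, y1, conf in ocr_boxes:
        if conf >= min_conf:           # adaptive filtering step
            g[y0:y1, x0:x1] = 1.0
    return g
```

由于只需传输少量 OCR 框与字符串,这部分辅助信息的码率开销几乎可以忽略,这正是该方法能与全局率失真优化解耦的原因。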

[CV-37] Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

【速读】:该论文旨在解决多模态遥感影像中跨模态翻译任务的局限性问题,即现有方法将每对模态视为独立任务,导致计算复杂度呈二次增长且难以泛化到未见过的模态组合。其解决方案的关键在于提出Any2Any框架,该框架将任意模态到任意模态的翻译建模为对场景共享潜在表示的推理过程,通过一个统一的潜在扩散模型将异构输入投影至几何对齐的潜在空间,实现模态特异性表征学习与语义映射的解耦;同时引入轻量级目标特定残差适配器,在不增加推理复杂度的前提下校正系统性潜在差异,从而支持在稀疏但连通监督下的高效学习。

链接: https://arxiv.org/abs/2603.04114
作者: Haoyang Chen,Jing Zhang,Hebaixu Wang,Shiqin Wang,Pohsun Huang,Jiayuan Li,Haonan Guo,Di Wang,Zheng Wang,Bo Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. Such structure performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs. Code and models will be available at this https URL.

[CV-38] Understanding Sources of Demographic Predictability in Brain MRI via Disentangling Anatomy and Contrast

【速读】:该论文旨在解决医学影像中由人口统计学特征(如年龄、性别和种族)引起的模型偏见问题,尤其关注脑部磁共振成像(brain MRI)中这些特征信号的来源混淆问题。传统分析方法难以区分解剖结构差异与成像采集条件带来的对比度变化,导致偏见缓解策略无法针对根本原因进行干预。解决方案的关键在于提出一种基于解耦表示学习(disentangled representation learning)的受控框架,将脑部MRI分解为两类表征:以解剖结构为核心的关注点表示(anatomy-focused representations),其抑制了采集因素的影响;以及仅捕捉采集依赖性特征的对比度嵌入(contrast embeddings)。通过在全图像、解剖表示和对比度嵌入上分别训练预测模型,可量化结构与采集因素对人口统计学信号的相对贡献,从而实现针对性的偏见缓解策略设计。

链接: https://arxiv.org/abs/2603.04113
作者: Mehmet Yigit Avci,Akshit Achara,Andrew King,Jorge Cardoso (and for the Alzheimer’s Disease Neuroimaging Initiative)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Demographic attributes such as age, sex, and race can be predicted from medical images, raising concerns about bias in clinical AI systems. In brain MRI, this signal may arise from anatomical variation, acquisition-dependent contrast differences, or both, yet these sources remain entangled in conventional analyses. Without disentangling them, mitigation strategies risk failing to address the underlying causes. We propose a controlled framework based on disentangled representation learning, decomposing brain MRI into anatomy-focused representations that suppress acquisition influence and contrast embeddings that capture acquisition-dependent characteristics. Training predictive models for age, sex, and race on full images, anatomical representations, and contrast-only embeddings allows us to quantify the relative contributions of structure and acquisition to the demographic signal. Across three datasets and multiple MRI sequences, we find that demographic predictability is primarily rooted in anatomical variation: anatomy-focused representations largely preserve the performance of models trained on raw images. Contrast-only embeddings retain a weaker but systematic signal that is dataset-specific and does not generalise across sites. These findings suggest that effective mitigation must explicitly account for the distinct anatomical and acquisition-dependent origins of the demographic signal, ensuring that any bias reduction generalizes robustly across domains.

[CV-39] Efficient Point Cloud Processing with High-Dimensional Positional Encoding and Non-Local MLPs

【速读】:该论文旨在解决多层感知机(Multi-Layer Perceptron, MLP)在点云处理中因复杂网络结构导致的可解释性差及应用受限问题。其核心挑战在于如何在保持高效计算的同时提升模型对局部与非局部几何信息的建模能力。解决方案的关键在于提出一种两阶段抽象与精炼(Abstraction and Refinement, ABS-REF)框架,并引入高维位置编码(High-dimensional Positional Encoding, HPE)模块,显式利用点云内在的位置信息,从而增强局部特征表示能力;同时,用高效的非局部MLP替代传统耗时的局部MLP操作,实现更优的信息更新机制。基于此,作者构建了HPENets系列网络,其在多个公开数据集上的实验表明,在显著降低浮点运算量(FLOPs)的前提下,性能优于现有主流方法如PointNeXt。

链接: https://arxiv.org/abs/2603.04099
作者: Yanmei Zou,Hongshan Yu,Yaonan Wang,Zhengeng Yang,Xieyuanli Chen,Kailun Yang,Naveed Akhtar
机构: Hunan University (湖南大学); National University of Defense Technology (国防科技大学); Hunan Normal University (湖南师范大学); The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Source code is available at this https URL

点击查看摘要

Abstract:Multi-Layer Perceptron (MLP) models are the foundation of contemporary point cloud processing. However, their complex network architectures obscure the source of their strength and limit the application of these models. In this article, we develop a two-stage abstraction and refinement (ABS-REF) view for modular feature extraction in point cloud processing. This view elucidates that whereas the early models focused on ABS stages, the more recent techniques devise sophisticated REF stages to attain performance advantages. Then, we propose a High-dimensional Positional Encoding (HPE) module to explicitly utilize intrinsic positional information, extending the ``positional encoding’’ concept from Transformer literature. HPE can be readily deployed in MLP-based architectures and is compatible with transformer-based methods. Within our ABS-REF view, we rethink local aggregation in MLP-based methods and propose replacing time-consuming local MLP operations, which are used to capture local relationships among neighbors. Instead, we use non-local MLPs for efficient non-local information updates, combined with the proposed HPE for effective local information representation. We leverage our modules to develop HPENets, a suite of MLP networks that follow the ABS-REF paradigm, incorporating a scalable HPE-based REF stage. Extensive experiments on seven public datasets across four different tasks show that HPENets deliver a strong balance between efficiency and effectiveness. Notably, HPENet surpasses PointNeXt, a strong MLP-based counterpart, by 1.1% mAcc, 4.0% mIoU, 1.8% mIoU, and 0.2% Cls. mIoU, with only 50.0%, 21.5%, 23.1%, 44.4% of FLOPs on ScanObjectNN, S3DIS, ScanNet, and ShapeNetPart, respectively. Source code is available at this https URL.

[CV-40] CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

【速读】:该论文旨在解决多视角植物图像中学习鲁棒预测模型的挑战,特别是由视角冗余和视角依赖的外观变化导致的性能下降问题。其解决方案的关键在于提出一种层级感知的视觉语言框架(level-aware vision language framework),该框架基于CLIP嵌入构建单个多任务模型,联合预测植物年龄和叶数;通过将旋转视角聚合为角度不变的表示,并利用轻量级文本先验编码视角层级信息来条件化视觉特征,从而在输入不完整或无序的情况下实现稳定预测。

链接: https://arxiv.org/abs/2603.04091
作者: Simon Warmers,Muhammad Zawish,Fayaz Ali Dharejo,Steven Davy,Radu Timofte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review at IEEE Conference

点击查看摘要

Abstract:Modeling plant growth dynamics plays a central role in modern agricultural research. However, learning robust predictors from multi-view plant imagery remains challenging due to strong viewpoint redundancy and viewpoint-dependent appearance changes. We propose a level-aware vision language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors encoding viewpoint level for stable prediction under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively. The unified formulation simplifies the pipeline by replacing the conventional dual-model setup while improving robustness to missing views. The models and code are available at: this https URL
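摘要中「把旋转视角聚合为角度不变表示 + 单一多任务模型」可以用如下 NumPy 草图示意(函数名与线性回归头均为本文假设的简化,论文实际使用 CLIP 特征与文本先验条件化):

```python
import numpy as np

def aggregate_views(view_embs):
    # L2-normalize each per-view embedding, then mean-pool: the result is
    # invariant to the order (rotation angle) of the input views.
    v = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    return v.mean(axis=0)

def multitask_heads(z, w_age, w_leaf):
    # Two linear regression heads sharing one aggregated embedding,
    # replacing the conventional dual-model setup
    return float(w_age @ z), float(w_leaf @ z)
```

均值池化对视角缺失或乱序天然鲁棒:少一个视角只改变均值的统计量,不会破坏表示的结构。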

[CV-41] Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints – Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

【速读】:该论文旨在解决在极小图像补丁(40×40像素)条件下,现代深度学习架构与基础模型是否仍能学习到鲁棒且可扩展的细胞级病理图像表征这一关键问题。其解决方案的关键在于系统性地比较了任务特定架构与基础模型(如Vision Transformer)在不同数据规模下的性能表现,发现针对小尺度图像优化的任务专用模型(如CustomViT)在充足训练数据支持下,不仅准确率更高,且推理成本显著低于基础模型;同时指出高干净准确率并不等同于更强的鲁棒性,基础模型在小补丁场景中并未展现出明显优势。

链接: https://arxiv.org/abs/2603.04081
作者: Hiroki Kagiyama,Toru Nagasaka,Yukari Adachi,Takaaki Tachibana,Ryota Ito,Mitsugu Fujita,Kimihiro Yamashita,Yoshihiro Kakeji
机构: Kobe University of Medical Sciences (神户大学医学部); UMIN (日本医学研究网络); Kobe University (神户大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Background and objective: Cell-level pathological image analysis requires working with extremely small image patches (40x40 pixels), far below standard ImageNet resolutions. It remains unclear whether modern deep learning architectures and foundation models can learn robust and scalable representations under this constraint. We systematically evaluated architectural suitability and data-scale effects for small-patch cell classification. Methods: We analyzed 303 colorectal cancer specimens with CD103/CD8 immunostaining, generating 185,432 annotated cell images. Eight task-specific architectures were trained from scratch at multiple data scales (FlagLimit: 256–16,384 samples per class), and three foundation models were evaluated via linear probing and fine-tuning after resizing inputs to 224x224 pixels. Robustness to blur was assessed using pre- and post-resize Gaussian perturbations. Results: Task-specific models improved consistently with increasing data scale, whereas foundation models saturated at moderate sample sizes. A Vision Transformer optimized for small patches (CustomViT) achieved the highest accuracy, outperforming all foundation models with substantially lower inference cost. Blur robustness was comparable across architectures, with no qualitative advantage observed for foundation models. Conclusion: For cell-level classification under extreme spatial constraints, task-specific architectures are more effective and efficient than foundation models once sufficient training data are available. Higher clean accuracy does not imply superior robustness, and large pre-trained models offer limited benefit in the small-patch regime.

[CV-42] Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

【速读】:该论文旨在解决多文本编码器(multi-text encoder)扩散模型中文本引导后门攻击(text-based backdoor attacks)的效率与有效性问题。随着Stable Diffusion 3等模型引入三个独立的文本编码器,其参数量显著增加,现有针对单编码器的攻击方法难以直接适用,且缺乏对多编码器场景下后门攻击潜力的系统性分析。解决方案的关键在于提出一种名为MELT(Multi-Encoder Lightweight aTtacks)的新方法:通过仅训练低秩适配器(low-rank adapters)并冻结预训练文本编码器权重,实现对少于0.2%总编码器参数的微调即可完成有效后门攻击,从而揭示了在多编码器架构中仍存在高效且隐蔽的攻击路径,填补了当前对复杂文本编码结构安全性的研究空白。

链接: https://arxiv.org/abs/2603.04064
作者: Ziyuan Chen,Yujin Jeong,Tobias Braun,Anna Rohrbach
机构: TU Darmstadt (达姆施塔特工业大学); Hessian Center for Artificial Intelligence (hessian.AI) (黑森州人工智能中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As text-to-image diffusion models become increasingly deployed in real-world applications, concerns about backdoor attacks have gained significant attention. Prior work on text-based backdoor attacks has largely focused on diffusion models conditioned on a single lightweight text encoder. However, more recent diffusion models that incorporate multiple large-scale text encoders remain underexplored in this context. Given the substantially increased number of trainable parameters introduced by multiple text encoders, an important question is whether backdoor attacks can remain both efficient and effective in such settings. In this work, we study Stable Diffusion 3, which uses three distinct text encoders and has not yet been systematically analyzed for text-encoder-based backdoor vulnerabilities. To understand the role of text encoders in backdoor attacks, we define four categories of attack targets and identify the minimal sets of encoders required to achieve effective performance for each attack objective. Based on this, we further propose Multi-Encoder Lightweight aTtacks (MELT), which trains only low-rank adapters while keeping the pretrained text encoder weight frozen. We demonstrate that tuning fewer than 0.2% of the total encoder parameters is sufficient for successful backdoor attacks on Stable Diffusion 3, revealing previously underexplored vulnerabilities in practical attack scenarios in multi-encoder settings.
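MELT 的参数效率来自只训练低秩适配器(LoRA)而冻结编码器权重。下面用 NumPy 演示标准 LoRA 线性层的结构与可训练参数占比(示意性实现,非论文代码;`r=4`、`alpha=8` 为假设的超参数):

```python
import numpy as np

class LoRALinear:
    # Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A
    def __init__(self, W, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                  # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, d_in))     # trainable
        self.B = np.zeros((d_out, r))               # trainable; zero-init => no-op at start
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def trainable_fraction(self):
        return (self.A.size + self.B.size) / self.W.size
```

对一个 512x2048 的投影层,秩 4 的适配器只占原权重参数量的约 1%,这解释了为什么摘要中「不到 0.2% 的编码器总参数」即可植入有效后门。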

[CV-43] TumorFlow: Physics-Guided Longitudinal MRI Synthesis of Glioblastoma Growth

【速读】:该论文旨在解决胶质母细胞瘤(Glioblastoma)生长模式具有高度异质性、浸润性和患者特异性,而常规磁共振成像(MRI)难以准确反映其真实范围的问题,从而影响个体化治疗规划与随访评估的可靠性。解决方案的关键在于提出一种生物物理条件化的生成框架,该框架通过结合生成模型与肿瘤浸润图谱,并利用生物物理生长模型在时间上传播肿瘤浓度场,实现对肿瘤形态和生长过程的精细控制,同时保持患者解剖结构的一致性。此方法能够在真实患者的影像空间中合成连贯的肿瘤生长轨迹,提供超越影像显式观测范围的可解释、可控的肿瘤浸润与进展估计,且在纵向外推中实现了与生物物理模型75%的Dice重叠率和恒定25 dB的周边组织峰值信噪比(PSNR)。

链接: https://arxiv.org/abs/2603.04058
作者: Valentin Biller,Niklas Bubeck,Lucas Zimmer,Ayhan Can Erdur,Sandeep Nagar,Anke Meyer-Baese,Daniel Rückert,Benedikt Wiestler,Jonas Weidner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Glioblastoma exhibits diverse, infiltrative, and patient-specific growth patterns that are only partially visible on routine MRI, making it difficult to reliably assess true tumor extent and personalize treatment planning and follow-up. We present a biophysically-conditioned generative framework that synthesizes biologically realistic 3D brain MRI volumes from estimated, spatially continuous tumor-concentration fields. Our approach combines a generative model with tumor-infiltration maps that can be propagated through time using a biophysical growth model, enabling fine-grained control over tumor shape and growth while preserving patient anatomy. This enables us to synthesize consistent tumor growth trajectories directly in the space of real patients, providing interpretable, controllable estimation of tumor infiltration and progression beyond what is explicitly observed in imaging. We evaluate the framework on longitudinal glioblastoma cases and demonstrate that it can generate temporally coherent sequences with realistic changes in tumor appearance and surrounding tissue response. These results suggest that integrating mechanistic tumor growth priors with modern generative modeling can provide a practical tool for patient-specific progression visualization and for generating controlled synthetic data to support downstream neuro-oncology workflows. In longitudinal extrapolation, we achieve a consistent 75% Dice overlap with the biophysical model while maintaining a constant PSNR of 25 in the surrounding tissue. Our code is available at: this https URL

[CV-44] Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset Footprint-Based Ground Truth and Visual Place Recognition Benchmark

【速读】:该论文旨在解决长期水下视觉定位(long-term visual localization)在海底环境中的研究空白问题,具体包括缺乏用于基准测试的高质量数据集以及现有地面真值标注方法在复杂地形下可能导致性能高估的问题。其解决方案的关键在于:首先构建了一个多站点、跨时长(最长六年)的海底参考区域影像数据集,包含经地理配准的AUV立体图像、相机标定参数及亚分米级精度的相机位姿;其次提出了一种基于三维海床图像足迹(3D seafloor image footprint)的新型地面真值标注方法,通过匹配重叠图像足迹来确保地面真值链接反映真实的视觉内容一致性,从而更准确地评估视觉位置识别(Visual Place Recognition, VPR)性能。这一方法显著优于传统基于距离阈值的位置匹配方式,在崎岖地形中避免了对VPR召回率(Recall@K)的过度估计。

链接: https://arxiv.org/abs/2603.04056
作者: Martin Kvisvik Larsen,Oscar Pizarro
机构: Norwegian University of Science and Technology (挪威科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
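摘要中「基于海床图像足迹重叠」的地面真值构建,可以用如下极简草图示意(把三维足迹简化为轴对齐矩形,`footprint_iou`、`ground_truth_links` 均为本文假设的演示函数;论文使用真实的 3D 足迹多边形):

```python
def footprint_iou(a, b):
    # a, b: (xmin, ymin, xmax, ymax) seafloor footprints, simplified to axis-aligned boxes
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def ground_truth_links(footprints, iou_thr=0.3):
    # Link image pairs whose footprints overlap enough: the ground truth then
    # reflects shared visual content rather than mere camera-position proximity.
    links = []
    for i in range(len(footprints)):
        for j in range(i + 1, len(footprints)):
            if footprint_iou(footprints[i], footprints[j]) >= iou_thr:
                links.append((i, j))
    return links
```

相比「相机距离小于阈值即算同一地点」,足迹重叠判据在崎岖地形和高度变化下不会把视野完全不同的图像对误标为正样本。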

[CV-45] DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

【速读】:该论文旨在解决组合图像检索(Composed Image Retrieval, CIR)中因传统对比学习框架导致的相关性抑制(relevance suppression)和语义混淆(semantic confusion)问题,这些问题使得查询表示缺乏细粒度属性修改下的区分度。解决方案的关键在于提出一种名为DQE-CIR的方法,其核心包括两个创新:一是引入可学习的属性权重机制,以增强与修改文本条件相关的视觉特征,实现语言与视觉特征更精确的对齐;二是设计目标相对负采样策略,构建目标相对相似度分布,并从排除易负样本和模糊伪负样本的中段区域中选取信息量丰富的负样本,从而提升细粒度属性变化下的查询判别能力并减少语义相近但无关候选者的干扰。

链接: https://arxiv.org/abs/2603.04037
作者: Geon Park,Ji-Hoon Park,Seong-Whan Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages

点击查看摘要

Abstract:Composed image retrieval (CIR) addresses the task of retrieving a target image by jointly interpreting a reference image and a modification text that specifies the intended change. Most existing methods are still built upon contrastive learning frameworks that treat the ground truth image as the only positive instance and all remaining images as negatives. This strategy inevitably introduces relevance suppression, where semantically related yet valid images are incorrectly pushed away, and semantic confusion, where different modification intents collapse into overlapping regions of the embedding space. As a result, the learned query representations often lack discriminativeness, particularly at fine-grained attribute modifications. To overcome these limitations, we propose distinctive query embeddings through learnable attribute weights and target relative negative sampling (DQE-CIR), a method designed to learn distinctive query embeddings by explicitly modeling target relative relevance during training. DQE-CIR incorporates learnable attribute weighting to emphasize distinctive visual features conditioned on the modification text, enabling more precise feature alignment between language and vision. Furthermore, we introduce target relative negative sampling, which constructs a target relative similarity distribution and selects informative negatives from a mid-zone region that excludes both easy negatives and ambiguous false negatives. This strategy enables more reliable retrieval for fine-grained attribute changes by improving query discriminativeness and reducing confusion caused by semantically similar but irrelevant candidates.
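
上文的「目标相对负采样」(target relative negative sampling) 可以用一个极简示意说明(假设性实现,非论文原始代码;函数名与分位数阈值均为演示而设):按候选与目标图像的相似度排序,排除低分的易负样本与高分的潜在伪负样本,仅从中段区域随机采样:

```python
import numpy as np

def mid_zone_negatives(sims, low_q=0.3, high_q=0.8, k=4, seed=0):
    """从目标相对相似度分布的中段区域采样负样本索引 (示意)。

    sims: 各候选与目标图像的相似度, 越高越接近目标。
    低于 low_q 分位数的视为易负样本, 高于 high_q 分位数的
    视为潜在伪负样本, 两端均被排除。
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.quantile(sims, [low_q, high_q])
    mid = np.where((sims >= lo) & (sims <= hi))[0]
    return rng.choice(mid, size=min(k, len(mid)), replace=False)
```

实际方法中的相似度来自学习到的嵌入空间,且负样本随训练动态更新;此处仅展示「中段区域」这一采样思想。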

[CV-46] Volumetric Directional Diffusion: Anchoring Uncertainty Quantification in Anatomical Consensus for Ambiguous Medical Image Segmentation

【速读】:该论文旨在解决医学图像中3D病灶分割存在的不确定性量化不足问题,即传统确定性模型忽略偶然不确定性(aleatoric uncertainty),导致预测结果过于自信,而生成式方法(如标准扩散模型)虽能捕捉样本多样性,却常因从纯噪声中恢复复杂拓扑结构而导致结构性断裂和分布外解剖幻觉。解决方案的关键在于提出体积方向扩散(Volumetric Directional Diffusion, VDD),其通过数学上将生成轨迹锚定于一个确定性共识先验,限制生成搜索空间以迭代预测3D边界残差场,从而在不牺牲拓扑完整性的前提下精确探索专家间细微几何差异,实现高保真度与多样性的平衡。

链接: https://arxiv.org/abs/2603.04024
作者: Chao Wu,Kangxian Xie,Mingchen Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Equivocal 3D lesion segmentation exhibits high inter-observer variability. Conventional deterministic models ignore this aleatoric uncertainty, producing over-confident masks that obscure clinical risks. Conversely, while generative methods (e.g., standard diffusion) capture sample diversity, recovering complex topology from pure noise frequently leads to severe structural fractures and out-of-distribution anatomical hallucinations. To resolve this fidelity-diversity trade-off, we propose Volumetric Directional Diffusion (VDD). Unlike standard diffusion models that denoise isotropic Gaussian noise, VDD mathematically anchors the generative trajectory to a deterministic consensus prior. By restricting the generative search space to iteratively predict a 3D boundary residual field, VDD accurately explores the fine-grained geometric variations inherent in expert disagreements without risking topological collapse. Extensive validation on three multi-rater datasets (LIDC-IDRI, KiTS21, and ISBI 2015) demonstrates that VDD achieves state-of-the-art uncertainty quantification (significantly improving GED and CI) while remaining highly competitive in segmentation accuracy against deterministic upper bounds. Ultimately, VDD provides clinicians with anatomically coherent uncertainty maps, enabling safer decision-making and mitigating risks in downstream tasks (e.g., radiotherapy planning or surgical margin assessment).
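
VDD「将生成轨迹锚定于共识先验」的思想可以用一个玩具化的前向过程示意(假设性的线性噪声调度,非论文原始公式):网络要学习的只是目标掩码相对共识先验的残差场,而非从纯噪声恢复全部拓扑结构:

```python
import numpy as np

def anchored_forward(prior, target, t, T=1000, seed=0):
    """锚定于确定性共识先验的前向加噪示意:
    t=0 时恰为目标掩码, t=T 时退化为先验加纯噪声;
    生成方向上网络只需迭代预测 target - prior 的残差场。"""
    residual = target - prior
    alpha = 1.0 - t / T              # 残差保留比例 (线性调度, 仅作示意)
    sigma = (t / T) ** 0.5           # 噪声强度
    noise = np.random.default_rng(seed).standard_normal(np.shape(prior))
    return prior + alpha * residual + sigma * noise
```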

[CV-47] Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在放射学报告生成(Radiology Report Generation, R2G)任务中临床实用性不足的问题,尤其是强化学习(Reinforcement Learning, RL)在该场景下的数据效率低和优化效果不佳。其核心解决方案包括两个关键创新:一是提出基于诊断多样性的数据采样策略,显著提升数据质量以减少所需训练样本量;二是设计诊断令牌加权策略优化(Diagnostic Token-weighted Policy Optimization, DiTPO),通过将临床诊断F1分数作为奖励信号,并采用规则或梯度驱动机制对不同令牌的重要性进行建模,从而优先优化具有临床意义的关键内容,避免低频但重要的诊断信息被忽略。实验表明,该框架在多个公开数据集上实现了最先进的性能,同时大幅降低RL训练样本需求。

链接: https://arxiv.org/abs/2603.04022
作者: Zilin Lu,Ruifeng Yuan,Weiwei Cao,Wanxing Chang,Zhongyu Wei,Sinuo Wang,Yong Xia,Ling Zhang,Jianpeng Zhang
机构: DAMO Academy (达摩院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Radiologists highly desire fully automated AI for radiology report generation (R2G), yet existing approaches fall short in clinical utility. Reinforcement learning (RL) holds potential to address these shortcomings, but its adoption in this task remains underexplored. In this paper, we revisit RL in terms of data efficiency and optimization effectiveness for R2G tasks. First, we explore the impact of data quantity and quality on the performance of RL in medical contexts, revealing that data quality plays a more critical role than quantity. To this end, we propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples. Second, we observe that the majority of tokens in radiology reports are template-like and diagnostically uninformative, whereas the low frequency of clinically critical tokens heightens the risk of being overlooked during optimization. To tackle this, we introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal. Unlike standard RL approaches that treat all tokens equally, DiTPO explicitly models the varying importance of different tokens through rule- or gradient-based mechanisms to prioritize clinically relevant content. Extensive experiments on the MIMIC-CXR, IU-Xray, and CheXpert Plus datasets demonstrate that our framework achieves state-of-the-art (SOTA) performance while requiring substantially fewer training samples in RL. Notably, on MIMIC-CXR, our framework attains an F1 score of 0.516 using only 20% of the RL training samples.
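
DiTPO 的两个要素——诊断 F1 奖励与令牌级加权——可以草绘如下(假设性简化:用集合 F1 代替真实的诊断实体评估,规则式权重仅二分诊断令牌与模板令牌):

```python
def diagnostic_f1(pred_entities, ref_entities):
    """以诊断实体集合的 F1 作为奖励信号 (简化示意)。"""
    pred, ref = set(pred_entities), set(ref_entities)
    tp = len(pred & ref)
    if tp == 0 or not pred or not ref:
        return 0.0
    p, r = tp / len(pred), tp / len(ref)
    return 2 * p * r / (p + r)

def token_weighted_objective(logprobs, is_diagnostic, reward, w_hi=1.0, w_lo=0.1):
    """规则式令牌加权的策略梯度目标: 诊断关键令牌 (w_hi)
    优先于模板式令牌 (w_lo), 避免低频关键内容被平均化淹没。"""
    weights = [w_hi if d else w_lo for d in is_diagnostic]
    return reward * sum(w * lp for w, lp in zip(weights, logprobs)) / sum(weights)
```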

[CV-48] Discriminative Perception via Anchored Description for Reasoning Segmentation CVPR2026

【速读】:该论文旨在解决推理分割(Reasoning Segmentation)中因强化学习(Reinforcement Learning, RL)奖励机制仅关注最终定位结果,而无法有效区分推理过程是否始终聚焦于目标区域与背景干扰之间的局限性问题。这种缺乏判别性感知(Discriminative Perception)的机制导致模型生成冗长且发散的推理链,难以在复杂场景中准确定位目标。解决方案的关键在于提出DPAD(Discriminative Perception via Anchored Description),通过强制模型生成对所指对象的描述性caption,并利用该caption在语义层面对比目标与上下文的差异,从而显式地引导模型聚焦于目标的独特属性,提升推理链的收敛性和效率。此方法不仅优化了分割性能(如ReasonSeg上的cIoU提升3.09%),还显著缩短了推理链长度(减少约42%),同时提供可解释的推理依据。

链接: https://arxiv.org/abs/2603.04002
作者: Tao Yang,Qing Zhou,Yanliang Li,Qi Wang
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Reasoning segmentation increasingly employs reinforcement learning to generate explanatory reasoning chains that guide Multimodal Large Language Models. While these geometric rewards are primarily confined to guiding the final localization, they are incapable of discriminating whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model’s reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD to compel the model to generate a descriptive caption of the referred object, which is then used to explicitly discriminate by contrasting the caption’s semantic relevance to the referred object against the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain. The descriptive caption also serves as an interpretability rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering substantial performance gains, with the cIoU on ReasonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%. Code is available at this https URL
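
DPAD 用 caption 与目标/背景区域的语义相关性差异作为判别性信号,其核心对比可以示意为(假设各区域嵌入已由某个视觉-语言编码器给出,margin 为假设参数):

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def discriminative_reward(cap_emb, target_emb, context_embs, margin=0.0):
    """判别性感知奖励示意: caption 对目标区域的相关性
    应超过其对任一背景区域的相关性。"""
    pos = cosine(cap_emb, target_emb)
    neg = max(cosine(cap_emb, c) for c in context_embs)
    return pos - neg - margin
```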

[CV-49] Weakly Supervised Patch Annotation for Improved Screening of Diabetic Retinopathy

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)早期筛查中因病灶(lesion)标注稀疏且不完整而导致的深度学习模型性能受限问题。现有方法多依赖图像级监督或弱监督定位,难以系统性扩展未标注区域的细粒度标注,而专家手动标注又存在劳动强度大、覆盖不全等缺陷。其解决方案的关键在于提出一种两阶段框架——基于特征空间集成的相似性标注(Similarity-based Annotation via Feature-space Ensemble, SAFE),第一阶段通过双臂Patch Embedding Network学习语义结构清晰、类别判别性强的嵌入表示;第二阶段利用多个独立嵌入空间的集成策略,依据空间与语义邻近性将标签外推至未标注区域,并引入弃权机制平衡高置信度标注与噪声覆盖之间的权衡,从而在部分临床监督下实现对病灶细节的系统性扩展标注,显著提升下游任务如DR分类的性能(F1-score和AUPRC均大幅提升)。

链接: https://arxiv.org/abs/2603.03991
作者: Shramana Dey,Abhirup Banerjee,B. Uma Shankar,Ramachandran Rajalakshmi,Sushmita Mitra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diabetic Retinopathy (DR) requires timely screening to prevent irreversible vision loss. However, its early detection remains a significant challenge since often the subtle pathological manifestations (lesions) get overlooked due to insufficient annotation. Existing literature primarily focuses on image-level supervision, weakly-supervised localization, and clustering-based representation learning, which fail to systematically annotate unlabeled lesion region(s) for refining the dataset. Expert-driven lesion annotation is labor-intensive and often incomplete, limiting the performance of deep learning models. We introduce Similarity-based Annotation via Feature-space Ensemble (SAFE), a two-stage framework that unifies weak supervision, contrastive learning, and patch-wise embedding inference, to systematically expand sparse annotations in the pathology. SAFE preserves fine-grained details of the lesion(s) under partial clinical supervision. In the first stage, a dual-arm Patch Embedding Network learns semantically structured, class-discriminative embeddings from expert annotated patches. Next, an ensemble of independent embedding spaces extrapolates labels to the unannotated regions based on spatial and semantic proximity. An abstention mechanism ensures a trade-off between highly reliable annotation and noisy coverage. Experimental results demonstrate reliable separation of healthy and diseased patches, achieving up to 0.9886 accuracy. The annotation generated from SAFE substantially improves downstream tasks such as DR classification, demonstrating a substantial increase in F1-score of the diseased class and a performance gain as high as 0.545 in Area Under the Precision-Recall Curve (AUPRC). Qualitative analysis, with explainability, confirms that SAFE focuses on clinically relevant lesion patterns; and is further validated by ophthalmologists.
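
SAFE 第二阶段「多嵌入空间集成外推 + 弃权」的逻辑可以用最近邻投票草绘(假设性实现;真实方法中的嵌入空间来自第一阶段的对比学习网络):

```python
import numpy as np

def ensemble_annotate(query, spaces, labels, agree=1.0):
    """在每个嵌入空间中用最近邻外推标签, 再集成投票;
    一致率低于 agree 阈值时弃权 (返回 None), 以换取标注可靠性。"""
    votes = []
    for feats in spaces:
        d = np.linalg.norm(np.asarray(feats) - np.asarray(query), axis=1)
        votes.append(labels[int(np.argmin(d))])
    top = max(set(votes), key=votes.count)
    return top if votes.count(top) / len(votes) >= agree else None
```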

[CV-50] When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

【速读】:该论文旨在解决视觉模型在面对模糊或歧义视觉证据时如何判断是否将非人脸物体中的面状模式解释为有意义的面孔问题,即探索模型在不确定性下的行为机制。其解决方案的关键在于引入一个基于表示层的诊断框架,通过统一协议评估六种不同代表范式的模型(包括视觉-语言模型、纯视觉分类模型、通用目标检测模型和人脸检测模型)在面状幻觉(face pareidolia)图像上的检测、定位、不确定性和偏倚表现,从而揭示三种核心解释机制:视觉-语言模型表现出语义过激活现象,倾向于将模糊区域强行归类为人脸概念;纯视觉分类模型采用“不确定性即回避”策略,保持扩散且低偏倚;检测类模型则依靠保守先验抑制幻觉响应。研究发现,模型在歧义下的行为主要由表征选择决定而非阈值设定,且不确定性和偏倚可解耦——低不确定性可能意味着安全抑制(如检测模型),也可能意味着极端过度解读(如VLMs)。这表明面状幻觉提供了一种紧凑的诊断工具和用于提升视觉-语言系统语义鲁棒性的难样本负例来源。

链接: https://arxiv.org/abs/2603.03989
作者: Qianpu Chen,Derya Soydaner,Rob Saunders
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When visual evidence is ambiguous, vision models must decide whether to interpret face-like patterns as meaningful. Face pareidolia, the perception of faces in non-face objects, provides a controlled probe of this behavior. We introduce a representation-level diagnostic framework that analyzes detection, localization, uncertainty, and bias across class, difficulty, and emotion in face pareidolia images. Under a unified protocol, we evaluate six models spanning four representational regimes: vision-language models (VLMs; CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B), pure vision classification (ViT), general object detection (YOLOv8), and face detection (RetinaFace). Our analysis reveals three mechanisms of interpretation under ambiguity. VLMs exhibit semantic overactivation, systematically pulling ambiguous non-human regions toward the Human concept, with LLaVA-1.5-7B producing the strongest and most confident over-calls, especially for negative emotions. ViT instead follows an uncertainty-as-abstention strategy, remaining diffuse yet largely unbiased. Detection-based models achieve low bias through conservative priors that suppress pareidolia responses even when localization is controlled. These results show that behavior under ambiguity is governed more by representational choices than score thresholds, and that uncertainty and bias are decoupled: low uncertainty can signal either safe suppression, as in detectors, or extreme over-interpretation, as in VLMs. Pareidolia therefore provides a compact diagnostic and a source of ambiguity-aware hard negatives for probing and improving the semantic robustness of vision-language systems. Code will be released upon publication.
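
文中强调「不确定性与偏倚可解耦」,这一点可以用两个简单统计量示意(假设性定义,非论文原始度量:以熵度量不确定性,以某一概念类的概率超出均匀先验的部分度量偏倚):

```python
import numpy as np

def uncertainty_and_bias(probs, concept_idx):
    """从类别概率分布解耦两种信号:
    熵 (不确定性) 与对特定概念 (如 'Human') 的系统性倾向 (偏倚)。"""
    p = np.asarray(probs, float)
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    bias = float(p[concept_idx] - 1.0 / len(p))
    return entropy, bias
```

例如,检测类模型(保守抑制)与 VLM(语义过激活)都可能给出低熵,但二者偏倚的符号与幅度截然不同。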

[CV-51] RIVER: A Real-Time Interaction Benchmark for Video LLMs

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)普遍采用离线处理范式导致实时交互能力受限的问题。现有模型虽在单次问答任务中表现优异,但在需要持续理解、记忆和预测的在线视频交互场景中表现不佳,尤其在长时记忆保持与未来感知能力方面存在明显短板。解决方案的关键在于提出一个全新的评估基准 RIVER Bench,其包含回溯记忆(Retrospective Memory)、实时感知(Live-Perception)和主动预测(Proactive Anticipation)三类任务,模拟真实交互对话流程,并基于多样化视频数据构建了精确的实时交互格式;同时提出一种通用改进方法,显著提升模型在实时交互中的灵活性与连贯性,从而推动生成式 AI(Generative AI)在动态视频理解领域的实际应用发展。

链接: https://arxiv.org/abs/2603.03985
作者: Yansong Shi,Qingsong Zhao,Tianxiang Jiang,Xiangyu Zeng,Yi Wang,Limin Wang
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we proposed a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at this https URL.

[CV-52] GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

【速读】:该论文旨在解决遥感图像中基于推理的分割任务缺乏通用解决方案的问题,尤其针对因推理导向数据成本高昂及俯视视角等域特定挑战导致的性能瓶颈。其核心解决方案是提出GeoSeg框架,该框架无需训练即可实现零样本推理驱动的遥感分割,关键创新在于:(1) 偏差感知的坐标精修机制,用于校正系统性定位偏移;(2) 双路径提示机制,融合语义意图与细粒度空间线索以提升定位精度。

链接: https://arxiv.org/abs/2603.03983
作者: Lifan Jiang,Yuhang Pei,oxi Wu,Yan Zhao,Tianrun Wu,Shulong Yu,Lihui Zhang,Deng Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in MLLMs are reframing segmentation from fixed-category prediction to instruction-grounded localization. While reasoning based segmentation has progressed rapidly in natural scenes, remote sensing lacks a generalizable solution due to the prohibitive cost of reasoning-oriented data and domain-specific challenges like overhead viewpoints. We present GeoSeg, a zero-shot, training-free framework that bypasses the supervision bottleneck for reasoning-driven remote sensing segmentation. GeoSeg couples MLLM reasoning with precise localization via: (i) bias-aware coordinate refinement to correct systematic grounding shifts and (ii) a dual-route prompting mechanism to fuse semantic intent with fine-grained spatial cues. We also introduce GeoSeg-Bench, a diagnostic benchmark of 810 image–query pairs with hierarchical difficulty levels. Experiments show that GeoSeg consistently outperforms all baselines, with extensive ablations confirming the effectiveness and necessity of each component.

[CV-53] Phi-4-reasoning-vision-15B Technical Report

【速读】:该论文旨在解决如何构建更小、更高效的多模态推理模型,以在保持高性能的同时显著降低训练和推理阶段的计算资源消耗。其核心挑战在于平衡模型规模与多模态理解能力(如视觉、语言及科学数学推理)之间的关系。解决方案的关键在于三个维度:首先,通过系统性数据筛选、错误纠正与合成增强提升数据质量,验证了高质量数据是模型性能提升的核心驱动力;其次,采用高分辨率及动态分辨率编码器,强化感知精度以支撑高质量推理;最后,引入包含推理与非推理数据的混合训练策略,并辅以显式模式标记(mode tokens),使单一模型能够根据任务复杂度灵活切换直接回答或链式思维(chain-of-thought)推理模式,从而实现效率与能力的协同优化。

链接: https://arxiv.org/abs/2603.03975
作者: Jyoti Aneja,Michael Harrison,Neel Joshi,Tyler LaBonte,John Langford,Eduardo Salinas
机构: Microsoft Research
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimodal reasoning models and to share the result of these learnings as an open-weight model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller, open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and tokens. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation – reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, as accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.

[CV-54] Dual-Solver: A Generalized ODE Solver for Diffusion Models with Dual Prediction ICLR2026

【速读】:该论文旨在解决扩散模型(Diffusion Models)在推理阶段采样成本高的问题,即需要大量函数求值次数(NFEs)才能生成高质量图像。传统方法虽采用经典常微分方程(ODE)数值求解策略以减少NFEs,但预测类型(prediction type)和积分域(integration domain)的选择会显著影响采样行为。其解决方案的关键在于提出Dual-Solver,该方法通过可学习参数实现三重连续调节:(i) 在不同预测类型间插值,(ii) 自适应选择积分域,(iii) 调整残差项;同时保留标准的预测-校正结构并维持二阶局部精度。这些参数通过冻结的预训练分类器(如MobileNet或CLIP)定义的分类目标进行优化,在ImageNet条件生成与文本到图像生成任务中均能在低NFE(3 ≤ NFE ≤ 9)条件下提升FID和CLIP分数。

链接: https://arxiv.org/abs/2603.03973
作者: Soochul Park,Yeon Ju Lee
机构: Korea University (韩国大学); MODULABS
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Published as a conference paper at ICLR 2026. 36 pages, 18 figures

点击查看摘要

Abstract:Diffusion models achieve state-of-the-art image quality. However, sampling is costly at inference time because it requires a large number of function evaluations (NFEs). To reduce NFEs, classical ODE numerical methods have been adopted. Yet, the choice of prediction type and integration domain leads to different sampling behaviors. To address these issues, we introduce Dual-Solver, which generalizes multistep samplers through learnable parameters that continuously (i) interpolate among prediction types, (ii) select the integration domain, and (iii) adjust the residual terms. It retains the standard predictor-corrector structure while preserving second-order local accuracy. These parameters are learned via a classification-based objective using a frozen pretrained classifier (e.g., MobileNet or CLIP). For ImageNet class-conditional generation (DiT, GM-DiT) and text-to-image generation (SANA, PixArt-α), Dual-Solver improves FID and CLIP scores in the low-NFE regime (3 ≤ NFE ≤ 9) across backbones.
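
「在不同预测类型间连续插值」的含义可以用一个概念性推导说明(假设性参数化,非论文原始公式):若把模型输出解释为数据预测与噪声预测的线性组合,则任意插值系数 lam 下仍可闭式反解出干净样本 x0:

```python
def recover_x0(x_t, model_out, alpha, sigma, lam):
    """广义预测类型下反解 x0 (概念性示意):
    设 model_out = lam*x0 - (1-lam)*eps, 且 x_t = alpha*x0 + sigma*eps,
    联立消去 eps 得到下式; lam=1 退化为数据预测, lam=0 退化为噪声预测。"""
    return (sigma * model_out + (1 - lam) * x_t) / (lam * sigma + (1 - lam) * alpha)
```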

[CV-55] Scaling Dense Event-Stream Pretraining from Visual Foundation Models

【速读】:该论文旨在解决从不规则事件流(event streams)中学习通用且细粒度表征的难题,其核心挑战在于标注成本高,限制了数据集规模、语义丰富性和应用范围。为缓解这一困境,作者提出一种新颖的自监督预训练方法,通过蒸馏视觉基础模型(Visual Foundation Models, VFMs)来规模化提升事件表征能力。解决方案的关键在于引入一种结构感知蒸馏损失(structure-aware distillation loss),该损失利用VFMs提供的现成语义结构信息,扩展对齐目标的接受域并提供更强监督信号,从而在高分辨率下避免语义坍塌,优化密集事件表征,并显著提升下游任务中的泛化性、数据效率和迁移能力。

链接: https://arxiv.org/abs/2603.03969
作者: Zhiwen Chen,Junhui Hou,Zhiyu Zhu,Jinjian Wu,Guangming Shi
机构: City University of Hong Kong (香港城市大学); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach takes a great leap in downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability.
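
「对齐语义结构而非逐点特征」的结构感知蒸馏损失可草绘为对齐师生特征的成对相似度矩阵(假设性简化;真实方法还依赖 VFM 提供的图像-事件对应关系筛选高质量对齐目标):

```python
import numpy as np

def structure_distill_loss(student, teacher):
    """结构感知蒸馏示意: 匹配两组特征各自的余弦相似度矩阵,
    使监督信号覆盖样本间关系 (更大的接受域), 而非单点对齐。"""
    def affinity(f):
        f = np.asarray(f, float)
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        return f @ f.T
    return float(np.mean((affinity(student) - affinity(teacher)) ** 2))
```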

[CV-56] UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization CVPR2026

【速读】:该论文旨在解决现有图像去雨方法在真实场景中泛化能力不足的问题,即大多数方法仅针对特定类型的雨害退化(如雨丝或雨滴)进行设计,难以适应多样化的实际降雨场景。其解决方案的关键在于提出一个统一的图像去雨框架UniRain,能够同时处理白天和夜间条件下由雨丝和雨滴引起的图像退化;并通过基于智能检索增强生成(Retrieval-Augmented Generation, RAG)的数据集蒸馏管道,从多个公开数据集中筛选高质量样本用于混合训练,从而提升模型的泛化性能;此外,引入一种简单而有效的多目标重加权优化策略嵌入到非对称专家混合(Mixture-of-Experts, MoE)架构中,以确保在不同场景下保持稳定且鲁棒的性能表现。

链接: https://arxiv.org/abs/2603.03967
作者: Qianfeng Yang,Qiyuan Guan,Xiang Chen,Jiyu Jin,Guiyue Jin,Jiangxin Dong
机构: Dalian Polytechnic University (大连工业大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026; Project Page: this https URL

点击查看摘要

Abstract:Despite the significant progress made in image deraining, we note that most existing methods are often developed for only specific types of rain degradation and fail to generalize across diverse real-world rainy scenes. How to effectively model different rain degradations within a universal framework is important for real-world image deraining. In this paper, we propose UniRain, an effective unified image deraining framework capable of restoring images degraded by rain streak and raindrop under both daytime and nighttime conditions. To better enhance unified model generalization, we construct an intelligent retrieval augmented generation (RAG)-based dataset distillation pipeline that selects high-quality training samples from all public deraining datasets for better mixed training. Furthermore, we incorporate a simple yet effective multi-objective reweighted optimization strategy into the asymmetric mixture-of-experts (MoE) architecture to facilitate consistent performance and improve robustness across diverse scenes. Extensive experiments show that our framework performs favorably against the state-of-the-art models on our proposed benchmarks and multiple public datasets.
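
「多目标重加权优化」的一个极简示意(假设性策略,非论文原始公式):按各场景当前损失的 softmax 分配权重,使落后场景获得更大的梯度份额,从而维持跨场景的一致性能:

```python
import math

def reweight_objectives(losses, temp=1.0):
    """按损失大小 softmax 加权的多目标组合:
    损失越高的场景 (如夜间雨滴) 权重越大。返回 (总损失, 权重列表)。"""
    exps = [math.exp(l / temp) for l in losses]
    s = sum(exps)
    weights = [e / s for e in exps]
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights
```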

[CV-57] BLOCK: An Open-Source Bi-Stage MLLM Character-to-Skin Pipeline for Minecraft

【速读】:该论文旨在解决从任意角色概念(character concept)到像素级精确的Minecraft皮肤(skin)自动生成的问题,其核心挑战在于如何保持角色视觉一致性并准确映射至Minecraft的特定纹理格式。解决方案的关键在于提出一个双阶段(bi-stage)管道BLOCK:第一阶段为3D预览合成阶段,利用大语言多模态模型(MLLM)结合精心设计的提示模板与参考图像,生成一致性的双面板(前/后视图)斜视角Minecraft风格预览;第二阶段为皮肤解码阶段,基于微调后的FLUX.2模型将预览图转化为皮肤贴图(skin atlas image)。此外,论文创新性地提出EvolveLoRA——一种渐进式LoRA课程学习策略(文本到图像 → 图像到图像 → 预览到皮肤),通过前一阶段适配器初始化后续阶段,显著提升生成稳定性与效率。

链接: https://arxiv.org/abs/2603.03964
作者: Hengquan Guo
机构: ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present BLOCK, an open-source bi-stage character-to-skin pipeline that generates pixel-perfect Minecraft skins from arbitrary character concepts. BLOCK decomposes the problem into (i) a 3D preview synthesis stage driven by a large multimodal model (MLLM) with a carefully designed prompt-and-reference template, producing a consistent dual-panel (front/back) oblique-view Minecraft-style preview; and (ii) a skin decoding stage based on a fine-tuned FLUX.2 model that translates the preview into a skin atlas image. We further propose EvolveLoRA, a progressive LoRA curriculum (text-to-image → image-to-image → preview-to-skin) that initializes each phase from the previous adapter to improve stability and efficiency. BLOCK is released with all prompt templates and fine-tuned weights to support reproducible character-to-skin generation.

[CV-58] ProFound: A moderate-sized vision foundation model for multi-task prostate imaging

【速读】:该论文旨在解决多参数磁共振成像(multi-parametric MRI, mpMRI)在前列腺癌诊断与治疗中自动化任务的瓶颈问题,即现有模型依赖大量特定任务标注数据、难以规模化部署且临床通用性受限。其解决方案的关键在于构建一个领域专用的视觉基础模型 ProFound,通过在包含5,000名患者、超过22,000个三维MRI体数据集上采用多种自监督预训练策略进行大规模无监督学习,从而获得强大的泛化能力;随后在11项下游临床任务(如肿瘤检测、Gleason分级、病灶定位等)上微调验证,结果表明ProFound在多数任务中显著优于或至少媲美现有最优专用模型,证明了该方法在减少对标注数据依赖的同时提升临床实用性。

链接: https://arxiv.org/abs/2603.03961
作者: Yipei Wang,Yinsong Xu,Weixi Yi,Shaheer Ullah Saeed,Natasha Thorley,Alexander Ng,Yukun Zhou,Wen Yan,Dean Barratt,Shonit Punwani,Veeru Kasivisvanathan,Mark Emberton,Daniel C. Alexander,Yipeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many diagnostic and therapeutic clinical tasks for prostate cancer increasingly rely on multi-parametric MRI. Automating these tasks is challenging because they necessitate expert interpretations, which are difficult to scale to capitalise on modern deep learning. Although modern automated systems achieve expert-level performance in isolated tasks, their general clinical utility remains limited by the requirement of large task-specific labelled datasets. In this paper, we present ProFound, a domain-specialised vision foundation model for volumetric prostate mpMRI. ProFound is pre-trained using several variants of self-supervised approaches on a diverse, multi-institutional collection of 5,000 patients, with a total of over 22,000 unique 3D MRI volumes (over 1,800,000 2D image slices). We conducted a systematic evaluation of ProFound across a broad spectrum of 11 downstream clinical tasks on over 3,000 independent patients, including prostate cancer detection, Gleason grading, lesion localisation, gland volume estimation, zonal and surrounding structure segmentation. Experimental results demonstrate that finetuned ProFound consistently outperforms or remains competitive with state-of-the-art specialised models and existing medical vision foundation models trained/finetuned on the same data.

[CV-59] Structural Action Transformer for 3D Dexterous Manipulation CVPR

【速读】:该论文旨在解决高自由度(High-DoF)机器人手在模仿学习中因异构体态(heterogeneous embodiments)导致的技能迁移难题,尤其是现有方法依赖二维观测和以时间为中心的动作表示时,难以捕捉三维空间关系并处理体态差异的问题。其解决方案的关键在于提出结构中心视角的结构动作Transformer(Structural Action Transformer, SAT),将每个动作片段重新建模为无序、可变长度的关节级轨迹序列,而非传统的时间序列;通过引入嵌入关节功能角色与运动学特性的体态关节码本(Embodied Joint Codebook),编码结构先验并消除歧义,并利用连续时间流匹配目标从三维点云中学习生成这些轨迹,从而实现对异构机器人的原生支持与高效跨体态技能迁移。

链接: https://arxiv.org/abs/2603.03960
作者: Xiaohan Lei,Min Wang,Bohong Weng,Wengang Zhou,Houqiang Li
机构: University of Science and Technology of China (中国科学技术大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR

点击查看摘要

Abstract:Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint’s functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.
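
将动作片段视为关节级轨迹的无序集合后,每条轨迹都可用连续时间流匹配目标监督。线性插值路径下的训练对可示意为(标准流匹配构造,非论文原始实现细节):

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """线性插值路径 x_t = (1-t)*x0 + t*x1 上的流匹配训练对:
    网络在 (x_t, t) 处回归常量目标速度场 x1 - x0。
    这里 x1 可取某一关节的轨迹, x0 为同形状噪声。"""
    x0, x1 = np.asarray(x0, float), np.asarray(x1, float)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target
```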

[CV-60] Towards Generalized Multimodal Homography Estimation

【速读】:该论文旨在解决图像同图变换矩阵(Homography)估计方法在面对未见模态(unseen modalities)时性能显著下降的问题。现有监督与无监督方法依赖于特定模态的图像对进行训练,导致泛化能力受限。其解决方案的关键在于提出一种训练数据合成方法,仅需单张输入图像即可生成具有真实偏移量(ground-truth offsets)的未对齐图像对,同时保留结构信息并引入多样的纹理和颜色;此外,设计了一种网络架构以充分利用跨尺度信息并解耦颜色特征,从而提升模型在不同域间的鲁棒性和估计精度。

链接: https://arxiv.org/abs/2603.03956
作者: Jinkun You,Jiaxin Cheng,Jie Zhang,Yicong Zhou
机构: University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance. The results also confirm the effectiveness of the proposed network.
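
「从单张图像合成带真值偏移的未对齐图像对」通常从四角点扰动出发(示意性第一步,采用常见的 4 点参数化;扰动幅度 rho 为假设参数):

```python
import numpy as np

def corner_offsets(h, w, rho=16, seed=0):
    """对图像四角施加有界随机扰动, 扰动量即 4 点参数化的真值偏移;
    由 (corners, corners + offsets) 可进一步解出单应矩阵,
    再配合纹理/颜色增广渲染出结构一致的未对齐图像对。"""
    rng = np.random.default_rng(seed)
    corners = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], float)
    offsets = rng.uniform(-rho, rho, size=(4, 2))
    return corners, corners + offsets, offsets
```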

[CV-61] RVN-Bench: A Benchmark for Reactive Visual Navigation

【速读】:该论文旨在解决室内移动机器人在复杂环境中的安全视觉导航问题,现有基准测试往往忽略碰撞风险或仅适用于室外场景,难以满足室内应用需求。解决方案的关键在于提出一个名为RVN-Bench(Reactive Visual Navigation Benchmark)的碰撞感知基准,其核心特征包括:基于Habitat 2.0仿真平台和高保真HM3D室内场景构建大规模、多样化的测试环境;定义了仅依赖视觉观测、无需先验地图即可完成序列目标位置到达且避免碰撞的任务范式;提供标准化的训练与评估工具,支持在线强化学习和离线数据生成(如包含碰撞事件的负样本轨迹图像数据集),从而有效提升策略在未见环境中的泛化能力,验证了其作为安全、鲁棒视觉导航标准基准的价值。

链接: https://arxiv.org/abs/2603.03953
作者: Jaewon Lee,Jaeseok Heo,Gunmin Lee,Howoong Jun,Jeongwoo Oh,Songhwai Oh
机构: Seoul National University (首尔国立大学); Automation and Systems Research Institute (自动化与系统研究所); Sequor Robotics Inc. (Sequor机器人公司)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Safe visual navigation is critical for indoor mobile robots operating in cluttered environments. Existing benchmarks, however, often neglect collisions or are designed for outdoor scenarios, making them unsuitable for indoor visual navigation. To address this limitation, we introduce the reactive visual navigation benchmark (RVN-Bench), a collision-aware benchmark for indoor mobile robots. In RVN-Bench, an agent must reach sequential goal positions in previously unseen environments using only visual observations and no prior map, while avoiding collisions. Built on the Habitat 2.0 simulator and leveraging high-fidelity HM3D scenes, RVN-Bench provides large-scale, diverse indoor environments, defines a collision-aware navigation task and evaluation metrics, and offers tools for standardized training and benchmarking. RVN-Bench supports both online and offline learning by offering an environment for online reinforcement learning, a trajectory image dataset generator, and tools for producing negative trajectory image datasets that capture collision events. Experiments show that policies trained on RVN-Bench generalize effectively to unseen environments, demonstrating its value as a standardized benchmark for safe and robust visual navigation. Code and additional materials are available at: this https URL.

[CV-62] Spatial Causal Prediction in Video CVPR

【速读】:该论文旨在解决当前模型在空间因果推理(spatial causal reasoning)能力上的不足,特别是其在处理未观测时空状态时的局限性,即现有方法多聚焦于可见时空理解,而忽视了对不可见过去或未来空间状态的推断能力。解决方案的关键在于提出一种新的任务范式——空间因果预测(Spatial Causal Prediction, SCP),要求模型超越直接观察进行因果推理,并构建了SCP-Bench基准,包含2,500个问答对和1,181段跨视角、场景与因果方向的视频数据,以系统评估模型的空间因果智能。通过23个前沿模型的实验,揭示了人类与模型之间显著的性能差距、有限的时间外推能力和薄弱的因果基础,进而提出了感知增强与推理引导策略,推动空间因果智能的发展。

链接: https://arxiv.org/abs/2603.03944
作者: Yanguang Zhao,Jie Yang,Shengqiong Wu,Shutong Hu,Hongbo Qiu,Yu Wang,Guijia Zhang,Tan Kai Ze,Hao Fei,Chia-Wen Lin,Mong-Li Lee,Wynne Hsu
机构: National University of Singapore (新加坡国立大学); Shenzhen University (深圳大学); Sichuan University (四川大学); National Tsing Hua University (台湾清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 21 figures, 17 tables, CVPR findings

点击查看摘要

Abstract:Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is this https URL.

[CV-63] Slice-wise quality assessment of high b-value breast DWI via deep learning-based artifact detection

【速读】:该论文旨在解决高b值扩散加权成像(high b-value diffusion-weighted imaging, DWI)在乳腺磁共振成像(breast MRI)中因强度伪影(包括高信号和低信号伪影)导致诊断图像评估受影响的问题。解决方案的关键在于采用深度学习方法,特别是基于卷积神经网络(CNN)的分类模型,对单切片DWI图像中的伪影进行检测与识别:研究比较了DenseNet121、ResNet18和SEResNet50三种架构,在二分类任务(伪影存在与否)和多类分类任务(伪影强度类型)中均表现出优异性能,其中DenseNet121在检测高信号和低信号伪影时分别达到0.92和0.94的AUROC,并通过Grad-CAM热图生成伪影定位框,由放射科医生主观评分验证其空间定位准确性,表明该方法具有临床应用潜力。

链接: https://arxiv.org/abs/2603.03941
作者: Ameya Markale,Luise Brock,Ihor Horishnyi,Dominika Skwierawska,Tri-Thien Nguyen,Hannes Schreiter,Shirin Heidarikahkesh,Lorenz A. Kapsner,Michael Uder,Sabine Ohlmeyer,Frederik B Laun,Andrzej Liebert,Sebastian Bickelhaupt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-weighted imaging (DWI) can support lesion detection and characterization in breast magnetic resonance imaging (MRI); however, especially high b-value diffusion-weighted acquisitions can be prone to intensity artifacts that can affect diagnostic image assessment. This study aims to detect both hyper- and hypointense artifacts on high b-value diffusion-weighted images (b=1500 s/mm2) using deep learning, employing either a binary classification (artifact presence) or a multiclass classification (artifact intensity) approach on a slice-wise basis. This IRB-approved retrospective study used a single-center dataset comprising n=11806 slices from routine 3T breast MRI examinations performed between 2022 and mid-2023. Three convolutional neural network (CNN) architectures (DenseNet121, ResNet18, and SEResNet50) were trained for binary classification of hyper- and hypointense artifacts. The best performing model (DenseNet121) was applied to an independent holdout test set and was further trained separately for multiclass classification. Evaluation included area under receiver operating characteristic curve (AUROC), area under precision recall curve (AUPRC), precision, and recall, as well as analysis of predicted bounding box positions, derived from the network's Grad-CAM heatmaps. DenseNet121 achieved AUROCs of 0.92 and 0.94 for hyper- and hypointense artifact detection, respectively, and weighted AUROCs of 0.85 and 0.88 for multiclass classification on single-slice high b-value diffusion-weighted images. A radiologist evaluated bounding box precision on a 1-5 Likert-like scale across 200 slices, achieving mean scores of 3.33±1.04 for hyperintense artifacts and 2.62±0.81 for hypointense artifacts. Hyper- and hypointense artifact detection in a slice-wise breast DWI MRI dataset (b=1500 s/mm2) using CNNs, particularly DenseNet121, seems promising and requires further validation.
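
摘要中由 Grad-CAM 热图推导边界框的一个常见做法(此处为假设,未必与该研究的实际后处理一致)是:按最大响应的一定比例对热图阈值化,再取超阈像素的外接矩形。numpy 最小示意如下:

```python
import numpy as np

def heatmap_to_bbox(cam, rel_thresh=0.5):
    """返回热图中响应 >= rel_thresh * max 区域的外接矩形 (x0, y0, x1, y1)。"""
    cam = np.asarray(cam, dtype=float)
    mask = cam >= rel_thresh * cam.max()   # 相对阈值化
    ys, xs = np.nonzero(mask)              # 超阈像素的行/列坐标
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```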

[CV-64] Cross-Modal Mapping and Dual-Branch Reconstruction for 2D-3D Multimodal Industrial Anomaly Detection

【速读】:该论文旨在解决工业场景下多模态异常检测中因深度噪声、纹理弱或模态缺失导致现有无监督方法鲁棒性不足的问题。其核心解决方案是提出一种轻量且模态灵活的无监督框架CMDR-IAD,关键在于:通过双向2D↔3D跨模态映射建模外观-几何一致性,并采用双分支重建独立捕捉正常纹理与几何结构;进一步设计两阶段融合策略——可靠性门控映射异常用于突出空间一致性的纹理-几何差异,置信度加权重建异常则自适应平衡外观与几何偏差,从而实现稳定精确的异常定位,尤其在深度稀疏或低纹理区域表现优异。

链接: https://arxiv.org/abs/2603.03939
作者: Radia Daci,Vito Renò,Cosimo Patruno,Angelo Cardellicchio,Abdelmalik Taleb-Ahmed,Marco Leo,Cosimo Distante
机构: Consiglio Nazionale delle Ricerche (意大利国家研究委员会); Université de Haute-Alsace (上阿尔萨斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal industrial anomaly detection benefits from integrating RGB appearance with 3D surface geometry, yet existing unsupervised approaches commonly rely on memory banks, teacher-student architectures, or fragile fusion schemes, limiting robustness under noisy depth, weak texture, or missing modalities. This paper introduces CMDR-IAD, a lightweight and modality-flexible unsupervised framework for reliable anomaly detection in 2D+3D multimodal as well as single-modality (2D-only or 3D-only) settings. CMDR-IAD combines bidirectional 2D↔3D cross-modal mapping to model appearance-geometry consistency with dual-branch reconstruction that independently captures normal texture and geometric structure. A two-part fusion strategy integrates these cues: a reliability-gated mapping anomaly highlights spatially consistent texture-geometry discrepancies, while a confidence-weighted reconstruction anomaly adaptively balances appearance and geometric deviations, yielding stable and precise anomaly localization even in depth-sparse or low-texture regions. On the MVTec 3D-AD benchmark, CMDR-IAD achieves state-of-the-art performance while operating without memory banks, reaching 97.3% image-level AUROC (I-AUROC), 99.6% pixel-level AUROC (P-AUROC), and 97.6% AUPRO. On a real-world polyurethane cutting dataset, the 3D-only variant attains 92.6% I-AUROC and 92.5% P-AUROC, demonstrating strong effectiveness under practical industrial conditions. These results highlight the framework’s robustness, modality flexibility, and the effectiveness of the proposed fusion strategies for industrial visual inspection. Our source code is available at this https URL

[CV-65] DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping

【速读】:该论文旨在解决开放集语义映射(open-set semantic mapping)中当前基于实例的方法因依赖裁剪(crop-based)特征提取而存在的上下文缺失与计算开销大等问题。其核心解决方案是提出DISC(Dense Integrated Semantic Context),采用一种单次遍历、距离加权的特征提取机制,直接从视觉Transformer的中间层获取高保真CLIP嵌入(CLIP embeddings),从而避免传统裁剪带来的延迟和域偏移伪影,生成纯正且掩码对齐的语义表示;同时,DISC构建于全GPU加速架构之上,将周期性离线处理替换为实时的体素级实例精化,显著提升了大规模连续建图场景下的可扩展性与实时性能。

链接: https://arxiv.org/abs/2603.03935
作者: Felix Igelbrink,Lennart Niecksch,Martin Atzmueller,Joachim Hertzberg
机构: German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心); Osnabrück University (奥斯纳布吕克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Open-set semantic mapping enables language-driven robotic perception, but current instance-centric approaches are bottlenecked by context-depriving and computationally expensive crop-based feature extraction. To overcome this fundamental limitation, we introduce DISC (Dense Integrated Semantic Context), featuring a novel single-pass, distance-weighted extraction mechanism. By deriving high-fidelity CLIP embeddings directly from the vision transformer’s intermediate layers, our approach eliminates the latency and domain-shift artifacts of traditional image cropping, yielding pure, mask-aligned semantic representations. To fully leverage these features in large-scale continuous mapping, DISC is built upon a fully GPU-accelerated architecture that replaces periodic offline processing with precise, on-the-fly voxel-level instance refinement. We evaluate our approach on standard benchmarks (Replica, ScanNet) and a newly generated large-scale-mapping dataset based on Habitat-Matterport 3D (HM3DSEM) to assess scalability across complex scenes in multi-story buildings. Extensive evaluations demonstrate that DISC significantly surpasses current state-of-the-art zero-shot methods in both semantic accuracy and query retrieval, providing a robust, real-time capable framework for robotic deployment. The full source code, data generation and evaluation pipelines will be made available at this https URL.

[CV-66] N-gram Injection into Transformers for Dynamic Language Model Adaptation in Handwritten Text Recognition

【速读】:该论文旨在解决基于Transformer的编码器-解码器网络在手写文本识别任务中,因目标语料库的语言分布与训练阶段所见源语料库存在偏移而导致性能显著下降的问题。其解决方案的关键在于提出一种外部n-gram注入(n-gram injection, NGI)机制,在推理阶段动态调整网络的语言建模能力,通过引入与目标语料库语言分布更接近的n-gram语言模型,从而在无需额外目标图像-文本对训练的情况下有效缓解语言分布偏移带来的识别准确率下降问题。该方法采用早期注入策略将n-gram信息融入Transformer解码器,使网络能充分利用纯文本数据,且仅需较低的n-gram推理开销即可实现性能提升。

链接: https://arxiv.org/abs/2603.03930
作者: Florent Meyer,Laurent Guichard,Denis Coquenet,Guillaume Gravier,Yann Soullard,Bertrand Coüasnon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformer-based encoder-decoder networks have recently achieved impressive results in handwritten text recognition, partly thanks to their auto-regressive decoder which implicitly learns a language model. However, such networks suffer from a large performance drop when evaluated on a target corpus whose language distribution is shifted from the source text seen during training. To retain recognition accuracy despite this language shift, we propose an external n-gram injection (NGI) for dynamic adaptation of the network’s language modeling at inference time. Our method allows switching to an n-gram language model estimated on a corpus close to the target distribution, therefore mitigating bias without any extra training on target image-text pairs. We opt for an early injection of the n-gram into the transformer decoder so that the network learns to fully leverage text-only data at the low additional cost of n-gram inference. Experiments on three handwritten datasets demonstrate that the proposed NGI significantly reduces the performance gap between source and target corpora.
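
作为直观参照:把外部 n-gram 语言模型分数与解码器分数做对数线性插值,是此类“注入”的最简形式(即浅层融合)。注意论文采用的是解码器内部的早期注入而非分数级融合,以下纯 Python(标准库)示意仅用于说明思路,函数名为自拟:

```python
import math
from collections import Counter

def bigram_lm(corpus, alpha=1.0):
    """加 alpha 平滑的字符级 bigram 语言模型, 返回 log P(c | prev) 与词表。"""
    vocab = sorted(set("".join(corpus)))
    bigrams, unigrams = Counter(), Counter()
    for text in corpus:
        for a, b in zip(text, text[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    def logp(prev, c):
        return math.log((bigrams[(prev, c)] + alpha) /
                        (unigrams[prev] + alpha * len(vocab)))
    return logp, vocab

def fused_next_char(model_logprobs, prev_char, logp, vocab, lam=0.3):
    """取 (1-lam)*解码器分数 + lam*n-gram 分数 的 argmax 作为下一字符。"""
    return max(vocab, key=lambda c: (1 - lam) * model_logprobs.get(c, -1e9)
                                    + lam * logp(prev_char, c))
```

换领域时只需在目标语料上重建 `bigram_lm`,解码器本身无需重新训练,这正是“动态适配”的含义。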

[CV-67] Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks CVPR2026

【速读】:该论文旨在解决图像美学评估(Image Aesthetic Assessment, IAA)中细粒度场景下的性能瓶颈问题,即现有主流IAA模型通常针对显著美学差异的粗粒度评估设计,在处理细微美学差异时判别能力不足。为应对这一挑战,作者构建了FGAesthetics数据集,包含32,217张图像和10,028个由自然图像、生成式AI内容(AIGC)及裁剪图像组成的系列,并通过成对比较进行标注以确保标签可靠性。其核心解决方案是提出FGAesQ框架,该框架基于相对排序学习,引入三个关键技术:差分保持标记化(Difference-preserved Tokenization, DiffToken)用于保留细粒度美学差异信息,对比文本辅助对齐(Comparative Text-assisted Alignment, CTAlign)增强跨模态语义一致性,以及排名感知回归(Rank-aware Regression, RankReg)实现高精度评分预测。该方法在细粒度评估任务中表现优异,同时保持了与现有方法相当的粗粒度评估性能。

链接: https://arxiv.org/abs/2603.03907
作者: Zhichao Yang,Jianjie Wang,Zhixianhe Zhang,Pangu Xie,Xiangfei Sheng,Pengfei Chen,Leida Li
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper has been accepted by CVPR 2026

点击查看摘要

Abstract:Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems, etc. In such applications, it is commonly needed to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintains competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method.
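
从成对比较(相对排序)标注中学出可判别的绝对分数,通常可用 Bradley-Terry 型的成对逻辑损失刻画;RankReg 的具体形式摘要未给出,以下仅为该类损失的通用 numpy 示意:

```python
import numpy as np

def pairwise_rank_loss(scores, pairs):
    """Bradley-Terry 型成对损失: 对每个 (胜者, 败者) 索引对取
    -log sigmoid(s_win - s_lose), 再求平均; 分数排序越正确损失越小。"""
    s = np.asarray(scores, dtype=float)
    margins = np.array([s[w] - s[l] for w, l in pairs])
    return float(np.mean(np.log1p(np.exp(-margins))))
```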

[CV-68] Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications

【速读】:该论文旨在解决无人机(UAV)平台在复杂动态环境中进行目标跟踪时面临的挑战,包括平台运动引起的相机抖动、计算资源受限以及现有视觉跟踪算法在鲁棒性与实时性之间的权衡问题。其解决方案的关键在于提出一种模块化异步跟踪架构(Modular Asynchronous Tracking Architecture, MATA),该架构融合了基于Transformer的目标跟踪器与扩展卡尔曼滤波(Extended Kalman Filter),通过稀疏光流实现自运动补偿,并引入目标轨迹模型以提升稳定性;同时设计了一个面向嵌入式系统的硬件无关评估协议和新的量化指标——归一化失效时间(Normalized time to Failure, NT2F),从而更真实地反映跟踪器在资源受限环境下的持续性能表现。

链接: https://arxiv.org/abs/2603.03904
作者: Augustin Borne(ISL, Hochschule Karlsruhe – Technik und Wirtschaft Karlsruhe University of Applied Sciences, IRIMAS),Pierre Notin(ISL),Christophe Hennequin(ISL),Sebastien Changey(ISL),Stephane Bazeille(IRIMAS),Christophe Cudel(IRIMAS),Franz Quint
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object tracking from Unmanned Aerial Vehicles (UAVs) is challenged by platform dynamics, camera motion, and limited onboard resources. Existing visual trackers either lack robustness in complex scenarios or are too computationally demanding for real-time embedded use. We propose a Modular Asynchronous Tracking Architecture (MATA) that combines a transformer-based tracker with an Extended Kalman Filter, integrating ego-motion compensation from sparse optical flow and an object trajectory model. We further introduce a hardware-independent, embedded-oriented evaluation protocol and a new metric called Normalized time to Failure (NT2F) to quantify how long a tracker can sustain a tracking sequence without external help. Experiments on UAV benchmarks, including an augmented UAV123 dataset with synthetic occlusions, show consistent improvements in Success and NT2F metrics across multiple tracking processing frequencies. A ROS 2 implementation on an Nvidia Jetson AGX Orin confirms that the evaluation protocol more closely matches real-time performance on embedded systems.
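
MATA 的自运动补偿可以这样直观理解:用背景稀疏光流对应点最小二乘拟合一个全局运动模型(此处以仿射为例,属假设,论文未指明具体模型),再把上一帧的目标位置重投影到当前帧。numpy 示意如下:

```python
import numpy as np

def fit_affine(prev_pts, curr_pts):
    """最小二乘拟合 2x3 仿射矩阵 A, 使 curr ≈ [prev, 1] @ A.T。"""
    P = np.hstack([prev_pts, np.ones((len(prev_pts), 1))])
    A, *_ = np.linalg.lstsq(P, curr_pts, rcond=None)
    return A.T                      # 形状 (2, 3)

def compensate(point, A):
    """按估计的相机运动, 把上一帧的一个点重投影到当前帧坐标系。"""
    x, y = point
    return A @ np.array([x, y, 1.0])
```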

[CV-69] From Misclassifications to Outliers: Joint Reliability Assessment in Classification

【速读】:该论文旨在解决机器学习模型在真实场景中部署时的可靠性问题,即如何同时有效检测分布外(out-of-distribution, OOD)输入并预测分布内(in-distribution, ID)错误,从而提升分类器的整体可信度。传统方法通常将OOD检测与失败预测视为独立任务,忽略了二者之间的内在关联。论文提出一个统一的评估框架,引入双评分函数(double scoring functions, DS)的概念,并设计了新的量化指标DS-F1和DS-AURC来联合衡量两类任务的性能。关键创新在于通过双评分机制实现对ID样本的不确定性建模与OOD样本的区分能力协同优化,实验表明该方法显著优于单一评分策略;此外,基于此框架改进的SURE+模型在多种场景下均展现出更强的鲁棒性和可靠性,为构建可信赖的分类系统提供了新基准和实践指导。

链接: https://arxiv.org/abs/2603.03903
作者: Yang Li,Youyang Sha,Yinzhi Wang,Timothy Hospedales,Xi Shen,Shell Xu Hu,Xuanlong Yu
机构: Intellindust AI Lab(Intellindust人工智能实验室); Samsung AI Center(三星人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 3 figures. The source code is publicly available at this https URL

点击查看摘要

Abstract:Building reliable classifiers is a fundamental challenge for deploying machine learning in real-world applications. A reliable system should not only detect out-of-distribution (OOD) inputs but also anticipate in-distribution (ID) errors by assigning low confidence to potentially misclassified samples. Yet, most prior work treats OOD detection and failure prediction as separated problems, overlooking their closed connection. We argue that reliability requires evaluating them jointly. To this end, we propose a unified evaluation framework that integrates OOD detection and failure prediction, quantified by our new metrics DS-F1 and DS-AURC, where DS denotes double scoring functions. Experiments on the OpenOOD benchmark show that double scoring functions yield classifiers that are substantially more reliable than traditional single scoring approaches. Our analysis further reveals that OOD-based approaches provide notable gains under simple or far-OOD shifts, but only marginal benefits under more challenging near-OOD conditions. Beyond evaluation, we extend the reliable classifier SURE and introduce SURE+, a new approach that significantly improves reliability across diverse scenarios. Together, our framework, metrics, and method establish a new benchmark for trustworthy classification and offer practical guidance for deploying robust models in real-world settings. The source code is publicly available at this https URL.
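
文中的 DS-AURC 建立在标准 AURC(风险-覆盖率曲线下面积,值越小越好)之上。下面给出单评分函数版本 AURC 的最小 numpy 计算示意(双评分 DS 变体的定义见原文,此处未实现):

```python
import numpy as np

def aurc(confidence, correct):
    """风险-覆盖率曲线下面积: 按置信度从高到低逐步扩大覆盖集,
    对每个覆盖水平计算选择性错误率(风险), 再取平均。"""
    order = np.argsort(-np.asarray(confidence))           # 置信度降序
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    risks = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(risks.mean())
```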

[CV-70] A novel network for classification of cuneiform tablet metadata

【速读】:该论文旨在解决楔形文字泥板(cuneiform tablets)元数据分类问题,其核心挑战在于标注数据集规模有限且每块泥板以高分辨率点云(point-cloud)形式表示,导致传统方法难以有效处理。解决方案的关键在于提出一种受卷积启发的网络结构:首先通过逐步下采样点云并融合局部邻域信息来提取特征,随后在特征空间中计算邻居关系以引入全局上下文信息,从而在小样本条件下实现更优的分类性能。该方法相较于基于Transformer的Point-BERT模型展现出一致的优势。

链接: https://arxiv.org/abs/2603.03892
作者: Frederik Hagelskjær
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Point cloud, deep learning, cuneiform

点击查看摘要

Abstract:In this paper, we present a network structure for classifying metadata of cuneiform tablets. The problem is of practical importance, as the size of the existing corpus far exceeds the number of experts available to analyze it. But the task is made difficult by the combination of limited annotated datasets and the high-resolution point-cloud representation of each tablet. To address this, we develop a convolution-inspired architecture that gradually down-scales the point cloud while integrating local neighbor information. The final down-scaled point cloud is then processed by computing neighbors in the feature space to include global information. Our method is compared with the state-of-the-art transformer-based network Point-BERT, and consistently obtains the best performance. Source code and datasets will be released at publication.
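
摘要所述“逐步降采样并融合局部邻域信息”中,降采样环节的一个经典选择是最远点采样(FPS);论文未说明其具体采样策略,以下仅为 FPS 的 numpy 示意:

```python
import numpy as np

def farthest_point_sampling(points, k):
    """贪心最远点采样: 从索引 0 出发, 每步选取与已选点集最小距离最大的点。"""
    points = np.asarray(points, dtype=float)
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)  # 到已选集的最小距离
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)
```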

[CV-71] UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

【速读】:该论文旨在解决当前唇同步(Lip synchronization)技术在真实场景下存在的两大核心问题:一是基于掩码的方法存在局部颜色不一致的问题,二是无掩码方法在全局背景纹理对齐上表现不佳;同时,现有方法难以应对多样化的现实挑战,如风格化虚拟形象、面部遮挡和极端光照条件。解决方案的关键在于提出一个统一框架UniSync,其核心创新包括:采用无掩码的姿态锚定训练策略(pose-anchored training strategy),以保留头部运动并消除合成颜色伪影;同时引入基于掩码的融合一致推理(mask-based blending consistent inference),确保结构精度与平滑过渡;此外,通过在紧凑但多样化视频数据上的微调,显著提升了模型的领域适应能力,从而有效处理复杂边界情况。

链接: https://arxiv.org/abs/2603.03882
作者: Ruidi Fan,Yang Zhou,Siyuan Wang,Tian Yu,Yutong Jiang,Xusheng Liu
机构: Mango TV (芒果TV)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods struggle with global background texture misalignment. Furthermore, most methods struggle with diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework designed for achieving high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free pose-anchored training strategy to keep head motion and eliminate synthesis color artifacts, while employing mask-based blending consistent inference to ensure structural precision and smooth blending. Notably, fine-tuning on compact but diverse videos empowers our model with exceptional domain adaptability, handling complex corner cases effectively. We also introduce the RealWorld-LipSync benchmark to evaluate models under real-world demands, which covers diverse application scenarios including both human faces and stylized avatars. Extensive experiments demonstrate that UniSync significantly outperforms state-of-the-art methods, advancing the field towards truly generalizable and production-ready lip synchronization.

[CV-72] Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

【速读】:该论文旨在解决从单张RGB图像中实现高精度、低延迟的6D位姿估计问题,这是机器人技术和扩展现实(Extended Reality, XR)应用中的关键挑战。现有基于多阶段的方法虽精度较高,但因计算复杂度大导致延迟高,难以满足实时性需求。其解决方案的关键在于提出一种新颖的单阶段端到端框架YOLO-Key-6D,通过在YOLO架构基础上引入辅助关键点回归头,直接预测物体3D边界框角点在图像中的2D投影,从而显著增强网络对三维几何的理解;同时采用连续9维旋转表示并通过奇异值分解(Singular Value Decomposition, SVD)映射到特殊正交群SO(3),确保训练稳定性和旋转参数的有效性。实验表明,该方法在LINEMOD和LINEMOD-Occluded数据集上分别达到96.24%和69.41%的ADD(-S) 0.1d准确率,并具备实时推理能力,实现了性能与效率的良好平衡。

链接: https://arxiv.org/abs/2603.03879
作者: Kemal Alperen Çetiner,Hazım Kemal Ekenel
机构: ASELSAN; Istanbul Technical University (伊斯坦布尔技术大学); New York University Abu Dhabi (纽约大学阿布扎比分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to VISAPP 2026

点击查看摘要

Abstract:Estimating the 6D pose of objects from a single RGB image is a critical task for robotics and extended reality applications. However, state-of-the-art multi stage methods often suffer from high latency, making them unsuitable for real time use. In this paper, we present Yolo-Key-6D, a novel single stage, end-to-end framework for monocular 6D pose estimation designed for both speed and accuracy. Our approach enhances a YOLO based architecture by integrating an auxiliary head that regresses the 2D projections of an object’s 3D bounding box corners. This keypoint detection task significantly improves the network’s understanding of 3D geometry. For stable end-to-end training, we directly regress rotation using a continuous 9D representation projected to SO(3) via singular value decomposition. On the LINEMOD and LINEMOD-Occluded benchmarks, YOLO-Key-6D achieves competitive accuracy scores of 96.24% and 69.41%, respectively, with the ADD(-S) 0.1d metric, while proving itself to operate in real time. Our results demonstrate that a carefully designed single stage method can provide a practical and effective balance of performance and efficiency for real world deployment.
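
摘要中“连续 9 维旋转表示经 SVD 投影到 SO(3)”对应一个标准做法(Levinson 等人提出的 SVD 正交化):对网络回归出的 3×3 矩阵做 SVD,并强制行列式为 +1,得到 Frobenius 范数意义下最近的旋转矩阵。numpy 示意如下:

```python
import numpy as np

def svd_orthogonalize(m):
    """把任意 3x3 矩阵投影到 SO(3): SVD 后令奇异值为 (1, 1, ±1),
    符号由 det(U V^T) 决定, 保证输出行列式为 +1。"""
    u, _, vt = np.linalg.svd(np.asarray(m, dtype=float))
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    return u @ np.diag([1.0, 1.0, d]) @ vt
```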

[CV-73] Bridging Human Evaluation to Infrared and Visible Image Fusion

【速读】:该论文旨在解决红外与可见光图像融合(Infrared and Visible Image Fusion, IVIF)中因依赖手工设计损失函数和客观指标而导致的融合结果与人类视觉偏好不一致的问题,尤其在安全监控和驾驶辅助等以人为中心的应用场景中表现不佳。其关键解决方案是构建了一个大规模的人类反馈数据集(Human Feedback Dataset),包含多维度主观评分和伪影标注,并通过微调大语言模型进行专家评审以增强数据质量;在此基础上设计领域特定的奖励函数并训练奖励模型以量化感知质量,进而利用Group Relative Policy Optimization对融合网络进行微调,从而显著提升融合图像在人类审美上的表现,达到当前最优水平。

链接: https://arxiv.org/abs/2603.03871
作者: Jinyuan Liu,Xingyuan Li,Qingyun Mei,Haoyuan Xu,Zhiying Jiang,Long Ma,Risheng Liu,Xin Fan
机构: Dalian University of Technology (大连理工大学); Zhejiang University (浙江大学); Dalian Maritime University (大连海事大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared and visible image fusion (IVIF) integrates complementary modalities to enhance scene perception. Current methods predominantly focus on optimizing handcrafted losses and objective metrics, often resulting in fusion outcomes that do not align with human visual preferences. This challenge is further exacerbated by the ill-posed nature of IVIF, which severely limits its effectiveness in human perceptual environments such as security surveillance and driver assistance systems. To address these limitations, we propose a feedback reinforcement framework that bridges human evaluation to infrared and visible image fusion. To address the lack of human-centric evaluation metrics and data, we introduce the first large-scale human feedback dataset for IVIF, containing multidimensional subjective scores and artifact annotations, and enriched by a fine-tuned large language model with expert review. Based on this dataset, we design a domain-specific reward function and train a reward model to quantify perceptual quality. Guided by this reward, we fine-tune the fusion network through Group Relative Policy Optimization, achieving state-of-the-art performance that better aligns fused images with human aesthetics. Code is available at this https URL.

[CV-74] DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在复杂或噪声环境中难以准确进行视觉定位与推理的问题,尤其在细粒度视觉理解任务中表现受限。其解决方案的关键在于提出了一种无需训练的框架DeepScan,通过三个核心模块协同实现:首先,**分层扫描(Hierarchical Scanning)**以自底向上方式探索局部线索并多尺度提取证据,有效缓解干扰背景的影响;其次,**重聚焦(Refocusing)**通过LVLM与视觉专家协作优化定位区域;最后,**证据增强推理(Evidence-Enhanced Reasoning)**利用混合证据记忆聚合多粒度视图,生成准确且可解释的答案。该方法显著提升了LVLM在多种视觉任务中的性能,且对不同架构和规模的模型均具普适性。

链接: https://arxiv.org/abs/2603.03857
作者: Yangfu Li,Hongjian Zhan,Jiawei Chen,Yuning Gong,Qi Liu,Yue Lu
机构: East China Normal University (华东师范大学); Sichuan University (四川大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages 17 figures

点击查看摘要

Abstract:Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost.

[CV-75] All-in-One Image Restoration via Causal-Deconfounding Wavelet-Disentangled Prompt Network

【速读】:该论文旨在解决全合一图像恢复(All-in-One Image Restoration, AiOIR)模型中存在的两个关键问题:一是非退化语义特征与退化模式之间的虚假相关性,二是退化模式估计的偏差问题。为实现真正的因果关联建模,作者提出Causal-deconfounding Wavelet-disentangled Prompt Network (CWP-Net),其核心创新在于引入两个小波注意力模块(wavelet attention module)——分别位于编码器和解码器中,用于显式解耦退化特征与语义特征,从而消除虚假相关性;同时设计小波提示块(wavelet prompt block)生成替代变量以进行因果去混淆,缓解退化模式估计偏差,显著提升AiOIR模型的有效性和泛化能力。

链接: https://arxiv.org/abs/2603.03839
作者: Bingnan Wang,Bin Qin,Jiangmeng Li,Fanjiang Xu,Fuchun Sun,Hui Xiong
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Software, Chinese Academy of Science (中国科学院软件研究所); Department of Computer Science and Technology, Tsinghua University (清华大学计算机科学与技术系); Beijing National Research Center for Information Science and Technology (北京信息科学与技术研究中心); Artificial Intelligence Thrust, The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)人工智能研究组)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TIP 2026

点击查看摘要

Abstract:Image restoration represents a promising approach for addressing the inherent defects of image content distortion. Standard image restoration approaches suffer from high storage cost and the requirement towards the known degradation pattern, including type and degree, which can barely be satisfied in dynamic practical scenarios. In contrast, all-in-one image restoration (AiOIR) eliminates multiple degradations within a unified model to circumvent the aforementioned issues. However, according to our causal analysis, we disclose that two significant defects still exacerbate the effectiveness and generalization of AiOIR models: 1) the spurious correlation between non-degradation semantic features and degradation patterns; 2) the biased estimation of degradation patterns. To obtain the true causation between degraded images and restored images, we propose Causal-deconfounding Wavelet-disentangled Prompt Network (CWP-Net) to perform effective AiOIR. CWP-Net introduces two modules for decoupling, i.e., wavelet attention module of encoder and wavelet attention module of decoder. These modules explicitly disentangle the degradation and semantic features to tackle the issue of spurious correlation. To address the issue stemming from the biased estimation of degradation patterns, CWP-Net leverages a wavelet prompt block to generate the alternative variable for causal deconfounding. Extensive experiments on two all-in-one settings prove the effectiveness and superior performance of our proposed CWP-Net over the state-of-the-art AiOIR methods.
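
CWP-Net 的小波模块基于“低频子带偏语义结构、高频子带偏退化纹理”的先验。下面用 numpy 实现一层 Haar 式的均值/差分二维分解以说明子带拆分(仅为示意,非论文网络的实际实现):

```python
import numpy as np

def haar_dwt2(img):
    """一层 Haar 式二维分解: 对行、列各做一次均值/差分,
    返回 (低频近似, 以及三个高频细节子带)。输入高宽须为偶数。"""
    x = np.asarray(img, dtype=float)
    a, b = x[0::2, :], x[1::2, :]              # 相邻行配对
    lo, hi = (a + b) / 2.0, (a - b) / 2.0      # 行方向均值/差分
    def cols(y):
        c, d = y[:, 0::2], y[:, 1::2]          # 相邻列配对
        return (c + d) / 2.0, (c - d) / 2.0
    ll, lh = cols(lo)                          # 低频近似 / 细节
    hl, hh = cols(hi)
    return ll, lh, hl, hh
```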

[CV-76] Universal Pansharpening Foundation Model

【速读】:该论文旨在解决现有图像融合方法在多源卫星传感器和多样化场景下泛化能力差的问题,即当前的超分辨率多光谱(MS)图像生成方法通常依赖特定卫星平台和具体场景,限制了其在真实世界中的适用性。解决方案的关键在于提出一种通用的遥感图像融合基础模型 FoundPS,其核心创新包括:1)设计了一种模态交错的 Transformer 架构,通过学习波段特异性表示形成可逆的光谱仿射基底,将任意波段的 MS 图像映射到统一潜在空间;2)构建基于潜在扩散桥接的模型,利用后验采样机制将潜在空间演化与像素空间观测耦合,实现稳定且可控的融合过程;3)引入无限维像素到潜在空间的交互机制,全面捕捉 PAN 观测与 MS 表示之间的跨域依赖关系,从而促进互补信息的有效融合。

链接: https://arxiv.org/abs/2603.03831
作者: Hebaixu Wang,Jing Zhang,Haonan Guo,Di Wang,Jiayi Ma,Bo Du,Liangpei Zhang
机构: Wuhan University (武汉大学); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pansharpening generates the high-resolution multi-spectral (MS) image by integrating spatial details from a texture-rich panchromatic (PAN) image and spectral attributes from a low-resolution MS image. Existing methods are predominantly satellite-specific and scene-dependent, which severely limits their generalization across heterogeneous sensors and varied scenes, thereby reducing their real-world practicality. To address these challenges, we present FoundPS, a universal pansharpening foundation model for satellite-agnostic and scene-robust fusion. Specifically, we introduce a modality-interleaved transformer that learns band-wise modal specializations to form reversible spectral affine bases, mapping arbitrary-band MS into a unified latent space via tensor multiplication. Building upon this, we construct a latent diffusion bridge model to progressively evolve latent representations, and incorporate bridge posterior sampling to couple latent diffusion with pixel-space observations, enabling stable and controllable fusion. Furthermore, we devise infinite-dimensional pixel-to-latent interaction mechanisms to comprehensively capture the cross-domain dependencies between PAN observations and MS representations, thereby facilitating complementary information fusion. In addition, to support large-scale training and evaluation, we construct a comprehensive pansharpening benchmark, termed PSBench, consisting of worldwide MS and PAN image pairs from multiple satellites across diverse scenes. Extensive experiments demonstrate that FoundPS consistently outperforms state-of-the-art methods, exhibiting superior generalization and robustness across a wide range of pansharpening tasks.

[CV-77] From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning ICLR2026

【速读】:该论文旨在解决多模态大模型(Multimodal Large Reasoning Models, MLRMs)在冷启动初始化阶段注意力机制分配不合理的问题,特别是视觉信息未能有效激活导致推理性能受限的瓶颈。其核心发现是:视觉注意力得分(Visual Attention Score, VAS)与模型多模态推理能力高度相关(r=0.9616),但传统的多模态冷启动策略无法提升VAS,反而表现出“懒惰的注意力定位”(Lazy Attention Localization)现象——即注意力分布接近纯文本基线模型。解决方案的关键在于提出Attention-Guided Visual Anchoring and Reflection (AVAR) 框架,通过三个协同模块实现:1)基于视觉锚点的数据合成增强视觉感知;2)注意力引导的目标函数优化注意力分配;3)视觉锚定的奖励塑形强化视觉线索利用。该方法无需重新训练即可显著提升模型性能(+1–2%),并在Qwen2.5-VL-7B上实现跨7个基准测试平均提升7.0%,验证了其有效性与可扩展性。

链接: https://arxiv.org/abs/2603.03825
作者: Ruilin Luo,Chufan Shi,Yizhen Zhang,Cheng Yang,Songtao Jiang,Tongkun Guan,Ruizhe Chen,Ruihang Chu,Peng Wang,Mingkun Yang,Yujiu Yang,Junyang Lin,Zhibo Yang
机构: Tsinghua University (清华大学); University of Southern California (南加州大学); Qwen Team, Alibaba Group (通义实验室,阿里巴巴集团); University of California San Diego (加州大学圣地亚哥分校); Zhejiang University (浙江大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026 Poster

点击查看摘要

Abstract:The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, performance gains of 1 - 2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at this https URL.
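The abstract defines the Visual Attention Score only as an attention-based metric of how much the model attends to visual tokens. A minimal sketch of one plausible instantiation (the fraction of softmax attention mass landing on visual-token keys, averaged over heads and queries; the exact layer/head aggregation used in the paper is an assumption here):

```python
import numpy as np

def visual_attention_score(attn, visual_mask):
    """Share of attention mass landing on visual tokens.

    attn: (heads, queries, keys) softmax attention weights (rows sum to 1).
    visual_mask: (keys,) bool, True where the key position is a visual token.
    """
    visual_mass = attn[..., visual_mask].sum(axis=-1)  # (heads, queries)
    return float(visual_mass.mean())

# toy example: 2 heads, 3 queries, 4 keys; the first two keys are visual
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 4))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
vas = visual_attention_score(attn, np.array([True, True, False, False]))
```

Under this reading, the paper's training-free interventions amount to directly boosting this quantity at inference time, which is consistent with the reported 1-2% gains without retraining.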

[CV-78] Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning

【速读】:该论文旨在解决开放词汇组合零样本学习(Open-Vocabulary Compositional Zero-Shot Learning, OV-CZSL)中模型难以泛化到未见属性与对象及其组合的问题。现有基于提示调优(prompt tuning)的方法在封闭设置下表现良好,但在开放设置下受限于仅能处理已见概念,无法有效迁移至未见语义空间。解决方案的关键在于利用嵌入空间中语义相关属性或对象形成的局部结构一致性,提出结构感知提示自适应(Structure-aware Prompt Adaptation, SPA)方法:训练阶段设计结构一致性损失(Structure-aware Consistency Loss, SCL)以保持已见概念的局部结构稳定;推理阶段采用结构引导的自适应策略(Structure-guided Adaptation Strategy, SAS),将未见概念的结构对齐至语义相似的已见概念结构,从而实现从已见到未见概念的有效泛化。SPA可无缝集成至现有CZSL提示调优方法,兼具良好的闭集性能与显著提升的开集表现。

链接: https://arxiv.org/abs/2603.03815
作者: Yihang Duan,Jiong Wang,Pengpeng Zeng,Ji Zhang,Lei Zhao,Chong Wang,Jingkuan Song,Lianli Gao
机构: University of Electronic Science and Technology of China (电子科技大学); Ningbo University (宁波大学); Southwest Jiaotong University (西南交通大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The goal of Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) is to recognize attribute-object compositions in the open-vocabulary setting, where compositions of both seen and unseen attributes and objects are evaluated. Recently, prompt tuning methods have demonstrated strong generalization capabilities in the closed setting, where only compositions of seen attributes and objects are evaluated, i.e., Compositional Zero-Shot Learning (CZSL). However, directly applying these methods to OV-CZSL may not be sufficient to generalize to unseen attributes, objects and their compositions, as it is limited to seen attributes and objects. Normally, when faced with unseen concepts, humans adopt analogies with seen concepts that have the similar semantics thereby inferring their meaning (e.g., “wet” and “damp”, “shirt” and “jacket”). In this paper, we experimentally show that the distribution of semantically related attributes or objects tends to form consistent local structures in the embedding space. Based on the above structures, we propose Structure-aware Prompt Adaptation (SPA) method, which enables models to generalize from seen to unseen attributes and objects. Specifically, in the training stage, we design a Structure-aware Consistency Loss (SCL) that encourages the local structure’s consistency of seen attributes and objects in each iteration. In the inference stage, we devise a Structure-guided Adaptation Strategy (SAS) that adaptively aligns the structures of unseen attributes and objects with those of trained seen attributes and objects with similar semantics. Notably, SPA is a plug-and-play method that can be seamlessly integrated into existing CZSL prompt tuning methods. Extensive experiments on OV-CZSL benchmarks demonstrate that SPA achieves competitive closed-set performance while significantly improving open-vocabulary results.
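The inference-time Structure-guided Adaptation Strategy is only described at a high level. One hedged sketch of the underlying analogy idea, where an unseen concept's embedding is shifted by the average tuning offset of its most similar seen concepts (function name, neighbor count `k`, and the uniform toy offset are all illustrative assumptions):

```python
import numpy as np

def adapt_unseen_embedding(unseen_init, seen_init, seen_tuned, k=2):
    """Align an unseen concept with the local structure of similar seen ones.

    unseen_init: (d,) pretrained embedding of the unseen attribute/object.
    seen_init / seen_tuned: (n, d) seen embeddings before / after prompt tuning.
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    sims = unit(seen_init) @ unit(unseen_init)   # cosine similarity, (n,)
    neighbors = np.argsort(-sims)[:k]            # k most similar seen concepts
    offset = (seen_tuned[neighbors] - seen_init[neighbors]).mean(axis=0)
    return unseen_init + offset                  # transfer their tuning shift

rng = np.random.default_rng(3)
seen_init = rng.normal(size=(5, 8))
seen_tuned = seen_init + 0.1        # toy: tuning moved every seen concept by +0.1
unseen = rng.normal(size=8)
adapted = adapt_unseen_embedding(unseen, seen_init, seen_tuned)
```

This mirrors the "wet"/"damp" analogy in the abstract: the unseen word inherits the adaptation learned for its semantic neighbors instead of staying at its frozen pretrained position.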

[CV-79] Vector-Quantized Soft Label Compression for Dataset Distillation

【速读】:该论文旨在解决数据集蒸馏(Dataset Distillation)中软标签(soft labels)带来的存储开销问题,尤其是在大规模分类任务(如ImageNet-1K)中,每个合成样本通常关联多个经过增强的软标签,导致软标签成为存储成本的主要来源。解决方案的关键在于提出一种向量量化自编码器(Vector-Quantized Autoencoder, VQAE),用于高效压缩软标签,从而在不显著损失蒸馏性能的前提下实现高达30–40倍的额外压缩比,优于RDED、LPLD、SRE2L和CDA等基线方法。

链接: https://arxiv.org/abs/2603.03808
作者: Ali Abbasi,Ashkan Shahbazi,Hamed Pirsiavash,Soheil Kolouri
机构: Vanderbilt University (范德比尔特大学); University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dataset distillation is an emerging technique for reducing the computational and storage costs of training machine learning models by synthesizing a small, informative subset of data that captures the essential characteristics of a much larger dataset. Recent methods pair synthetic samples and their augmentations with soft labels from a teacher model, enabling student models to generalize effectively despite the small size of the distilled dataset. While soft labels are critical for effective distillation, the storage and communication overhead they incur, especially when accounting for augmentations, is often overlooked. In practice, each distilled sample is associated with multiple soft labels, making them the dominant contributor to storage costs, particularly in large-class settings such as ImageNet-1K. In this paper, we present a rigorous analysis of bit requirements across dataset distillation frameworks, quantifying the storage demands of both distilled samples and their soft labels. To address the overhead, we introduce a vector-quantized autoencoder (VQAE) for compressing soft labels, achieving substantial compression while preserving the effectiveness of the distilled data. We validate our method on both vision and language distillation benchmarks. On ImageNet-1K, our proposed VQAE achieves 30–40x additional compression over RDED, LPLD, SRE2L, and CDA baselines while retaining over 90% of their original performance.
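The VQAE itself is not specified beyond being a vector-quantized autoencoder, but the storage argument can be illustrated with plain codebook quantization of soft-label vectors (class count, codebook size, and the random codebook are illustrative; the real method learns encoder, decoder, and codebook jointly):

```python
import numpy as np

def quantize(soft_labels, codebook):
    """Index of the nearest codebook entry (L2) for each soft-label vector."""
    d = ((soft_labels[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1).astype(np.uint8)

rng = np.random.default_rng(0)
num_classes, codebook_size = 100, 16
soft_labels = rng.dirichlet(np.full(num_classes, 0.1), size=32)  # peaked labels
codebook = rng.dirichlet(np.full(num_classes, 0.1), size=codebook_size)

codes = quantize(soft_labels, codebook)
recon = codebook[codes]                  # dequantize by table lookup

# storage: one 8-bit code replaces num_classes float32 logits per label
compression = (num_classes * 32) / 8     # 400x in this toy setting
```

The bit accounting is the point: with ImageNet-1K's 1000 classes and many augmented soft labels per distilled sample, replacing each float vector with a short code index is what makes the reported 30-40x additional compression plausible.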

[CV-80] Adaptive Enhancement and Dual-Pooling Sequential Attention for Lightweight Underwater Object Detection with YOLOv10

【速读】:该论文旨在解决水下目标检测中因光吸收、散射和对比度下降等视觉退化现象导致的性能瓶颈问题。其解决方案的关键在于提出一个基于YOLOv10架构的轻量化且鲁棒的框架,核心创新包括:(1)多阶段自适应增强模块以提升水下图像质量;(2)嵌入骨干网络中的双池化顺序注意力机制(Dual-Pooling Sequential Attention, DPSA),强化多尺度特征表示能力;(3)引入焦点广义交并比物体性损失(Focal Generalized IoU Objectness, FGIoU),在类别不平衡条件下协同优化定位精度与物体性预测。实验表明,该方法在RUOD和DUO数据集上分别实现88.9%和88.0%的mAP(IoU=0.5),较基线YOLOv10n提升6.7%和6.2%,同时保持仅2.8M参数的紧凑结构,实现了精度、鲁棒性与实时性的有效平衡。

链接: https://arxiv.org/abs/2603.03807
作者: Md. Mushibur Rahman,Umme Fawzia Rahim,Enam Ahmed Taufik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN)

点击查看摘要

Abstract:Underwater object detection constitutes a pivotal endeavor within the realms of marine surveillance and autonomous underwater systems; however, it presents significant challenges due to pronounced visual impairments arising from phenomena such as light absorption, scattering, and diminished contrast. In response to these formidable challenges, this manuscript introduces a streamlined yet robust framework for underwater object detection, grounded in the YOLOv10 architecture. The proposed method integrates a Multi-Stage Adaptive Enhancement module to improve image quality, a Dual-Pooling Sequential Attention (DPSA) mechanism embedded into the backbone to strengthen multi-scale feature representation, and a Focal Generalized IoU Objectness (FGIoU) loss to jointly improve localization accuracy and objectness prediction under class imbalance. Comprehensive experimental evaluations conducted on the RUOD and DUO benchmark datasets substantiate that the proposed DPSA_FGIoU_YOLOv10n attains exceptional performance, achieving mean Average Precision (mAP) scores of 88.9% and 88.0% at IoU threshold 0.5, respectively. In comparison to the baseline YOLOv10n, this represents enhancements of 6.7% for RUOD and 6.2% for DUO, all while preserving a compact model architecture comprising merely 2.8M parameters. These findings validate that the proposed framework establishes an efficacious equilibrium among accuracy, robustness, and real-time operational efficiency, making it suitable for deployment in resource-constrained underwater settings.
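The abstract does not spell out the Dual-Pooling Sequential Attention design; a CBAM-style sketch consistent with the name, where average and max pooling feed a shared channel gate and a spatial gate is applied afterwards (all weights illustrative, and the spatial convolution is replaced by a simple sum), would look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpsa(x, w1, w2):
    """x: (C, H, W) feature map; w1: (C//r, C), w2: (C, C//r) shared MLP."""
    # channel gate from dual-pooled descriptors
    avg, mx = x.mean(axis=(1, 2)), x.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    x = x * sigmoid(mlp(avg) + mlp(mx))[:, None, None]
    # spatial gate, again from dual pooling (a conv in CBAM; a sum here)
    s_avg, s_max = x.mean(axis=0), x.max(axis=0)
    return x * sigmoid(s_avg + s_max)[None]

rng = np.random.default_rng(0)
c, r = 8, 2
x = rng.normal(size=(c, 4, 4))
out = dpsa(x, rng.normal(size=(c // r, c)) * 0.1, rng.normal(size=(c, c // r)) * 0.1)
```

Both gates lie in (0, 1), so the module can only rescale features, which keeps it lightweight enough for the 2.8M-parameter budget the paper targets.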

[CV-81] Separators in Enhancing Autoregressive Pretraining for Vision Mamba

【速读】:该论文旨在解决当前自回归预训练方法在视觉Mamba(Vision Mamba)模型中仅适用于短序列任务的问题,从而无法充分发挥其处理长序列的优势。解决方案的关键在于提出一种名为STAR(SeparaTors for AutoRegressive pretraining)的新方法,通过在每张图像前插入相同的分隔符(separator),明确区分不同图像的起始位置,从而实现输入序列长度的四倍扩展,同时保持原始图像维度不变。这一设计使得模型能够有效利用长程依赖关系,在ImageNet-1k上达到83.5%的准确率,显著提升了视觉Mamba的性能。

链接: https://arxiv.org/abs/2603.03806
作者: Hanpeng Liu,Zidan Wang,Shuoxi Zhang,Kaiyuan Gao,Kun He
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The state space model Mamba has recently emerged as a promising paradigm in computer vision, attracting significant attention due to its efficient processing of long sequence tasks. Mamba’s inherent causal mechanism renders it particularly suitable for autoregressive pretraining. However, current autoregressive pretraining methods are constrained to short sequence tasks, failing to fully exploit Mamba’s prowess in handling extended sequences. To address this limitation, we introduce an innovative autoregressive pretraining method for Vision Mamba that substantially extends the input sequence length. We introduce new SeparaTors for AutoRegressive pretraining to demarcate and differentiate between different images, known as STAR. Specifically, we insert identical separators before each image to demarcate its inception. This strategy enables us to quadruple the input sequence length of Vision Mamba while preserving the original dimensions of the dataset images. Employing this long sequence pretraining technique, our STAR-B model achieved an impressive accuracy of 83.5% on ImageNet-1k, which is highly competitive in Vision Mamba. These results underscore the potential of our method in enhancing the performance of vision models through improved leveraging of long-range dependencies.
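The separator mechanism itself is simple to state: an identical token is prepended to each image's token sequence before concatenation, so the model can tell where one image ends and the next begins. A minimal sketch (the separator id is hypothetical):

```python
SEP_TOKEN = -1  # hypothetical id reserved for the shared separator

def pack_with_separators(per_image_tokens):
    """Concatenate token sequences of several images, marking each inception."""
    packed = []
    for tokens in per_image_tokens:
        packed.append(SEP_TOKEN)
        packed.extend(tokens)
    return packed

packed = pack_with_separators([[11, 12, 13], [21, 22, 23], [31, 32, 33]])
```

Packing four images per training sequence this way quadruples the autoregressive context length while leaving each image's own tokens untouched, matching the abstract's description.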

[CV-82] When and Where to Reset Matters for Long-Term Test-Time Adaptation ICLR2026

【速读】:该论文旨在解决持续测试时自适应(Continual Test-Time Adaptation, TTA)在长期运行中因误差累积导致的模型坍塌(model collapse)问题,即模型逐渐退化为仅对所有输入预测少数几个类别。传统方法采用周期性完全重置策略来清除累积误差,但存在两个关键缺陷:一是重置时机与实际坍塌风险无关,导致适应效果次优;二是全量重置会灾难性地丢失长期积累的有效知识,而这些知识在未来可能仍具价值。为此,作者提出三项核心解决方案:(1)自适应选择性重置(Adaptive and Selective Reset, ASR)机制,动态判断何时及何处进行局部重置以最小化干扰;(2)基于重要性的正则化项(importance-aware regularizer),用于恢复因重置而丢失的关键知识;(3)在线调整适应策略(on-the-fly adaptation adjustment),提升在极端领域偏移下的适应能力。实验表明,该方案在长周期TTA基准上显著优于现有方法,尤其在挑战性场景下表现突出。

链接: https://arxiv.org/abs/2603.03796
作者: Taejun Lim,Joong-Won Hwang,Kibok Lee
机构: Yonsei University (延世大学); ETRI (电子和电信研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026

点击查看摘要

Abstract:When continual test-time adaptation (TTA) persists over the long term, errors accumulate in the model and further cause it to predict only a few classes for all inputs, a phenomenon known as model collapse. Recent studies have explored reset strategies that completely erase these accumulated errors. However, their periodic resets lead to suboptimal adaptation, as they occur independently of the actual risk of collapse. Moreover, their full resets cause catastrophic loss of knowledge acquired over time, even though such knowledge could be beneficial in the future. To this end, we propose (1) an Adaptive and Selective Reset (ASR) scheme that dynamically determines when and where to reset, (2) an importance-aware regularizer to recover essential knowledge lost due to reset, and (3) an on-the-fly adaptation adjustment scheme to enhance adaptability under challenging domain shifts. Extensive experiments across long-term TTA benchmarks demonstrate the effectiveness of our approach, particularly under challenging conditions. Our code is available at this https URL.
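The "when and where" logic of Adaptive and Selective Reset can be sketched as follows; the collapse signal (mean prediction entropy) and both thresholds are illustrative assumptions, not the paper's actual criteria:

```python
import numpy as np

def adaptive_selective_reset(adapted, source, pred_entropy,
                             entropy_floor=0.5, drift_thresh=0.2):
    """'When': reset only if average prediction entropy collapses (a common
    symptom of the model predicting few classes). 'Where': reset only the
    parameters that drifted far from the source weights."""
    if pred_entropy >= entropy_floor:      # no collapse risk: keep adapting
        return adapted.copy()
    drift = np.abs(adapted - source)
    return np.where(drift > drift_thresh, source, adapted)

source = np.zeros(5)
adapted = np.array([0.05, 0.50, -0.30, 0.10, 0.00])
kept = adaptive_selective_reset(adapted, source, pred_entropy=1.2)
reset = adaptive_selective_reset(adapted, source, pred_entropy=0.1)
```

Unlike the periodic full resets criticized in the abstract, nothing happens while adaptation is healthy, and even on a triggered reset the lightly-drifted parameters (and the knowledge they encode) survive.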

[CV-83] TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration CVPR2026

【速读】:该论文旨在解决扩散模型(Diffusion Models)在推理阶段速度缓慢的问题,其核心瓶颈在于需要反复进行全模型去噪过程。解决方案的关键在于提出一种无需训练、基于探测的自适应选择机制——Token-Adaptive Predictor (TAP),该方法通过一次前向传播计算第一层输出作为低成本探测信号,进而估算一组候选预测器(主要基于不同阶数和时域范围的泰勒展开)的代理损失,并为每个token动态分配误差最小的预测器。此“探测后选择”策略利用了token间异质的时间动态特性,在几乎无额外开销的前提下显著提升推理效率,同时保持生成质量接近原模型水平。

链接: https://arxiv.org/abs/2603.03792
作者: Haowei Zhu,Tingxuan Huang,Xing Wang,Tianyu Zhao,Jiexi Wang,Weifeng Chen,Xurui Peng,Fangmin Chen,Junhai Yong,Bin Wang
机构: Tsinghua University (清华大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model’s first layer as a low-cost probe to compute proxy losses for a compact family of candidate predictors (instantiated primarily with Taylor expansions of varying order and horizon), then assigns each token the predictor with the smallest proxy error. This per-token “probe-then-select” strategy exploits heterogeneous temporal dynamics, requires no additional training, and is compatible with various predictor designs. TAP incurs negligible overhead while enabling large speedups with little or no perceptual quality loss. Extensive experiments across multiple diffusion architectures and generation tasks show that TAP substantially improves the accuracy-efficiency frontier compared to fixed global predictors and caching-only baselines.
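The per-token "probe-then-select" step can be sketched with two Taylor-style candidates; in the real method the probe is the model's first-layer output and selection uses proxy losses, whereas this toy compares candidates directly against a given probe signal:

```python
import numpy as np

def candidate_predictors(f_prev, f_prev2):
    """Per-token Taylor-style predictions of the next feature from history:
    order 0 reuses the last feature; order 1 extrapolates linearly."""
    return np.stack([f_prev, 2.0 * f_prev - f_prev2])  # (P, tokens, dim)

def probe_then_select(candidates, probe):
    """Assign each token the predictor with the smallest proxy error
    against the cheap probe signal."""
    err = ((candidates - probe[None]) ** 2).sum(axis=-1)   # (P, tokens)
    best = err.argmin(axis=0)                              # (tokens,)
    return candidates[best, np.arange(candidates.shape[1])]

# token 0 moves linearly (order 1 should win); token 1 is static (order 0 wins)
f_prev2 = np.array([[0.0, 0.0], [1.0, 1.0]])
f_prev  = np.array([[1.0, 1.0], [1.0, 1.0]])
probe   = np.array([[2.0, 2.0], [1.0, 1.0]])
out = probe_then_select(candidate_predictors(f_prev, f_prev2), probe)
```

The point of the per-token assignment is visible even in this toy: a single global predictor would be wrong for one of the two tokens, whereas the selected mix matches both.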

[CV-84] Small Object Detection in Complex Backgrounds with Multi-Scale Attention and Global Relation Modeling

【速读】:该论文旨在解决复杂背景下小目标检测中存在的特征退化、语义表征弱化及定位不准等问题,这些问题主要由下采样操作和背景干扰引起。其核心解决方案包括:引入残差哈尔小波下采样模块(Residual Haar Wavelet Downsampling module),通过联合利用空域卷积特征与频域表示来保留细粒度结构细节;设计全局关系建模模块(Global Relation Modeling module)以捕捉高层特征阶段的长程依赖关系,增强语义感知并抑制背景噪声;提出跨尺度混合注意力模块(Cross-Scale Hybrid Attention module),实现多尺度特征间的稀疏且对齐交互,从而高效融合高分辨率细节与高层语义信息;同时引入中心辅助损失(Center-Assisted Loss)稳定训练过程并提升小目标定位精度。上述组件共同构成了一个面向小目标检测的多级特征增强与全局关系建模框架,显著提升了检测性能。

链接: https://arxiv.org/abs/2603.03788
作者: Wenguang Tao,Xiaotian Wang,Tian Yan,Yi Wang,Jie Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Small object detection under complex backgrounds remains a challenging task due to severe feature degradation, weak semantic representation, and inaccurate localization caused by downsampling operations and background interference. Existing detection frameworks are mainly designed for general objects and often fail to explicitly address the unique characteristics of small objects, such as limited structural cues and strong sensitivity to localization errors. In this paper, we propose a multi-level feature enhancement and global relation modeling framework tailored for small object detection. Specifically, a Residual Haar Wavelet Downsampling module is introduced to preserve fine-grained structural details by jointly exploiting spatial-domain convolutional features and frequency-domain representations. To enhance global semantic awareness and suppress background noise, a Global Relation Modeling module is employed to capture long-range dependencies at high-level feature stages. Furthermore, a Cross-Scale Hybrid Attention module is designed to establish sparse and aligned interactions across multi-scale features, enabling effective fusion of high-resolution details and high-level semantic information with reduced computational overhead. Finally, a Center-Assisted Loss is incorporated to stabilize training and improve localization accuracy for small objects. Extensive experiments conducted on the large-scale RGBT-Tiny benchmark demonstrate that the proposed method consistently outperforms existing state-of-the-art detectors under both IoU-based and scale-adaptive evaluation metrics. These results validate the effectiveness and robustness of the proposed framework for small object detection in complex environments.
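The Residual Haar Wavelet Downsampling module builds on the standard one-level 2D Haar transform, which halves resolution without discarding high-frequency structure. The transform itself (without the module's residual convolutional branch, which the abstract does not detail) is:

```python
import numpy as np

def haar_downsample(x):
    """One-level 2D Haar transform of an (H, W) map with even H, W.
    Returns the low-low band plus three detail bands, each (H/2, W/2)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # low-frequency structure
    lh = (a - b + c - d) / 4.0   # horizontal detail
    hl = (a + b - c - d) / 4.0   # vertical detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

x = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar_downsample(x)
```

Because the four subbands together are invertible (e.g. the top-left pixel of each 2x2 block equals ll + lh + hl + hh), small-object cues that strided convolution or pooling would average away remain recoverable after downsampling.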

[CV-85] DMD-augmented Unpaired Neural Schrödinger Bridge for Ultra-Low Field MRI Enhancement

【速读】:该论文旨在解决超低场(64 mT)脑部磁共振成像(MRI)因图像质量较低而限制临床应用的问题,尤其是在缺乏配对64 mT与3 T扫描数据的情况下,如何实现从64 mT到3 T的高质量图像转换。其解决方案的关键在于提出一种无配对的64 mT → 3 T图像翻译框架,基于未配对神经薛定谔桥(Unpaired Neural Schrödinger Bridge, UNSB),引入多步精炼机制,并通过两个核心改进增强现实性和结构保真度:一是利用冻结的3 T扩散教师模型,结合DMD2风格的扩散引导分布匹配来强化目标分布对齐;二是融合PatchNCE与解剖结构保持(Anatomical Structure Preservation, ASP)正则化项,显式约束全局结构一致性,包括软前景-背景一致性与边界感知约束,从而在无配对基准上提升分布级真实性,在配对队列上显著增强结构保真度。

链接: https://arxiv.org/abs/2603.03769
作者: Youngmin Kim,Jaeyun Shin,Jeongchan Kim,Taehoon Lee,Jaemin Kim,Peter Hsu,Jelle Veraart,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ultra-Low Field (64 mT) brain MRI improves accessibility but suffers from reduced image quality compared to 3 T. As paired 64 mT-3 T scans are scarce, we propose an unpaired 64 mT → 3 T translation framework that enhances realism while preserving anatomy. Our method builds upon the Unpaired Neural Schrödinger Bridge (UNSB) with multi-step refinement. To strengthen target distribution alignment, we augment the adversarial objective with DMD2-style diffusion-guided distribution matching using a frozen 3 T diffusion teacher. To explicitly constrain global structure beyond patch-level correspondence, we combine PatchNCE with an Anatomical Structure Preservation (ASP) regularizer that enforces soft foreground-background consistency and boundary-aware constraints. Evaluated on two disjoint cohorts, the proposed framework achieves an improved realism-structure trade-off, enhancing distribution-level realism on unpaired benchmarks while increasing structural fidelity on the paired cohort compared to unpaired baselines.

[CV-86] LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

【速读】:该论文旨在解决自动驾驶感知与仿真中深度估计的三大挑战:高精度度量深度(metric depth)难以实现、多视角与时间一致性不足,以及跨域泛化能力弱的问题。其解决方案的核心在于提出DriveMVS框架,通过两个关键创新实现多目标协同优化:一是利用稀疏但度量准确的激光雷达(LiDAR)观测作为几何提示(geometric prompt),锚定深度估计的绝对尺度;二是引入三线索融合机制(triple-cue combiner)对多样特征进行深层融合以缓解歧义并提升鲁棒性,同时采用时空解码器(spatio-temporal decoder)联合利用多视图几何信息与相邻帧的时间上下文,保障时序一致性。此设计使模型在多个基准上达到最优性能,尤其在度量准确性、时序稳定性及零样本跨域迁移方面表现突出。

链接: https://arxiv.org/abs/2603.03765
作者: Qihao Sun,Jiarun Liu,Ziqian Ni,Jianyun Xu,Tao Xie,Lijun Zhao,Ruifeng Li,Sheng Yang
机构: CaiNiao Inc., Alibaba Group; Harbin Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization. To address these challenges, we present DriveMVS, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames. Built upon these principles, DriveMVS embeds the LiDAR prompt in two ways: as a hard geometric prior that anchors the cost volume, and as soft feature-wise guidance fused by a triple-cue combiner. Regarding temporal consistency, DriveMVS employs a spatio-temporal decoder that jointly leverages geometric cues from the MVS cost volume and temporal context from neighboring frames. Experiments show that DriveMVS achieves state-of-the-art performance on multiple benchmarks, excelling in metric accuracy, temporal stability, and zero-shot cross-domain transfer, demonstrating its practical value for scalable, reliable autonomous driving systems.
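The "hard geometric prior" use of LiDAR can be sketched as injecting sparse metric depths directly into the cost volume before the decision over depth hypotheses; the boost value, bin layout, and shapes below are hypothetical:

```python
import numpy as np

def anchor_cost_volume(cost, lidar_depth, depth_bins, boost=5.0):
    """cost: (D, H, W) matching scores over depth hypotheses.
    lidar_depth: (H, W) sparse metric depth, 0 where there is no return."""
    anchored = cost.copy()
    for y, x in zip(*np.nonzero(lidar_depth)):
        d = int(np.abs(depth_bins - lidar_depth[y, x]).argmin())
        anchored[d, y, x] += boost      # pin the hypothesis at the LiDAR depth
    return anchored

depth_bins = np.linspace(1.0, 50.0, 32)
cost = np.zeros((32, 4, 4))
lidar = np.zeros((4, 4))
lidar[1, 2] = 10.0                      # a single sparse return at 10 m
anchored = anchor_cost_volume(cost, lidar, depth_bins)
winner = anchored[:, 1, 2].argmax()
```

Only pixels with a return are touched, which is why sparse LiDAR can fix the absolute scale without overriding the photometric matching elsewhere; the paper's "soft feature-wise guidance" path is a separate, learned complement to this.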

[CV-87] Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

【速读】:该论文旨在解决细粒度视觉理解从静态分类向知识增强推理转变过程中存在的局限性,即现有方法受限于封闭集分类体系和单标签预测,在开放集或依赖上下文的场景下性能显著下降。解决方案的关键在于提出一种统一框架——知识增强细粒度推理代理(Knowledge-Augmented Fine-Grained Reasoning Agent, KFRA),其通过三阶段闭环推理机制实现证据驱动的可解释推理:首先进行开放词汇检测与网络规模知识检索以生成类别假设;其次利用全局到局部聚焦机制将文本知识与视觉证据对齐,定位判别区域;最后在大型多模态模型中整合多模态证据完成推理。KFRA创新性地建立检索-接地耦合机制,使检索到的知识转化为空间上具象化的验证证据,从而在多种细粒度场景下实现事实准确、可解释且任务无关的推理能力。

链接: https://arxiv.org/abs/2603.03762
作者: Junhan Chen,Zilu Zhou,Yujun Tong,Dongliang Chang,Yitao Luo,Zhanyu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative regions localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.

[CV-88] WSI-INR: Implicit Neural Representations for Lesion Segmentation in Whole-Slide Images

【速读】:该论文旨在解决全切片图像(Whole-slide images, WSIs)在计算病理学中进行病灶分割时存在的空间连续性破坏与多分辨率适应性差的问题。现有方法将WSI划分为离散块,导致分割结果空间碎片化,并且不同分辨率视图被当作独立样本处理,从而降低了模型对分辨率变化的鲁棒性。其解决方案的关键在于提出一种基于隐式神经表示(Implicit Neural Representations, INRs)的无块划分框架——WSI-INR,该框架将WSI建模为一个从空间坐标到组织语义特征的连续隐函数,通过引入多分辨率哈希网格编码(multi-resolution hash grid encoding),将不同分辨率视为同一连续组织的不同采样密度,实现跨分辨率的一致特征表示;同时,通过共享INR解码器联合训练,捕获不同病例间的通用先验信息,显著提升了分割精度与鲁棒性。

链接: https://arxiv.org/abs/2603.03749
作者: Yunheng Wu,Wenqi Huang,Liangyi Wang,Masahiro Oda,Yuichiro Hayashi,Daniel Rueckert,Kensaku Mori
机构: Nagoya University (名古屋大学); JST Moonshot R&D (日本科学技术振兴机构月球计划研发项目)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Whole-slide images (WSIs) are fundamental for computational pathology, where accurate lesion segmentation is critical for clinical decision making. Existing methods partition WSIs into discrete patches, disrupting spatial continuity and treating multi-resolution views as independent samples, which leads to spatially fragmented segmentation and reduced robustness to resolution variations. To address the issues, we propose WSI-INR, a novel patch-free framework based on Implicit Neural Representations (INRs). WSI-INR models the WSI as a continuous implicit function mapping spatial coordinates directly to tissue semantics features, outputting segmentation results while preserving intrinsic spatial information across the entire slide. In the WSI-INR, we incorporate multi-resolution hash grid encoding to regard different resolution levels as varying sampling densities of the same continuous tissue, achieving a consistent feature representation across resolutions. In addition, by jointly training a shared INR decoder, WSI-INR can capture general priors across different cases. Experimental results showed that WSI-INR maintains robust segmentation performance across resolutions; at Base/4, our resolution-specific optimization improves Dice score by +26.11%, while U-Net and TransUNet decrease by 54.28% and 36.18%, respectively. Crucially, this work enables INRs to segment highly heterogeneous pathological lesions beyond structurally consistent anatomical tissues, offering a fresh perspective for pathological analysis.
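The multi-resolution hash grid encoding at the heart of WSI-INR maps a continuous coordinate to features at several grid resolutions via hashed table lookups. A simplified sketch (nearest-vertex lookup instead of the bilinear interpolation real grids use; primes as in Instant-NGP-style encodings; table/level sizes illustrative):

```python
import numpy as np

PRIMES = (1, 2654435761)   # spatial-hash primes, as in Instant-NGP-style encodings

def hash_encode(coords, tables, resolutions):
    """coords: (N, 2) in [0, 1). One feature per level via a nearest-vertex
    hash lookup (real grids bilinearly interpolate 4 surrounding vertices)."""
    feats = []
    for table, res in zip(tables, resolutions):
        ij = np.floor(coords * res).astype(np.int64)
        idx = (ij[:, 0] * PRIMES[0] ^ ij[:, 1] * PRIMES[1]) % len(table)
        feats.append(table[idx])
    return np.concatenate(feats, axis=1)   # (N, levels * feat_dim)

rng = np.random.default_rng(0)
tables = [rng.normal(size=(1 << 10, 2)) for _ in range(4)]  # 4 levels, dim 2
resolutions = [16, 32, 64, 128]    # coarse-to-fine sampling of the same slide
coords = rng.random((8, 2))
enc = hash_encode(coords, tables, resolutions)
```

The coarse-to-fine level stack is what lets the paper treat different WSI magnifications as different sampling densities of one continuous tissue: a query at any resolution reads from the same shared tables.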

[CV-89] DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation CVPR2026

【速读】:该论文旨在解决从非定标多视角视频输入中准确估计视图一致的几何结构与相机位姿的问题,尤其在高空间分辨率和长序列场景下仍保持精度与效率。其解决方案的关键在于提出了一种双流Transformer架构(DAGE),通过解耦全局一致性与细节信息:低分辨率流利用交替帧/全局注意力机制高效构建视图一致表示并估计相机位姿;高分辨率流则逐帧处理原始图像以保留锐利边界与小尺度结构;两者通过轻量级交叉注意力适配器融合,将全局上下文注入预训练单帧路径而不破坏其性能。该设计实现了分辨率与片段长度的独立扩展,支持最高2K输入,并维持实用的推理成本,显著提升了视频几何估计与多视角重建的性能,达到了当前最优水平。

链接: https://arxiv.org/abs/2603.03744
作者: Tuan Duc Ngo,Jiahui Huang,Seoung Wug Oh,Kevin Blackburn-Matzen,Evangelos Kalogerakis,Chuang Gan,Joon-Young Lee
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging - especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
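The lightweight adapter's cross-attention fusion can be sketched as follows; the real adapter has learned query/key/value projections, which are omitted here, so this only shows the residual "inject global context without disturbing the pretrained pathway" structure:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adapter_fuse(hi_tokens, lo_tokens):
    """High-res tokens attend to the low-res global stream; the retrieved
    context is added residually so the pretrained per-frame pathway is
    left intact when the global signal is uninformative."""
    scale = 1.0 / np.sqrt(hi_tokens.shape[-1])
    attn = softmax(hi_tokens @ lo_tokens.T * scale)   # (n_hi, n_lo)
    return hi_tokens + attn @ lo_tokens

rng = np.random.default_rng(0)
hi = rng.normal(size=(16, 8))    # fine-detail tokens from the full-res frame
lo = rng.normal(size=(4, 8))     # view-consistent tokens from the global stream
fused = adapter_fuse(hi, lo)
```

Because the global stream runs on aggressively downsampled frames, `n_lo` stays small regardless of input resolution, which is what lets resolution and clip length scale independently.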

[CV-90] PROSPECT: Unified Streaming Vision-Language Navigation via Semantic–Spatial Fusion and Latent Predictive Representation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在零样本端到端视觉-语言导航(Vision-Language Navigation, VLN)任务中,仅依赖语义理解而缺乏对环境动态和空间结构的预测建模问题,从而导致长距离导航鲁棒性不足。解决方案的关键在于提出PROSPECT——一个统一的流式导航代理,其核心是将流式视觉-语言-动作(Vision-Language-Action, VLA)策略与潜在预测表示学习相结合:通过CUT3R作为流式3D基础空间编码器生成具有绝对尺度的长上下文空间特征,并与SigLIP语义特征通过交叉注意力融合;训练时引入可学习的流查询令牌(stream query tokens),以预测下一步的2D和3D潜在特征(而非像素或显式模态),并在冻结的SigLIP和CUT3R教师模型的潜在空间中进行监督,从而在不增加推理开销的前提下优化内部表征。

链接: https://arxiv.org/abs/2603.03739
作者: Zehua Fan,Wenqi Lyu,Wenxuan Song,Linge Zhao,Yifei Yang,Xi Wang,Junjie He,Lida Huang,Haiyan Liu,Bingchuan Sun,Guangjun Bao,Xuanyao Mao,Liang Xu,Yan Wang,Feng Gao
机构: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); University of Adelaide (阿德莱德大学); Wuhan University (武汉大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Beijing Jiaotong University (北京交通大学); AIR Wuxi Innovation Center, Tsinghua University (清华大学无锡创新中心); Lenovo (联想)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have advanced zero-shot end-to-end Vision-Language Navigation (VLN), yet robust navigation requires not only semantic understanding but also predictive modeling of environment dynamics and spatial structure. We propose PROSPECT, a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. PROSPECT uses CUT3R as a streaming 3D foundation spatial encoder to produce long-context, absolute-scale spatial features, and fuses them with SigLIP semantic features via cross-attention. During training, we introduce learnable stream query tokens that query the streaming context and predict next-step 2D and 3D latent features (rather than pixels or explicit modalities), supervised in the latent spaces of frozen SigLIP and CUT3R teachers. The predictive branch shapes internal representations without inference overhead. Experiments on VLN-CE benchmarks and real-robot deployment demonstrate state-of-the-art performance and improved long-horizon robustness under diverse lighting. We will release code for the community soon.

[CV-91] QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment CVPR2026

【速读】:该论文旨在解决无参考点云质量评估(No-Reference Point Cloud Quality Assessment, NR-PCQA)中因标注数据稀缺导致的泛化能力不足问题。现有方法难以有效迁移图像域中的感知质量先验至点云域,主要受限于对感知质量关键特性(如质量排序敏感性和质量感知特征对齐)的忽视。解决方案的关键在于提出一种新颖的质量感知域适应框架QD-PCQA,其核心创新包括:i)基于排名加权的条件对齐(Rank-weighted Conditional Alignment, RCA)策略,通过在一致质量水平下对齐特征并自适应强化误排序样本,增强对感知质量排序的敏感性;ii)质量引导的特征增强(Quality-guided Feature Augmentation, QFA)策略,结合质量引导的风格混叠、多层扩展与双域增强模块,提升感知特征对齐的鲁棒性与判别力。

链接: https://arxiv.org/abs/2603.03726
作者: Guohua Zhang,Jian Jin,Meiqin Liu,Chao Yao,Weisi Lin
机构: Beijing Jiaotong University (北京交通大学); Nanyang Technological University (南洋理工大学); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:No-Reference Point Cloud Quality Assessment (NR-PCQA) still struggles with generalization, primarily due to the scarcity of annotated point cloud datasets. Since the Human Visual System (HVS) drives perceptual quality assessment independently of media types, prior knowledge on quality learned from images can be repurposed for point clouds. This insight motivates adopting Unsupervised Domain Adaptation (UDA) to transfer quality-relevant priors from labeled images to unlabeled point clouds. However, existing UDA-based PCQA methods often overlook key characteristics of perceptual quality, such as sensitivity to quality ranking and quality-aware feature alignment, thereby limiting their effectiveness. To address these issues, we propose a novel Quality-aware Domain adaptation framework for PCQA, termed QD-PCQA. The framework comprises two main components: i) a Rank-weighted Conditional Alignment (RCA) strategy that aligns features under consistent quality levels and adaptively emphasizes misranked samples to reinforce perceptual quality ranking awareness; and ii) a Quality-guided Feature Augmentation (QFA) strategy, which includes quality-guided style mixup, multi-layer extension, and dual-domain augmentation modules to augment perceptual feature alignment. Extensive cross-domain experiments demonstrate that QD-PCQA significantly improves generalization in NR-PCQA tasks. The code is available at this https URL.
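RCA策略中"自适应强化误排序样本"的思想,可用一个简化的成对排序加权来示意:对每一对真实质量(MOS)顺序与预测顺序相悖的样本,增大其权重。注意,这里的具体加权形式是示意性假设,并非论文的原始实现:

```python
import numpy as np

def rank_weights(pred, mos, margin=0.0):
    """对每一对预测顺序与MOS真实顺序相悖的样本对, 累加双方权重。"""
    n = len(pred)
    w = np.ones(n)
    for i in range(n):
        for j in range(n):
            if mos[i] > mos[j] and pred[i] <= pred[j] + margin:
                w[i] += 1.0
                w[j] += 1.0
    return w / w.sum()

w = rank_weights(np.array([3.0, 1.0, 2.0]), np.array([3.0, 2.0, 1.0]))
print(w)  # [0.2 0.4 0.4] -- 误排序的一对样本获得更高权重
```

完全排序正确时权重退化为均匀分布;论文中该权重进一步嵌入条件对齐损失中使用。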

[CV-92] Glass Segmentation with Fusion of Learned and General Visual Features

【速读】:该论文旨在解决从RGB图像中对玻璃表面进行精准分割的问题,这一任务因玻璃作为透明材料缺乏显著视觉特征而极具挑战性。然而,准确识别玻璃表面对于场景理解与机器人导航至关重要,因其必须被判定为固体材质。解决方案的关键在于提出一种双主干(dual-backbone)架构:其中一主干采用冻结的DINOv3视觉基础模型提取通用视觉特征,另一主干则利用监督训练的Swin Transformer模型学习任务特定特征;随后通过残差Squeeze-and-Excitation通道缩减模块对多尺度特征进行下采样,并输入Mask2Former解码器生成最终分割掩膜。该方法在多个公开玻璃分割数据集上实现了最先进的性能,且在轻量化DINOv3变体下推理速度优于当前最优方法。

链接: https://arxiv.org/abs/2603.03718
作者: Risto Ojala,Tristan Ellison,Mo Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Glass surface segmentation from RGB images is a challenging task, since glass as a transparent material distinctly lacks visual characteristics. However, glass segmentation is critical for scene understanding and robotics, as transparent glass surfaces must be identified as solid material. This paper presents a novel architecture for glass segmentation, deploying a dual-backbone producing general visual features as well as task-specific learned visual features. General visual features are produced by a frozen DINOv3 vision foundation model, and the task-specific features are generated with a Swin model trained in a supervised manner. Resulting multi-scale feature representations are downsampled with residual Squeeze-and-Excitation Channel Reduction, and fed into a Mask2Former Decoder, producing the final segmentation masks. The architecture was evaluated on four commonly used glass segmentation datasets, achieving state-of-the-art results on several accuracy metrics. The model also has a competitive inference speed compared to the previous state-of-the-art method, and surpasses it when using a lighter DINOv3 backbone variant. The implementation source code and model weights are available at: this https URL
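文中提到的Squeeze-and-Excitation通道缩减,其核心的SE门控可用如下最小示意说明(省略了论文中的残差连接与通道降维投影,权重矩阵 w1、w2 均为假设的示例参数):

```python
import numpy as np

def se_channel_weights(feat, w1, w2):
    """SE门控: squeeze(全局平均池化) -> excite(两层小映射) -> 通道重加权。"""
    z = feat.mean(axis=(1, 2))                  # squeeze: (C,)
    h = np.maximum(w1 @ z, 0.0)                 # 瓶颈层 + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))         # sigmoid门控: (C,)
    return feat * s[:, None, None]              # 逐通道缩放

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))           # (C, H, W) 特征图
out = se_channel_weights(feat, rng.standard_normal((2, 4)) * 0.1,
                         rng.standard_normal((4, 2)) * 0.1)
print(out.shape)  # (4, 8, 8)
```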

[CV-93] LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

【速读】:该论文旨在解决局部差分隐私(Local Differential Privacy, LDP)在图像数据应用中因高维像素空间导致的性能退化问题。传统LDP机制针对低维数据设计,在处理高维图像时会造成显著的效用损失,这并非LDP本身的局限,而是由于其直接应用于不合适的像素级表示所致。解决方案的关键在于提出LDP-Slicing框架,通过将像素值分解为二进制位平面(bit-planes)实现对图像数据的位级隐私保护;同时引入感知混淆模块以减少人类可感知的信息泄露,并采用基于优化的隐私预算分配策略,在满足严格的像素级ε-LDP条件下有效保留下游任务所需的图像效用。

链接: https://arxiv.org/abs/2603.03711
作者: Yuanming Cao,Chengqi Li,Wenbo He
机构: McMaster University (麦克马斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level \varepsilon -LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
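按位平面施加随机响应(randomized response)的基本流程可示意如下:将8位像素拆成8个位平面,对每一位以满足相应预算的概率翻转后重组。这里把总预算均匀分给8个位平面只是简化假设,论文实际采用基于优化的预算分配策略:

```python
import numpy as np

def bitplane_randomized_response(pixels, epsilon, rng=None):
    """对8位像素按位平面施加随机响应的最小示意。"""
    rng = np.random.default_rng(rng)
    eps_bit = epsilon / 8.0                       # 简化: 预算均匀分配到8个位平面
    p_keep = 1.0 / (1.0 + np.exp(-eps_bit))       # 随机响应保留真值的概率
    pixels = np.asarray(pixels, dtype=np.uint8)
    out = np.zeros_like(pixels)
    for b in range(8):                            # 逐位平面切片
        bits = (pixels >> b) & 1
        flip = rng.random(bits.shape) >= p_keep   # 以 1 - p_keep 概率翻转
        noisy = np.where(flip, 1 - bits, bits)
        out |= noisy.astype(np.uint8) << b        # 重组像素
    return out

img = np.array([[0, 128], [255, 64]], dtype=np.uint8)
print(bitplane_randomized_response(img, epsilon=8.0, rng=0).shape)  # (2, 2)
```

预算越大,每一位保留真值的概率越接近1;预算趋于无穷时输出退化为原图。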

[CV-94] MPFlow: Multi-modal Posterior-Guided Flow Matching for Zero-Shot MRI Reconstruction

【速读】:该论文旨在解决零样本磁共振成像(MRI)重建中因单一模态无条件先验模型在严重欠定条件下产生幻觉(hallucinations)的问题,尤其在临床实践中常存在高质量结构扫描等辅助模态信息却未被有效利用的现状。其解决方案的关键在于提出MPFlow框架,该框架基于校正流(rectified flow)构建,无需重新训练生成先验即可在推理阶段引入辅助MRI模态信息以提升解剖学保真度;通过提出的自监督预训练策略PAMRI(Patch-level Multi-modal MR Image Pretraining)学习跨模态共享表示,并在采样过程中联合施加数据一致性约束与跨模态特征对齐机制,从而系统性抑制内在和外在幻觉。实验表明,MPFlow在仅使用扩散基线20%采样步数的情况下保持图像质量并显著降低肿瘤区域幻觉(分割Dice分数提升超15%),验证了跨模态引导的有效性与可靠性。

链接: https://arxiv.org/abs/2603.03710
作者: Seunghoi Kim,Chen Jin,Henry F. J. Tregidgo,Matteo Figini,Daniel C. Alexander
机构: University College London (伦敦大学学院); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot MRI reconstruction relies on generative priors, but single-modality unconditional priors produce hallucinations under severe ill-posedness. In many clinical workflows, complementary MRI acquisitions (e.g. high-quality structural scans) are routinely available, yet existing reconstruction methods lack mechanisms to leverage this additional information. We propose MPFlow, a zero-shot multi-modal reconstruction framework built on rectified flow that incorporates auxiliary MRI modalities at inference time without retraining the generative prior to improve anatomical fidelity. Cross-modal guidance is enabled by our proposed self-supervised pretraining strategy, Patch-level Multi-modal MR Image Pretraining (PAMRI), which learns shared representations across modalities. Sampling is jointly guided by data consistency and cross-modal feature alignment using pre-trained PAMRI, systematically suppressing intrinsic and extrinsic hallucinations. Extensive experiments on HCP and BraTS show that MPFlow matches diffusion baselines on image quality using only 20% of sampling steps while reducing tumor hallucinations by more than 15% (segmentation dice score). This demonstrates that cross-modal guidance enables more reliable and efficient zero-shot MRI reconstruction.

[CV-95] Error as Signal: Stiffness-Aware Diffusion Sampling via Embedded Runge-Kutta Guidance ICLR2026

【速读】:该论文旨在解决扩散模型中由于求解器引入的局部截断误差(Local Truncation Error, LTE)在刚性区域(stiff regions)导致的样本质量下降问题。现有方法如自引导(Autoguidance, AG)依赖辅助网络且未有效处理solver-induced errors,而本文的关键创新在于观察到这些误差与主导特征向量(dominant eigenvector)方向一致,从而提出嵌入式Runge-Kutta引导(Embedded Runge-Kutta Guidance, ERK-Guid),通过检测刚性并利用求解器误差作为引导信号来降低LTE并稳定采样过程。理论与实证分析表明,ERK-Guid在合成数据集和ImageNet上均显著优于当前最优方法。

链接: https://arxiv.org/abs/2603.03692
作者: Inho Kong,Sojin Lee,Youngjoon Hong,Hyunwoo J. Kim
机构: Korea University (高丽大学); KAIST (韩国科学技术院); Seoul National University (首尔国立大学); Korea Institute for Advanced Study (韩国高等科学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Classifier-Free Guidance (CFG) has established the foundation for guidance mechanisms in diffusion models, showing that well-designed guidance proxies significantly improve conditional generation and sample quality. Autoguidance (AG) has extended this idea, but it relies on an auxiliary network and leaves solver-induced errors unaddressed. In stiff regions, the ODE trajectory changes sharply, where local truncation error (LTE) becomes a critical factor that deteriorates sample quality. Our key observation is that these errors align with the dominant eigenvector, motivating us to leverage the solver-induced error as a guidance signal. We propose Embedded Runge-Kutta Guidance (ERK-Guid), which exploits detected stiffness to reduce LTE and stabilize sampling. We theoretically and empirically analyze stiffness and eigenvector estimators with solver errors to motivate the design of ERK-Guid. Our experiments on both synthetic datasets and the popular benchmark dataset, ImageNet, demonstrate that ERK-Guid consistently outperforms state-of-the-art methods. Code is available at this https URL.
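嵌入式Runge-Kutta利用高低阶解之差来估计局部截断误差(LTE),该误差即可作为刚性检测信号。下面以Heun方法内嵌Euler估计为例给出最小示意(与论文针对扩散ODE的具体实现无关,仅说明"误差即信号"的思路):

```python
import numpy as np

def heun_with_error(f, t, x, h):
    """Heun(二阶)一步, 内嵌Euler(一阶)解, 两者之差即LTE估计。"""
    k1 = f(t, x)
    euler = x + h * k1                      # 内嵌一阶解
    k2 = f(t + h, euler)
    heun = x + 0.5 * h * (k1 + k2)          # 二阶解
    lte = np.linalg.norm(heun - euler)      # 局部截断误差估计, 即"刚性信号"
    return heun, lte

# 刚性线性ODE x' = -50x: 步长越大, 误差信号越显著
f = lambda t, x: -50.0 * x
_, err_small = heun_with_error(f, 0.0, np.array([1.0]), 0.001)
_, err_large = heun_with_error(f, 0.0, np.array([1.0]), 0.1)
print(err_small < err_large)  # True
```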

[CV-96] EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂场景下推理效率低下的问题,尤其是由高分辨率图像和视频带来的视觉标记(visual tokens)数量指数级增长所导致的计算开销。现有视觉标记剪枝方法主要在视觉编码之后进行,忽略了编码阶段本身产生的巨大计算成本。解决方案的关键在于提出EvoPrune,一种在视觉编码早期阶段直接执行剪枝的方法,其采用分层剪枝策略,基于标记相似性、多样性及注意力重要性来保留关键视觉信息,从而显著提升推理速度并最小化性能损失。

链接: https://arxiv.org/abs/2603.03681
作者: Yuhao Chen,Bin Shan,Xin Ye,Cheng Chen
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the exponential growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME dataset, EvoPrune achieves 2 \times inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.
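基于注意力重要性与冗余度(相似性)选取保留令牌的过程可示意如下(0.5/0.5 的评分权重为假设的简化选择,并非论文的具体配置):

```python
import numpy as np

def prune_tokens(tokens, attn_to_cls, keep_ratio=0.5):
    """按注意力重要性与多样性(1 - 平均余弦相似度)评分, 保留top-k视觉令牌。"""
    n = tokens.shape[0]
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                          # 两两余弦相似度
    redundancy = (sim.sum(axis=1) - 1.0) / (n - 1)   # 与其他令牌的平均相似度
    score = 0.5 * attn_to_cls + 0.5 * (1.0 - redundancy)
    k = max(1, int(round(keep_ratio * n)))
    keep = np.sort(np.argsort(score)[::-1][:k])      # top-k并保持原顺序
    return tokens[keep], keep

tokens = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
attn = np.array([0.1, 0.1, 0.4, 0.4])
kept, idx = prune_tokens(tokens, attn, keep_ratio=0.5)
print(idx)  # [2 3] -- 重复的冗余令牌被剪掉
```

EvoPrune在编码器的多个选定层上逐层执行类似的筛选,因此剪枝收益在编码阶段即可兑现。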

[CV-97] Machine Pareidolia: Protecting Facial Image with Emotional Editing AAAI

【速读】:该论文旨在解决面部识别(Facial Recognition, FR)系统在数字环境中引发的隐私泄露问题,特别是针对恶意使用FR模型所带来的威胁。传统防护方法如妆容风格迁移(makeup style transfer)在黑盒场景下迁移能力弱,且对不同人群(如男性及肤色较深个体)适用性有限。解决方案的关键在于提出一种名为MAP的新方法,其核心是通过人类情绪变化来伪装原始身份为目标身份;该方法独特地微调一个得分网络(score network),同时学习目标身份和人类表情两个目标,并通过梯度投影联合优化以收敛至共享局部最优解;此外,借助局部平滑正则化和得分匹配损失优化进一步提升保护图像的感知质量,从而在定性和定量指标上均优于噪声、妆容及自由属性等基线方法,并展现出对在线FR API的有效对抗能力和在罕见拍摄场景中的高适应性。

链接: https://arxiv.org/abs/2603.03665
作者: Binh M. Le,Simon S. Woo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Proceedings of the AAAI Conference on Artificial Intelligence 40

点击查看摘要

Abstract:The proliferation of facial recognition (FR) systems has raised privacy concerns in the digital realm, as malicious uses of FR models pose a significant threat. Traditional countermeasures, such as makeup style transfer, have suffered from low transferability in black-box settings and limited applicability across various demographic groups, including males and individuals with darker skin tones. To address these challenges, we introduce a novel facial privacy protection method, dubbed MAP, a pioneering approach that employs human emotion modifications to disguise original identities as target identities in facial images. Our method uniquely fine-tunes a score network to learn dual objectives, target identity and human expression, which are jointly optimized through gradient projection to ensure convergence at a shared local optimum. Additionally, we enhance the perceptual quality of protected images by applying local smoothness regularization and optimizing the score matching loss within our network. Empirical experiments demonstrate that our innovative approach surpasses previous baselines, including noise-based, makeup-based, and freeform attribute methods, in both qualitative fidelity and quantitative metrics. Furthermore, MAP proves its effectiveness against an online FR API and shows advanced adaptability in uncommon photographic scenarios.
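论文中"通过梯度投影联合优化双目标"的做法,可用通用的冲突梯度投影(PCGrad风格)来示意:当两个目标的梯度内积为负时,将其一投影到另一梯度的正交补上再合并。具体投影规则为示意性假设,未必与论文实现一致:

```python
import numpy as np

def project_conflicting(g_id, g_expr):
    """冲突梯度投影: 若两目标梯度内积为负, 去除g_id在g_expr方向上的分量后合并。"""
    dot = float(g_id @ g_expr)
    if dot < 0.0:
        g_id = g_id - dot / float(g_expr @ g_expr) * g_expr
    return g_id + g_expr

print(project_conflicting(np.array([1.0, 0.0]), np.array([-1.0, 1.0])))  # [-0.5  1.5]
```

投影后两个梯度分量互不冲突(内积非负),合并更新方向可同时下降两个目标。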

[CV-98] InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models CVPR

【速读】:该论文旨在解决当前多模态生成模型在图像编辑任务中普遍缺乏对动态推理能力的问题,尤其是难以建模从初始状态到最终状态之间连贯的中间逻辑路径,从而限制了其在程序性与因果理解层面的深度发展。解决方案的关键在于提出首个专注于评估图像编辑过程中中间路径推理能力的基准测试工具InEdit-Bench,该基准包含四类精心标注的任务类别(状态转换、动态过程、时间序列和科学模拟),并设计了一套细粒度评估标准,用于衡量生成路径的逻辑一致性、视觉自然性和对指定路径约束的保真度,从而系统性地揭示现有模型在此领域的显著不足,并推动更具动态推理能力的智能多模态生成模型的发展。

链接: https://arxiv.org/abs/2603.03657
作者: Zhiqiang Sheng,Xumeng Han,Zhiwei Zhang,Zenghui Xiong,Yifan Ding,Aoxiang Ping,Xiang Li,Tong Guo,Yao Mao
机构: Chinese Academy of Sciences (中国科学院); National Laboratory on Adaptive Optics (自适应光学国家实验室); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR findings. Project page: this https URL

点击查看摘要

Abstract:Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to enable fine-grained evaluation, we propose a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model’s fidelity to specified path constraints. Our comprehensive evaluation of 14 representative image editing models on InEdit-Bench reveals significant and widespread shortcomings in this domain. By providing a standardized and challenging benchmark, we aim for InEdit-Bench to catalyze research and steer development towards more dynamic, reason-aware, and intelligent multimodal generative models.

[CV-99] Field imaging framework for morphological characterization of aggregates with computer vision: Algorithms and applications

【速读】:该论文旨在解决当前建筑骨料(construction aggregates)形态特征表征方法中存在的局限性问题,即传统手段依赖人工目视和测量,而现有成像技术仅适用于规则尺寸且在受控环境下的骨料,难以满足现场多场景复杂工况的需求。其核心解决方案是构建一个面向多种应用场景的场域成像框架(field imaging framework),关键在于提出并实现了一套集成化的三维重建-分割-补全(Reconstruction-Segmentation-Completion, RSC-3D)方法:首先通过多视角图像实现高保真三维建模并建立骨料颗粒库;随后基于该库生成两类用于深度学习的数据集——带实例标签的合成堆场数据与部分完整形状对;在此基础上分别训练了先进的3D实例分割网络和3D形状补全网络,从而实现了对堆场中骨料的自动化2D/3D形态分析及未见面的几何预测,在真实场景中验证了其良好性能。

链接: https://arxiv.org/abs/2603.03654
作者: Haohang Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: PhD thesis

点击查看摘要

Abstract:Construction aggregates, including sand and gravel, crushed stone and riprap, are the core building blocks of the construction industry. State-of-the-practice characterization methods mainly rely on visual inspection and manual measurement. State-of-the-art aggregate imaging methods are only applicable to regular-sized aggregates under well-controlled conditions. This dissertation addresses these major challenges by developing a field imaging framework for the morphological characterization of aggregates as a multi-scenario solution. For individual and non-overlapping aggregates, a field imaging system was designed and the associated segmentation and volume estimation algorithms were developed. For 2D image analyses of aggregates in stockpiles, an automated 2D instance segmentation and morphological analysis approach was established. For 3D point cloud analyses of aggregate stockpiles, an integrated 3D Reconstruction-Segmentation-Completion (RSC-3D) approach was established: 3D reconstruction procedures from multi-view images, 3D stockpile instance segmentation, and 3D shape completion to predict the unseen sides. First, a 3D reconstruction procedure was developed to obtain high-fidelity 3D models of collected aggregate samples, based on which a 3D aggregate particle library was constructed. Next, two datasets were derived from the 3D particle library for 3D learning: a synthetic dataset of aggregate stockpiles with ground-truth instance labels, and a dataset of partial-complete shape pairs, developed with varying-view raycasting schemes. A state-of-the-art 3D instance segmentation network and a 3D shape completion network were trained on the datasets, respectively. The application of the integrated approach was demonstrated on real stockpiles and validated with ground-truth, showing good performance in capturing and predicting the unseen sides of aggregates.

[CV-100] One-Step Face Restoration via Shortcut-Enhanced Coupling Flow

【速读】:该论文旨在解决基于流匹配(Flow Matching, FM)的图像修复方法在人脸修复任务中存在路径交叉、轨迹弯曲及需要多步采样等问题,这些问题源于现有方法从高斯噪声出发,忽略了低质量(Low-Quality, LQ)与高质量(High-Quality, HQ)数据之间的内在依赖关系。解决方案的关键在于提出一种增强捷径耦合流(Shortcut-enhanced Coupling flow for Face Restoration, SCFlowFR):首先建立数据依赖的耦合机制以显式建模LQ–HQ依赖关系,从而减少路径交叉并促进近线性传输;其次引入条件均值估计获得粗略预测,优化源锚点并稳定大步长更新的速率场;最后通过捷径约束监督任意时间区间内的平均速度,实现精确的一步推断。

链接: https://arxiv.org/abs/2603.03648
作者: Xiaohui Sun,Hanlin Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face restoration has advanced significantly with generative models like diffusion models and flow matching (FM), which learn continuous-time mappings between distributions. However, existing FM-based approaches often start from Gaussian noise, ignoring the inherent dependency between low-quality (LQ) and high-quality (HQ) data, resulting in path crossovers, curved trajectories, and multi-step sampling requirements. To address these issues, we propose Shortcut-enhanced Coupling flow for Face Restoration (SCFlowFR). First, it establishes a data-dependent coupling that explicitly models the LQ–HQ dependency, minimizing path crossovers and promoting near-linear transport. Second, we employ conditional mean estimation to obtain a coarse prediction that refines the source anchor to tighten coupling and conditions the velocity field to stabilize large-step updates. Third, a shortcut constraint supervises average velocities over arbitrary time intervals, enabling accurate one-step inference. Experiments demonstrate that SCFlowFR achieves state-of-the-art one-step face restoration quality with inference speed comparable to traditional non-diffusion methods.
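数据依赖耦合下的流匹配训练样本构造可示意如下:源分布取LQ图像本身而非高斯噪声,直线插值路径上的目标速度恒为 x_HQ − x_LQ,这正是一步推断可行的直观来源。以下为最小示意,省略了论文中的条件均值估计与捷径约束:

```python
import numpy as np

def coupled_flow_pair(x_lq, x_hq, t):
    """数据依赖耦合下的一条流匹配训练样本: 源分布取LQ本身而非高斯噪声。"""
    x_t = (1.0 - t) * x_lq + t * x_hq   # 直线插值路径上的点
    v_target = x_hq - x_lq              # 路径上恒定的目标速度
    return x_t, v_target

x_lq = np.array([0.2, 0.4])
x_hq = np.array([1.0, 0.0])
x_t, v = coupled_flow_pair(x_lq, x_hq, t=0.5)
print(np.allclose(x_lq + v, x_hq))  # True: 一个单位Euler步即可由LQ恢复HQ
```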

[CV-101] InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

【速读】:该论文旨在解决长篇故事视频生成中视觉叙事一致性不足的问题,具体包括跨镜头背景一致性、多主体场景下的无缝镜头过渡以及小时级叙事的可扩展性挑战。其解决方案的关键在于提出了一种背景一致性的生成流水线,确保场景间视觉连贯性并保持角色身份与空间关系;同时引入一个过渡感知的视频合成模块,能够处理多个主体进出画面的复杂过渡场景,突破以往仅支持单主体过渡的局限;此外,构建了一个包含10,000个多主体过渡序列的合成数据集,以支撑上述方法的有效训练与评估。

链接: https://arxiv.org/abs/2603.03646
作者: Mohamed Elmoghany,Liangbing Zhao,Xiaoqian Shen,Subhojyoti Mukherjee,Yang Zhou,Gang Wu,Viet Dac Lai,Seunghyun Yoon,Ryan Rossi,Abdullah Rashwan,Puneet Mathur,Varun Manjunatha,Daksh Dangi,Chien Nguyen,Nedim Lipka,Trung Bui,Krishna Kumar Singh,Ruiyi Zhang,Xiaolei Huang,Jaemin Cho,Yu Wang,Namyong Park,Zhengzhong Tu,Hongjie Chen,Hoda Eldardiry,Nesreen Ahmed,Thien Nguyen,Dinesh Manocha,Mohamed Elhoseiny,Franck Dernoncourt
机构: Adobe Research(Adobe研究院); KAUST; Independent Researcher(独立研究员); University of Oregon(俄勒冈大学); University of Memphis(孟菲斯大学); Johns Hopkins University(约翰霍普金斯大学); Meta AI(元AI); Texas A&M University(德州农工大学); Dolby Labs(杜比实验室); Virginia Tech(弗吉尼亚理工大学); Cisco(思科); University of Maryland, College Park(马里兰大学帕克分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and a model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames, going beyond the single-subject limitations of prior work. To support this, we contribute with a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.

[CV-102] Image-based Prompt Injection: Hijacking Multimodal LLM s through Visually Embedded Adversarial Instructions

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉-文本融合过程中引入的新安全漏洞问题,特别是图像引导的提示注入攻击(Image-based Prompt Injection, IPI)这一黑盒攻击形式。其核心问题是:如何在不被人类察觉的前提下,通过嵌入对抗性指令到自然图像中,实现对模型行为的可靠操控。解决方案的关键在于构建一个端到端的IPI攻击管道,包含基于分割的区域选择、自适应字体缩放和背景感知渲染三个核心技术模块,以在保持模型可解释性的同时有效隐藏攻击提示,从而在隐蔽约束下实现高达64%的攻击成功率。

链接: https://arxiv.org/abs/2603.03637
作者: Neha Nagaraja,Lan Zhang,Zhilong Wang,Bo Zhang,Pawan Patil
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 7 pages, published in 2025 3rd International Conference on Foundation and Large Language Models (FLLM), Vienna, Austria

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) integrate vision and text to power applications, but this integration introduces new vulnerabilities. We study Image-based Prompt Injection (IPI), a black-box attack in which adversarial instructions are embedded into natural images to override model behavior. Our end-to-end IPI pipeline incorporates segmentation-based region selection, adaptive font scaling, and background-aware rendering to conceal prompts from human perception while preserving model interpretability. Using the COCO dataset and GPT-4-turbo, we evaluate 12 adversarial prompt strategies and multiple embedding configurations. The results show that IPI can reliably manipulate the output of the model, with the most effective configuration achieving up to 64% attack success under stealth constraints. These findings highlight IPI as a practical threat in black-box settings and underscore the need for defenses against multimodal prompt injection.

[CV-103] Extending Neural Operators: Robust Handling of Functions Beyond the Training Set

【速读】:该论文旨在解决神经算子(Neural Operators)在面对分布外(out-of-distribution)输入函数时的泛化能力不足问题。其解决方案的关键在于引入核近似(kernel approximation)技术,并基于再生核希尔伯特空间(Reproducing Kernel Hilbert Spaces, RKHS)理论对输入-输出函数空间进行刻画,从而建立可靠的扩展条件及其预测逼近精度的理论保障。进一步地,通过明确特定核函数选择与对应Sobolev原生空间(Sobolev Native Spaces)之间的形式关系,使得扩展后的神经算子不仅能准确捕捉函数值,还能可靠地表示其导数信息。这一方法在流形上椭圆型偏微分方程(elliptic PDEs)的求解中得到验证,尤其适用于具有点云表示和几何贡献处理的场景。

链接: https://arxiv.org/abs/2603.03621
作者: Blaine Quackenbush,Paul J. Atzberger
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: related open source software see this https URL

点击查看摘要

Abstract:We develop a rigorous framework for extending neural operators to handle out-of-distribution input functions. We leverage kernel approximation techniques and provide theory for characterizing the input-output function spaces in terms of Reproducing Kernel Hilbert Spaces (RKHSs). We provide theorems on the requirements for reliable extensions and their predicted approximation accuracy. We also establish formal relationships between specific kernel choices and their corresponding Sobolev Native Spaces. This connection further allows the extended neural operators to reliably capture not only function values but also their derivatives. Our methods are empirically validated through the solution of elliptic partial differential equations (PDEs) involving operators on manifolds having point-cloud representations and handling geometric contributions. We report results on key factors impacting the accuracy and computational performance of the extension approaches.
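论文分析所依托的核近似/RKHS构造,可用最基础的高斯核插值来示意:插值函数位于该核对应的再生核希尔伯特空间中。以下仅为普通核回归示意,并非论文的完整神经算子扩展方法:

```python
import numpy as np

def rbf_interpolant(x_train, y_train, lengthscale=1.0, reg=1e-10):
    """高斯RBF核插值: 插值函数位于该核对应的RKHS中。"""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2.0 * lengthscale ** 2))
    K = k(x_train, x_train) + reg * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)              # 表示定理系数
    return lambda x_new: k(np.atleast_1d(x_new), x_train) @ alpha

x = np.linspace(0.0, 1.0, 9)
f = rbf_interpolant(x, np.sin(2 * np.pi * x), lengthscale=0.15)
print(float(f(0.25)[0]))  # 约为1.0 (训练点上 sin(pi/2) = 1)
```

高斯核的原生空间由无穷光滑函数构成;论文进一步讨论了特定核与Sobolev原生空间的对应关系,从而保证导数信息也能被可靠刻画。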

[CV-104] CoRe-BT: A Multimodal Radiology-Pathology-Text Benchmark for Robust Brain Tumor Typing MICCAI2026

【速读】:该论文旨在解决脑肿瘤分型中因临床证据(如磁共振成像MRI、组织病理学和病理报告)不完整而导致的诊断准确性受限问题,特别是在多模态数据缺失的现实场景下如何实现鲁棒的多模态学习。解决方案的关键在于构建了一个名为CoRe-BT的跨模态放射组学-病理学-文本基准数据集,其中包含310例患者的多序列MRI(T1、T1c、T2、FLAIR)、95例配对的HE染色全切片病理图像及病理报告,并标注了肿瘤类型与分级信息,同时提供专家标注的肿瘤掩膜以支持区域感知建模和辅助学习任务。通过对比仅使用MRI的模型与融合病理信息的多模态方法在不同模态可用性条件下的表现,验证了多模态融合的可行性及其在不同临床分型任务中的互补贡献,为真实世界中不完整数据条件下脑胶质瘤的多模态分型与表示学习提供了可落地的研究平台。

链接: https://arxiv.org/abs/2603.03618
作者: Juampablo E. Heras Rivera,Daniel K. Low,Xavier Xiong,Jacob J. Ruzevick,Daniel D. Child,Wen-wai Yim,Mehmet Kurt,Asma Ben Abacha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review, MICCAI 2026

点击查看摘要

Abstract:Accurate brain tumor typing requires integrating heterogeneous clinical evidence, including magnetic resonance imaging (MRI), histopathology, and pathology reports, which are often incomplete at the time of diagnosis. We introduce CoRe-BT, a cross-modal radiology-pathology-text benchmark for brain tumor typing, designed to study robust multimodal learning under missing modality conditions. The dataset comprises 310 patients with multi-sequence brain MRI (T1, T1c, T2, FLAIR), including 95 cases with paired H&E-stained whole-slide pathology images and pathology reports. All cases are annotated with tumor type and grade, and MRI volumes include expert-annotated tumor masks, enabling both region-aware modeling and auxiliary learning tasks. Tumors are categorized into six clinically relevant classes capturing the heterogeneity of common and rare glioma subtypes. We evaluate tumor typing under variable modality availability by comparing MRI-only models with multimodal approaches that incorporate pathology information when present. Baseline experiments demonstrate the feasibility of multimodal fusion and highlight complementary modality contributions across clinically relevant typing tasks. CoRe-BT provides a grounded testbed for advancing multimodal glioma typing and representation learning in realistic scenarios with incomplete clinical data.

[CV-105] RAG Track: Language-aware RGBT Tracking with Retrieval-Augmented Generation CVPR2026

【速读】:该论文旨在解决RGB-Thermal(RGBT)跟踪中因缺乏语言引导导致的目标模型无法适应外观变化、搜索区域冗余以及多模态间隙引发背景干扰等问题。其解决方案的关键在于提出RAGTrack框架,通过引入文本描述增强目标建模能力:首先利用多模态大语言模型(MLLMs)自动生成文本标注;进而设计多模态Transformer编码器(MTE)实现视觉-语言统一建模,采用自适应令牌融合(ATF)机制基于跨模态相关性选择目标相关令牌并进行通道交换,以减少冗余搜索和缓解模态差异;最后构建上下文感知推理模块(CRM),结合检索增强生成(RAG)技术建立动态知识库,实现时序语言推理,从而提升复杂场景下的鲁棒性目标建模性能。

链接: https://arxiv.org/abs/2603.03617
作者: Hao Li,Yuhao Wang,Wenning Hao,Pingping Zhang,Dong Wang,Huchuan Lu
机构: Army Engineering University of PLA (中国人民解放军陆军工程大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is accepted by CVPR2026. More modifications may be performed

点击查看摘要

Abstract:RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling, failing to adapt to appearance variations due to the absence of language guidance. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. Afterwards, we propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) to maintain a dynamic knowledge base and employ a Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available this https URL.
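CRM维护动态知识库并借助RAG检索的核心步骤,可用一个最小的余弦相似度检索来示意(知识库条目与嵌入均为假设示例,论文的CRM包含更丰富的时序信息):

```python
import numpy as np

class TargetKnowledgeBase:
    """最小动态知识库: 余弦相似度检索top-k条目。"""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, embedding, description):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(description)

    def retrieve(self, query, k=1):
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        keys = np.stack(self.keys)
        keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        order = np.argsort(keys @ q)[::-1][:k]       # 相似度降序取top-k
        return [self.values[i] for i in order]

kb = TargetKnowledgeBase()
kb.add([1.0, 0.0], "person in red jacket, frame 10")
kb.add([0.0, 1.0], "person partially occluded, frame 42")
print(kb.retrieve([0.9, 0.1], k=1))  # ['person in red jacket, frame 10']
```

跟踪过程中不断 add 新的目标描述,即可让后续帧的语言推理检索到与当前外观最相关的历史信息。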

[CV-106] LeafInst - Unified Instance Segmentation Network for Fine-Grained Forestry Leaf Phenotype Analysis: A New UAV based Benchmark

【速读】:该论文旨在解决开放田间环境下林木幼苗叶片细粒度实例分割的难题,其核心挑战包括尺度变化、光照差异及不规则叶形带来的识别困难。为应对这些问题,作者构建了首个面向林业叶片的开放场景实例分割数据集Poplar-leaf,并提出LeafInst框架,其关键创新在于:引入渐近特征金字塔网络(Asymptotic Feature Pyramid Network, AFPN)增强多尺度感知能力;设计动态非对称空间感知模块(Dynamic Asymmetric Spatial Perception, DASP)以建模复杂不规则叶形;并采用双残差动态异常回归头(Dual-residual Dynamic Anomalous Regression Head, DARH)结合自顶向下拼接特征融合(Top-down Concatenation decoder Feature Fusion, TCFU)机制,显著提升检测与分割性能,在Poplar-leaf数据集上达到68.4 mAP,优于YOLOv11和MaskDINO,同时在PhenoBench基准上也展现出优越的泛化能力和实际应用价值。

链接: https://arxiv.org/abs/2603.03616
作者: Taige Luo,Junru Xie,Chenyang Fan,Bingrong Liu,Ruisheng Wang,Yang Shao,Sheng Xu,Lin Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Intelligent forest tree breeding has advanced plant phenotyping, yet existing research largely focuses on large-leaf agricultural crops, with limited attention to fine-grained leaf analysis of sapling trees in open-field environments. Natural scenes introduce challenges including scale variation, illumination changes, and irregular leaf morphology. To address these issues, we collected UAV RGB imagery of field-grown saplings and constructed the Poplar-leaf dataset, containing 1,202 branches and 19,876 pixel-level annotated leaf instances. To our knowledge, this is the first instance segmentation dataset specifically designed for forestry leaves in open-field conditions. We propose LeafInst, a novel segmentation framework tailored for irregular and multi-scale leaf structures. The model integrates an Asymptotic Feature Pyramid Network (AFPN) for multi-scale perception, a Dynamic Asymmetric Spatial Perception (DASP) module for irregular shape modeling, and a dual-residual Dynamic Anomalous Regression Head (DARH) with Top-down Concatenation decoder Feature Fusion (TCFU) to improve detection and segmentation performance. On Poplar-leaf, LeafInst achieves 68.4 mAP, outperforming YOLOv11 by 7.1 percent and MaskDINO by 6.5 percent. On the public PhenoBench benchmark, it reaches 52.7 box mAP, exceeding MaskDINO by 3.4 percent. Additional experiments demonstrate strong generalization and practical utility for large-scale leaf phenotyping.

[CV-107] Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression CVPR2026

【速读】:该论文旨在解决分布式多视角图像压缩(Distributed Multi-View Image Compression, DMIC)中因忽略不同视图间相关性差异而导致的编码性能次优问题。现有方法通常对所有视图同等处理,未能根据实际相关程度自适应地融合信息,从而限制了压缩效率的提升。解决方案的关键在于提出一种通用的全向视差注意力机制(OmniParallax Attention Mechanism, OPAM),用于显式建模任意信息源之间的关联性和对齐特征;在此基础上进一步设计了视差多信息融合模块(Parallax Multi Information Fusion Module, PMIFM),能够自适应整合来自不同视图的信息,并将其嵌入联合解码器和熵模型中,构建端到端的DMIC框架ParaHydra。实验表明,该方法是首个显著超越先进多视角图像压缩(MIC)编解码器的DMIC方案,在保持低计算开销的同时,随着输入视图数增加,比特率节省效果更加显著(如WildTrack(6)上达24.18%)。

链接: https://arxiv.org/abs/2603.03615
作者: Haotian Zhang,Feiyue Long,Yixin Yu,Jian Xue,Haocheng Tang,Tongda Xu,Zhenning Shi,Yan Wang,Siwei Ma,Jiaqi Zhang
机构: Peking University (北京大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Multi-view image compression (MIC) aims to achieve high compression efficiency by exploiting inter-image correlations, playing a crucial role in 3D applications. As a subfield of MIC, distributed multi-view image compression (DMIC) offers performance comparable to MIC while eliminating the need for inter-view information at the encoder side. However, existing methods in DMIC typically treat all images equally, overlooking the varying degrees of correlation between different views during decoding, which leads to suboptimal coding performance. To address this limitation, we propose a novel OmniParallax Attention Mechanism (OPAM), which is a general mechanism for explicitly modeling correlations and aligned features between arbitrary pairs of information sources. Building upon OPAM, we propose a Parallax Multi Information Fusion Module (PMIFM) to adaptively integrate information from different sources. PMIFM is incorporated into both the joint decoder and the entropy model to construct our end-to-end DMIC framework, ParaHydra. Extensive experiments demonstrate that ParaHydra is the first DMIC method to significantly surpass state-of-the-art MIC codecs, while maintaining low computational overhead. Performance gains become more pronounced as the number of input views increases. Compared with LDMIC, ParaHydra achieves bitrate savings of 19.72% on WildTrack(3) and up to 24.18% on WildTrack(6), while significantly improving coding efficiency (as much as 65× in decoding and 34× in encoding).

[CV-108] Tracking Feral Horses in Aerial Video Using Oriented Bounding Boxes

【速读】:该论文旨在解决在航拍视角下对群体动物(如野马)进行精准个体追踪时,传统轴对齐边界框(Axis-Aligned Bounding Boxes, AABBs)因复杂背景、目标尺寸小、密度高及姿态变化导致性能下降的问题,以及现有定向边界框(Oriented Bounding Boxes, OBBs)检测器受限于180°角度范围无法区分头部与尾部、易出现帧间180°突变翻转从而破坏连续追踪的缺陷。解决方案的关键在于提出一种基于头向估计的方法:通过裁剪OBB中心区域的图像块,分别使用头部、尾部和头尾联合三个检测器进行预测,并采用IoU加权多数投票机制确定最终标签,从而实现高精度的姿态判别(实验表明准确率达99.3%),显著提升了OBB在群体动物追踪中的鲁棒性与连续性。

链接: https://arxiv.org/abs/2603.03604
作者: Saeko Takizawa,Tamao Maeda,Shinya Yamamoto,Hiroaki Kawashima
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Author’s version of the paper presented at AROB-ISBC 2026

点击查看摘要

Abstract:The social structures of group-living animals such as feral horses are diverse and remain insufficiently understood, even within a single species. To investigate group dynamics, aerial videos are often utilized to track individuals and analyze their movement trajectories, which are essential for evaluating inter-individual interactions and comparing social behaviors. Accurate individual tracking is therefore crucial. In multi-animal tracking, axis-aligned bounding boxes (bboxes) are widely used; however, for aerial top-view footage of entire groups, their performance degrades due to complex backgrounds, small target sizes, high animal density, and varying body orientations. To address this issue, we employ oriented bounding boxes (OBBs), which include rotation angles and reduce unnecessary background. Nevertheless, current OBB detectors such as YOLO-OBB restrict angles within a 180° range, making it impossible to distinguish head from tail and often causing sudden 180° flips across frames, which severely disrupts continuous tracking. To overcome this limitation, we propose a head-orientation estimation method that crops OBB-centered patches, applies three detectors (head, tail, and head-tail), and determines the final label through IoU-based majority voting. Experiments using 299 test images show that our method achieves 99.3% accuracy, outperforming individual models, demonstrating its effectiveness for robust OBB-based tracking.
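摘要中的 IoU 多数投票步骤可以用下面的极简 Python 草图示意(检测器输出格式与票权方式均为笔者假设,并非论文官方实现):

```python
# IoU 加权多数投票(示意实现):三个方向检测器的输出按 IoU 作为票权,
# 票数最高的方向标签即为最终判别结果。
from collections import defaultdict

def vote_orientation(detections):
    """detections: [(label, iou), ...],label 取 'head' 或 'tail'。"""
    scores = defaultdict(float)
    for label, iou in detections:
        scores[label] += iou  # IoU 越高,票权越大
    return max(scores, key=scores.get)
```

实际系统中,按摘要描述,三个检测器分别在以 OBB 中心裁剪的图像块上运行,再将各自的预测汇总投票,以消除 180° 头尾歧义。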

[CV-109] Detection and Identification of Penguins Using Appearance and Motion Features

【速读】:该论文旨在解决动物设施中企鹅个体监测的难题,主要挑战包括企鹅外观高度相似、姿态变化迅速频繁以及环境噪声(如水体反射)干扰导致的检测与识别困难。解决方案的关键在于融合视觉外观特征与运动信息:在检测阶段,通过改进YOLO11模型以输入连续两帧图像,利用时间维度上的运动线索增强目标检测鲁棒性,使模型在单帧图像中难以区分的目标仍可被有效捕捉;在识别阶段,采用基于轨迹片段(tracklet)的对比学习方法,在跟踪后对同一个体的多帧特征进行优化,生成更具判别力的嵌入表示,从而降低ID切换问题,提升个体身份一致性。

链接: https://arxiv.org/abs/2603.03603
作者: Kasumi Seko,Hiroki Kinoshita,Raj Rajeshwar Malinda,Hiroaki Kawashima
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Author’s version of the paper presented at AROB-ISBC 2026

点击查看摘要

Abstract:In animal facilities, continuous surveillance of penguins is essential yet technically challenging due to their homogeneous visual characteristics, rapid and frequent posture changes, and substantial environmental noise such as water reflections. In this study, we propose a framework that enhances both detection and identification performance by integrating appearance and motion features. For detection, we adapted YOLO11 to process consecutive frames to overcome the lack of temporal consistency in single-frame detectors. This approach leverages motion cues to detect targets even when distinct visual features are obscured. Our evaluation shows that fine-tuning the model with two-frame inputs improves mAP@0.5 from 0.922 to 0.933, outperforming the baseline, and successfully recovers individuals that are indistinguishable in static images. For identification, we introduce a tracklet-based contrastive learning approach applied after tracking. Through qualitative visualization, we demonstrate that the method produces coherent feature embeddings, bringing samples from the same individual closer in the feature space, suggesting the potential for mitigating ID switching.
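摘要中的 tracklet 级对比学习可用一个基于 InfoNCE 的简化损失来示意(numpy 草图;温度参数与特征维度均为假设,并非论文实现):

```python
import numpy as np

def tracklet_infonce(embs, tracklet_ids, tau=0.1):
    """tracklet 级 InfoNCE(示意):同一 tracklet 的帧特征互为正样本,
    损失越小表示同一个体的特征在嵌入空间中越接近。"""
    z = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # L2 归一化
    sim = z @ z.T / tau                                     # 余弦相似度 / 温度
    n, losses = len(z), []
    for i in range(n):
        pos = [j for j in range(n) if j != i and tracklet_ids[j] == tracklet_ids[i]]
        if not pos:
            continue
        logits = np.delete(sim[i], i)                       # 去掉自身
        log_den = np.log(np.exp(logits).sum())
        losses += [log_den - sim[i, j] for j in pos]        # -log softmax(正样本)
    return float(np.mean(losses))
```

当同一 tracklet 的特征确实聚在一起时,该损失低于随机配对的情形,这正是摘要中"使同一个体样本在特征空间中更接近"的训练信号。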

[CV-110] DM-CFO: A Diffusion Model for Compositional 3D Tooth Generation with Collision-Free Optimization

【速读】:该论文旨在解决3D牙齿模型自动设计中缺失牙齿的组合式生成问题,尤其针对现有方法在布局与形状重建上的不足,以及基于3D高斯(3D Gaussian)的组合生成中常出现的物体间碰撞冲突问题(即牙齿之间发生交叠,因缺乏显式的表面几何信息)。其解决方案的关键在于提出一种名为DM-CFO的方法:首先在去噪过程中受文本和图结构约束逐步恢复缺失牙齿的布局;随后通过得分蒸馏采样(Score Distillation Sampling, SDS)交替优化每颗牙齿及整个牙弓的高斯参数;并引入基于邻近牙齿与锚定牙齿之间3D高斯点距离的正则化项,以惩罚牙齿间的交叠,从而有效避免碰撞。实验表明,该方法显著提升了生成牙齿的多视角一致性和真实感。

链接: https://arxiv.org/abs/2603.03602
作者: Yan Tian,Pengcheng Xue,Weiping Ding,Mahmoud Hassaballah,Karen Egiazarian,Aura Conci,Abdulkadir Sengur,Leszek Rutkowski
机构: Zhejiang Gongshang University (浙江工商大学); Shining3D Tech Co., Ltd. (闪耀三维科技有限公司); Nantong University (南通大学); Prince Sattam Bin Abdulaziz University (Prince Sattam Bin Abdulaziz大学); Qena University (Qena大学); Tampere University (坦佩雷大学); Universidade Federal Fluminense (联邦弗洛里纳大学); Firat University (法伊尔特大学); Systems Research Institute of the Polish Academy of Sciences (波兰科学院系统研究所); AGH University of Krakow (克拉科夫AGH大学); SAN University (SAN大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Received by IEEE Transactions on Visualization and Computer Graphics

点击查看摘要

Abstract:The automatic design of a 3D tooth model plays a crucial role in dental digitization. However, current approaches face challenges in compositional 3D tooth generation because both the layouts and shapes of missing teeth need to be reconstructed. In addition, collision conflicts are often omitted in 3D Gaussian-based compositional 3D generation, where objects may intersect with each other due to the absence of explicit geometric information on the object surfaces. Motivated by graph generation through diffusion models and collision detection using 3D Gaussians, we propose an approach named DM-CFO for compositional tooth generation, where the layout of missing teeth is progressively restored during the denoising phase under both text and graph constraints. Then, the Gaussian parameters of each layout-guided tooth and the entire jaw are alternately updated using score distillation sampling (SDS). Furthermore, a regularization term based on the distances between the 3D Gaussians of neighboring teeth and the anchor tooth is introduced to penalize tooth intersections. Experimental results on three tooth-design datasets demonstrate that our approach significantly improves the multiview consistency and realism of the generated teeth compared with existing methods. Project page: this https URL.
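摘要中基于相邻高斯间距的防碰撞正则项,其思路可用如下草图示意(margin 等超参数为假设值,非论文实现):

```python
import numpy as np

def collision_penalty(gauss_a, gauss_b, margin=0.05):
    """相邻牙齿高斯中心的成对距离低于 margin 时施加 hinge 惩罚(示意)。
    gauss_a: (N,3),gauss_b: (M,3) 为两颗牙齿的 3D 高斯中心;
    margin 为假设的最小安全间距。"""
    d = np.linalg.norm(gauss_a[:, None, :] - gauss_b[None, :, :], axis=-1)  # (N,M) 成对距离
    return float(np.maximum(margin - d, 0.0).sum())
```

两颗牙齿相距足够远时惩罚为零;一旦高斯中心彼此靠得过近(即将交叠),惩罚随侵入深度线性增大,从而在 SDS 优化中推开相交的牙齿。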

[CV-111] Hazard-Aware Traffic Scene Graph Generation

【速读】:该论文旨在解决复杂驾驶场景中维持情境意识(situational awareness)的难题,特别是现有方法在识别语义类别和视觉显著区域方面的局限性——即缺乏对安全相关性的评估能力。同时,传统场景图(scene graph)模型中的通用空间谓词无法有效刻画交通场景中突出危险源与自车(ego vehicle)之间的特异性关系。解决方案的关键在于提出一种新任务“交通场景图生成”(Traffic Scene Graph Generation),通过引入交通事故数据和深度线索来增强视觉特征与语义信息,从而实现以自车为中心的危险感知推理;其输出的交通场景图以颜色编码标注危险严重程度,并注明影响机制与相对位置,提供直观的安全导向理解。

链接: https://arxiv.org/abs/2603.03584
作者: Yaoqi Huang,Julie Stephany Berrio,Mao Shan,Stewart Worrall
机构: Australian Center for Robotics (澳大利亚机器人中心); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Maintaining situational awareness in complex driving scenarios is challenging. It requires continuously prioritizing attention among extensive scene entities and understanding how prominent hazards might affect the ego vehicle. While existing studies excel at detecting specific semantic categories and visually salient regions, they lack the ability to assess safety-relevance. Meanwhile, the generic spatial predicates either for foreground objects only or for all scene entities modeled by existing scene graphs are inadequate for driving scenarios. To bridge this gap, we introduce a novel task, Traffic Scene Graph Generation, which captures traffic-specific relations between prominent hazards and the ego vehicle. We propose a novel framework that explicitly uses traffic accident data and depth cues to supplement visual features and semantic information for reasoning. The output traffic scene graphs provide intuitive guidelines that stress prominent hazards by color-coding their severity and notating their effect mechanism and relative location to the ego vehicle. We create relational annotations on Cityscapes dataset and evaluate our model on 10 tasks from 5 perspectives. The results in comparative experiments and ablation studies demonstrate our capacity in ego-centric reasoning for hazard-aware traffic scene understanding.

[CV-112] An Effective Data Augmentation Method by Asking Questions about Scene Text Images ICASSP2026

【速读】:该论文旨在解决场景文本识别(Scene Text Recognition, STR)与手写文本识别(Handwritten Text Recognition, HTR)中因传统光学字符识别(OCR)模型直接预测文本而导致的细粒度结构推理不足的问题。其解决方案的关键在于提出一种受视觉问答(VQA)启发的数据增强框架,通过为每张图像-文本对生成自然语言问题(如字符存在性、位置和频率等字符级属性),并以真实文本作为答案,引导模型在训练过程中进行更精细的跨模态推理;该方法促使OCR模型将视觉特征与文本查询对齐,从而联合推理图像内容与问题语义,显著提升了识别准确率,在WordArt和Esposalles数据集上均实现了字符错误率(CER)和词错误率(WER)的降低。

链接: https://arxiv.org/abs/2603.03580
作者: Xu Yao,Lei Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Scene text recognition (STR) and handwritten text recognition (HTR) face significant challenges in accurately transcribing textual content from images into machine-readable formats. Conventional OCR models often predict transcriptions directly, which limits detailed reasoning about text structure. We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency, with answers derived from ground-truth text. These auxiliary tasks encourage finer-grained reasoning, and the OCR model aligns visual features with textual queries to jointly reason over images and questions. Experiments on WordArt and Esposalles datasets show consistent improvements over baseline models, with significant reductions in both CER and WER. Our code is publicly available at this https URL.
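摘要所述的字符级问答增强可以用一个简单的模板函数示意(问题模板为笔者假设,并非论文使用的原始模板):

```python
def make_qa_pairs(word):
    """由标注文本生成字符级问答对:存在性、频次、位置(模板为假设)。
    这些辅助问答与原始转录任务一同训练,促使模型进行更细粒度的推理。"""
    qa = []
    for ch in sorted(set(word)):
        qa.append((f"Does the character '{ch}' appear in the text?", "yes"))
        qa.append((f"How many times does '{ch}' appear?", str(word.count(ch))))
    qa.append(("What is the first character?", word[0]))
    qa.append(("What is the last character?", word[-1]))
    return qa
```

答案全部由真实标注文本推导得到,因此无需额外人工标注即可扩充训练信号。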

[CV-113] Spectrum Shortage for Radio Sensing? Leveraging Ambient 5G Signals for Human Activity Detection

【速读】:该论文旨在解决子10 GHz频谱中无线电感知(Radio Sensing)因频谱资源有限而导致的大规模部署难题。其核心解决方案是提出了一种名为环境无线电感知(Ambient Radio Sensing, ARS)的新型集成感知与通信(Integrated Sensing and Communications, ISAC)方法,关键在于通过被动接收现有无线系统(如5G和Wi-Fi)的空中传播信号,将其重用于感知任务,同时不干扰原通信功能。ARS采用自混频射频(RF)架构实现对环境目标的反射信号采集,并提取鲁棒的多普勒与角度特征,从而在无需额外频谱资源的情况下完成人体姿态估计与掩码分割等应用。

链接: https://arxiv.org/abs/2603.03579
作者: Kunzhe Song,Maxime Zingraff,Huacheng Zeng
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Radio sensing in the sub-10 GHz spectrum offers unique advantages over traditional vision-based systems, including the ability to see through occlusions and preserve user privacy. However, the limited availability of spectrum in this range presents significant challenges for deploying largescale radio sensing applications. In this paper, we introduce Ambient Radio Sensing (ARS), a novel Integrated Sensing and Communications (ISAC) approach that addresses spectrum scarcity by repurposing over-the-air radio signals from existing wireless systems (e.g., 5G and Wi-Fi) for sensing applications, without interfering with their primary communication functions. ARS operates as a standalone device that passively receives communication signals, amplifies them to illuminate surrounding objects, and captures the reflected signals using a self-mixing RF architecture to extract baseband features. This hardware innovation enables robust Doppler and angular feature extraction from ambient OFDM signals. To support downstream applications, we propose a cross-modal learning framework focusing on human activity recognition, featuring a streamlined training process that leverages an off-the-shelf vision model to supervise radio model training. We have developed a prototype of ARS and validated its effectiveness through extensive experiments using ambient 5G signals, demonstrating accurate human skeleton estimation and body mask segmentation applications.

[CV-114] From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes

【速读】:该论文旨在解决开放世界环境中机器人感知中新颖物体实例的检测与分割问题,即在仅有少量模板图像的情况下,定位并分割复杂、未见过场景中的特定物体实例。解决方案的关键在于提出一种局部到全局的实例检测框架L2G-Det,其核心创新是绕过传统基于提议(proposal-based)的方法,通过模板图像与查询图像之间的密集patch级匹配生成候选点,再经由候选选择模块抑制误报,并利用这些过滤后的点作为提示(prompt),驱动增强版Segment Anything Model(SAM)生成具有实例特异性的对象标记(object tokens),从而实现完整且可靠的实例掩码重建。

链接: https://arxiv.org/abs/2603.03577
作者: Qifan Zhang,Sai Haneesh Allu,Jikai Wang,Yangxiao Lu,Yu Xiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.

[CV-115] Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery

【速读】:该论文旨在解决微创手术(Minimally Invasive Surgery, MIS)中单目深度估计(Monocular Depth Estimation, MDE)模型在烟雾、镜面反射、模糊和遮挡等噪声与伪影干扰下精度受限,且缺乏深度置信度输出的问题。解决方案的关键在于提出一种置信度感知的MDE框架,其核心创新包括:(i) 使用校准的置信度目标——通过微调的立体匹配模型集成捕捉视差方差以生成像素级置信概率;(ii) 置信度感知损失函数——使基线MDE模型在训练中优先利用高置信度像素;(iii) 推理时置信度估计头——基于两个卷积层预测每像素置信度,实现对深度预测可靠性的实时评估。该方法显著提升了深度估计准确性,并可为临床应用提供可靠的置信度图。

链接: https://arxiv.org/abs/2603.03571
作者: Muhammad Asad,Emanuele Colleoni,Pritesh Mehta,Nicolas Toussaint,Ricardo Sanchez-Matilla,Maria Robu,Faisal Bashir,Rahim Mohammadi,Imanol Luengo,Danail Stoyanov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Purpose: Monocular depth estimation (MDE) is vital for scene understanding in minimally invasive surgery (MIS). However, endoscopic video sequences are often contaminated by smoke, specular reflections, blur, and occlusions, limiting the accuracy of MDE models. In addition, current MDE models do not output depth confidence, which could be a valuable tool for improving their clinical reliability. Methods: We propose a novel confidence-aware MDE framework featuring three significant contributions: (i) Calibrated confidence targets: an ensemble of fine-tuned stereo matching models is used to capture disparity variance into pixel-wise confidence probabilities; (ii) Confidence-aware loss: Baseline MDE models are optimized with confidence-aware loss functions, utilizing pixel-wise confidence probabilities such that reliable pixels dominate training; and (iii) Inference-time confidence: a confidence estimation head is proposed with two convolution layers to predict per-pixel confidence at inference, enabling assessment of depth reliability. Results: Comprehensive experimental validation across internal and public datasets demonstrates that our framework improves depth estimation accuracy and can robustly quantify the prediction’s confidence. On the internal clinical endoscopic dataset (StereoKP), we improve dense depth estimation accuracy by ~8% as compared to the baseline model. Conclusion: Our confidence-aware framework enables improved accuracy of MDE models in MIS, addressing challenges posed by noise and artifacts in pre-clinical and clinical data, and allows MDE models to provide confidence maps that may be used to improve their reliability for clinical applications.
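摘要中的置信度感知损失(贡献 ii)可粗略地写成按像素置信度加权的 L1 形式(numpy 草图;具体加权方式为笔者假设):

```python
import numpy as np

def confidence_weighted_l1(pred, target, conf, eps=1e-6):
    """置信度加权 L1 深度损失(示意):conf 为 [0,1] 像素级置信概率
    (论文中由立体匹配模型集成的视差方差标定得到),
    权重归一化后,高置信像素主导梯度。"""
    w = conf / (conf.sum() + eps)
    return float((w * np.abs(pred - target)).sum())
```

烟雾、反光等低置信区域的权重被压低,使训练由可靠像素主导,这正是摘要中"reliable pixels dominate training"的含义。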

[CV-116] Modeling Cross-vision Synergy for Unified Large Vision Model CVPR

【速读】:该论文旨在解决当前统一视觉大模型(Unified Large Vision Models, LVMs)在功能集成基础上缺乏跨模态协同推理能力的问题,即忽视了“跨视觉协同”(cross-vision synergy)——即利用不同视觉模态(如图像、视频、3D数据)之间的互补先验进行联合推理的能力。其解决方案的关键在于:架构层面采用由动态模态路由协调的稀疏专家混合(Sparse Mixture-of-Experts, MoE)结构,使每个专家专注于特定模态的先验知识,同时支持模态间双向交互与相互优化;训练层面提出一种协同感知范式(synergy-aware paradigm),通过模态特定预训练与基于知识蒸馏和对象/关系级对齐的粗到细协同调优,实现跨模态知识融合与增强。 该方法在10个涵盖图像、视频和3D理解的基准测试中显著优于现有模型,平均提升超过10%,验证了其在构建具有同步感知能力的视觉推理系统中的有效性。

链接: https://arxiv.org/abs/2603.03564
作者: Shengqiong Wu,Lanhu Wu,Mingyang Bao,Wenhao Xu,Hanwang Zhang,Shuicheng Yan,Hao Fei,Tat-Seng Chua
机构: National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 9 figures, 16 tables, CVPR

点击查看摘要

Abstract:Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on 10 benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic LVMs. Project page: this https URL.

[CV-117] PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest

【速读】:该论文旨在解决多模态视觉语言模型(Visual Language Models, VLMs)在推荐与检索系统中应用时面临的两大核心挑战:训练目标不一致问题和推理服务效率瓶颈。其解决方案的关键在于提出PinCLIP,一种大规模视觉表征学习方法,通过构建一种新颖的混合Vision Transformer架构,利用VLM骨干网络结合混合融合机制,实现不同粒度上的多模态内容表示;同时引入邻域对齐(neighbor alignment)目标,建模Pinterest“Pin-Board”图结构中多模态表示的跨融合关系,从而显著提升检索与排序性能,在离线评估中比Qwen等先进基线模型高出20%,在线A/B测试亦验证了其在用户参与度和冷启动内容分发方面的显著业务价值。

链接: https://arxiv.org/abs/2603.03544
作者: Josh Beal,Eric Kim,Jinfeng Rao,Rex Wu,Dmitry Kislyuk,Charles Rosenberg
机构: Pinterest Inc.(Pinterest公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While multi-modal Visual Language Models (VLMs) have demonstrated significant success across various domains, the integration of VLMs into recommendation and retrieval systems remains a challenge, due to issues like training objective discrepancies and serving efficiency bottlenecks. This paper introduces PinCLIP, a large-scale visual representation learning approach developed to enhance retrieval and ranking models at Pinterest by leveraging VLMs to learn image-text alignment. We propose a novel hybrid Vision Transformer architecture that utilizes a VLM backbone and a hybrid fusion mechanism to capture multi-modality content representation at varying granularities. Beyond standard image-to-text alignment objectives, we introduce a neighbor alignment objective to model the cross-fusion of multi-modal representations within the Pinterest Pin-Board graph. Offline evaluations show that PinCLIP outperforms state-of-the-art baselines, such as Qwen, by 20% in multi-modal retrieval tasks. Online A/B testing demonstrates significant business impact, including substantial engagement gains across all major surfaces in Pinterest. Notably, PinCLIP significantly addresses the “cold-start” problem, enhancing fresh content distribution with a 15% Repin increase in organic content and 8.7% higher click for new Ads.

[CV-118] PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

【速读】:该论文旨在解决当前文本到视频(Text-to-Video, T2V)生成模型在高视觉质量下仍频繁违反物理规律的问题,其核心原因被识别为提示词中缺乏足够的物理约束,而非模型本身能力不足。解决方案的关键在于提出 PhyPrompt,一个两阶段强化学习框架:首先利用物理导向的思维链(Chain-of-Thought)数据微调大语言模型,使其在保持用户意图的前提下融入物体运动与力交互等物理原理;其次采用分阶段奖励课程的组相对策略优化(Group Relative Policy Optimization),从初期强调语义保真度逐步过渡到强化物理常识一致性,实现协同优化。该方法在 VideoPhy2 基准上取得显著提升,同时增强物理合理性(+11pp)和语义契合度(+4.4pp),且无需额外参数规模即可超越 GPT-4o 和 DeepSeek-V3 等更大模型,并具备跨多种 T2V 架构的零样本迁移能力。

链接: https://arxiv.org/abs/2603.03505
作者: Shang Wu,Chenwei Xu,Zhuofan Xia,Weijian Li,Lie Lu,Pranav Maneriker,Fan Du,Manling Li,Han Liu
机构: Northwestern University (西北大学); Dolby Laboratories (杜比实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8% joint success on VideoPhy2 (8.6pp gain), improving physical commonsense by 11pp (55.8% to 66.8%) while simultaneously increasing semantic adherence by 4.4pp (43.4% to 47.8%). Remarkably, our curriculum exceeds single-objective training on both metrics, demonstrating compositional prompt discovery beyond conventional multi-objective trade-offs. PhyPrompt outperforms GPT-4o (+3.8% joint) and DeepSeek-V3 (+2.2%, 100× larger) using only 7B parameters. The approach transfers zero-shot across diverse T2V architectures (Lavie, VideoCrafter2, CogVideoX-5B) with up to 16.8% improvement, establishing that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.
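摘要描述的"语义优先、逐步转向物理"的奖励课程,可以用一个线性调度草图示意(起止权重与线性插值均为笔者假设,论文中的调度细节未必如此):

```python
def curriculum_weights(step, total_steps, w_sem_start=0.8, w_sem_end=0.3):
    """线性奖励课程(示意):语义奖励权重随训练衰减,物理奖励权重相应上升。
    起止权重为假设值。"""
    t = min(max(step / total_steps, 0.0), 1.0)
    w_sem = w_sem_start + (w_sem_end - w_sem_start) * t
    return w_sem, 1.0 - w_sem

def total_reward(r_sem, r_phys, step, total_steps):
    """按当前课程权重组合语义奖励与物理常识奖励。"""
    w_sem, w_phys = curriculum_weights(step, total_steps)
    return w_sem * r_sem + w_phys * r_phys
```

训练初期语义保真度主导,避免提示词偏离用户意图;后期物理奖励权重上升,逐步强化物理常识一致性。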

[CV-119] Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1 RCM and AMSR2 Data

【速读】:该论文旨在解决极地海冰浓度(Sea Ice Concentration, SIC)高分辨率制图中不确定性量化不足的问题,尤其针对冰特征细微、标签不精确、模型不确定性及多源数据异质性等挑战。其核心解决方案包括:1)设计一种融合全局与局部模块的高分辨率Transformer模型,以增强对冰裂隙、冰融湖和浮冰等微小特征的提取能力;2)引入地理加权弱监督损失函数,在区域层面而非像素层面进行训练,优先强化纯开水域和冰盖区的特征,减少边缘混合区(Marginal Ice Zone, MIZ)模糊性影响;3)通过贝叶斯扩展将模型参数视为随机变量,实现更有效的不确定性量化;4)在决策层融合Sentinel-1、RADARSAT Constellation Mission (RCM) 和 Advanced Microwave Scanning Radiometer 2 (AMSR2) 三种遥感数据,提升SIC制图精度与不确定性估计的一致性。

链接: https://arxiv.org/abs/2603.03503
作者: Mabel Heffring,Lincoln Linlin Xu
机构: University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 23 pages, 20 figures

点击查看摘要

Abstract:Although high-resolution mapping of pan-Arctic sea ice with reliable corresponding uncertainty is essential for operational sea ice concentration (SIC) charting, it is a difficult task due to key challenges, such as the subtle nature of ice signature features, inexact SIC labels, model uncertainty, and data heterogeneity. This study presents a novel Bayesian High-Resolution Transformer approach for 200 meter resolution pan-Arctic SIC mapping and uncertainty quantification using Sentinel-1, RADARSAT Constellation Mission (RCM), and Advanced Microwave Scanning Radiometer 2 (AMSR2) data. First, to improve small and subtle sea ice feature (e.g., cracks/leads, ponds, and ice floes) extraction, we design a novel high-resolution Transformer model with both global and local modules that can better discern the subtle differences in sea ice patterns. Second, to address low-resolution and inexact SIC labels, we design a geographically-weighted weakly supervised loss function to supervise the model at region level instead of pixel level, and to prioritize pure open water and ice pack signatures while mitigating the impact of ambiguity in the marginal ice zone (MIZ). Third, to improve uncertainty quantification, we design a Bayesian extension of the proposed Transformer model, treating its parameters as random variables to more effectively capture uncertainties. Fourth, to address data heterogeneity, we fuse three different data types (Sentinel-1, RCM, and AMSR2) at decision-level to improve both SIC mapping and uncertainty quantification. The proposed approach is evaluated under pan-Arctic minimum-extent conditions in 2021 and 2025. Results demonstrate that the proposed model achieves 0.70 overall feature detection accuracy using Sentinel-1 data, while also preserving pan-Arctic SIC patterns (Sentinel-1 R² = 0.90 relative to the ARTIST Sea Ice product).
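摘要中的地理加权弱监督损失在区域而非像素层面监督 SIC,可用如下 numpy 草图示意(区域表示与权重取值均为笔者假设):

```python
import numpy as np

def region_weighted_loss(pred_sic, region_mask, region_labels, geo_w):
    """地理加权、区域级弱监督损失(示意)。
    pred_sic: (H,W) 预测海冰浓度;region_mask: (H,W) 区域 id;
    region_labels: {id: 区域级 SIC 标签};geo_w: {id: 地理权重},
    例如纯开水域/冰盖区权重高、边缘冰区(MIZ)权重低。"""
    loss = wsum = 0.0
    for rid, label in region_labels.items():
        m = region_mask == rid
        if m.any():
            # 先在区域内做池化,再与区域级标签比较,避免逐像素对齐不精确的标签
            loss += geo_w[rid] * abs(float(pred_sic[m].mean()) - label)
            wsum += geo_w[rid]
    return loss / max(wsum, 1e-6)
```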

[CV-120] Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

【速读】:该论文旨在解决当前视频扩散模型在生成过程中缺乏细粒度物理一致性的问题,即模型虽然能生成逼真的视觉内容,但在时间维度上常表现出违背物理规律的动态行为。解决方案的关键在于提出一个三阶段训练范式(Phys4D),首先通过大规模伪监督预训练建立鲁棒的几何与运动表示基础;其次利用仿真生成数据进行物理引导的监督微调,以强化时序一致的四维(4D)动态特性;最后采用仿真引导的强化学习纠正难以通过显式监督捕捉的残余物理违规行为,从而实现从外观驱动到物理一致的4D世界表征跃迁。

链接: https://arxiv.org/abs/2603.03485
作者: Haoran Lu,Shang Wu,Jianshu Zhang,Maojiang Su,Guo Ye,Chenwei Xu,Lie Lu,Pranav Maneriker,Fan Du,Manling Li,Zhaoran Wang,Han Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at this https URL

[CV-121] Beyond Pixel Histories: World Models with Persistent 3D State

【速读】:该论文旨在解决现有交互式世界模型(interactive world models)在3D一致性与空间记忆方面的局限性问题。当前模型通常缺乏显式的3D环境表示,导致其依赖数据隐式学习3D一致性,且空间记忆受限于短时上下文窗口,从而影响生成内容的真实感及下游任务(如智能体训练)的可行性。解决方案的关键在于提出PERSIST范式——一种模拟潜在3D场景(包括环境、相机和渲染器)演化的世界模型,通过显式建模3D结构实现持久的空间记忆与几何一致性,从而支持长时程稳定生成,并具备从单张图像合成多样化3D环境以及在3D空间中进行细粒度编辑与控制的新能力。

链接: https://arxiv.org/abs/2603.03482
作者: Samuel Garcin,Thomas Walker,Steven McDonagh,Tim Pearce,Hakan Bilen,Tianyu He,Kaixin Wang,Jiang Bian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Currently under review

点击查看摘要

Abstract:Interactive world models continually generate video by responding to a user’s actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to down-stream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesize new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: this https URL

[CV-122] Impact of Localization Errors on Label Quality for Online HD Map Construction

[Quick Read]: This paper addresses the problem of distorted map labels in online high-definition (HD) map construction caused by localization errors in vehicle-fleet sensor data. Existing approaches use pre-existing HD maps as labels for onboard sensor data, but consumer fleet data often suffers from inaccurate localization (heading-angle and position errors) that distorts training labels and degrades model performance. The key elements of the solution are: first, systematically introducing three types of localization error (Ramp, Gaussian, and Perlin noise) to simulate real-world deviations; second, training MapTRv2 on the Argoverse 2 dataset under varying noise levels and measuring the resulting degradation, which reveals that heading-angle errors are more damaging than position errors because they distort labels increasingly with distance; and third, proposing a distance-based evaluation metric to account for the fact that distortion of distant labels matters less for driving. The experiments further show that performance degrades more than linearly with the fraction of noisy data, underscoring the value of high-quality, non-distorted ground truth.

Link: https://arxiv.org/abs/2603.03452
Authors: Alexander Blumberg, Jonas Merkert, Richard Fehler, Fabian Immel, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller
Affiliations: Institute of Measurement and Control Systems, Karlsruhe Institute of Technology (KIT); FZI Research Center for Information Technology, Karlsruhe, Germany
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for the 36th IEEE Intelligent Vehicles Symposium (IV 2025), 8 pages

Abstract:High-definition (HD) maps are crucial for autonomous vehicles, but their creation and maintenance is very costly. This motivates the idea of online HD map construction. To provide a continuous large-scale stream of training data, existing HD maps can be used as labels for onboard sensor data from consumer vehicle fleets. However, compared to current, well curated HD map perception datasets, this fleet data suffers from localization errors, resulting in distorted map labels. We introduce three kinds of localization errors, Ramp, Gaussian, and Perlin noise, to examine their influence on generated map labels. We train a variant of MapTRv2, a state-of-the-art online HD map construction model, on the Argoverse 2 dataset with various levels of localization errors and assess the degradation of model performance. Since localization errors affect distant labels more severely, but are also less significant to driving performance, we introduce a distance-based map construction metric. Our experiments reveal that localization noise affects the model performance significantly. We demonstrate that errors in heading angle exert a more substantial influence than position errors, as angle errors result in a greater distortion of labels as distance to the vehicle increases. Furthermore, we can demonstrate that the model benefits from non-distorted ground truth (GT) data and that the performance decreases more than linearly with the increase in noisy data. Our study additionally provides a qualitative evaluation of the extent to which localization errors influence the construction of HD maps.
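
The abstract's key geometric point, that heading errors distort distant labels more than position errors do, can be checked with a few lines of arithmetic. A minimal illustrative sketch (our own, not the paper's code):

```python
import math

def heading_displacement(d, err_rad):
    """Label displacement at range d caused by a heading (yaw) error.

    Rotating the ego frame by err_rad moves a point at distance d along a
    chord of length 2*d*sin(err/2), which grows roughly linearly with d.
    """
    return 2.0 * d * math.sin(err_rad / 2.0)

def position_displacement(d, err_m):
    """A translation error shifts every label by the same amount, regardless of range."""
    return err_m

for d in (5.0, 30.0, 60.0):
    h = heading_displacement(d, math.radians(1.0))
    p = position_displacement(d, 0.5)
    print(f"range {d:5.1f} m: heading-induced {h:.2f} m, position-induced {p:.2f} m")
```

At 5 m the 0.5 m position error dominates, but by 30 m a 1 degree heading error already displaces labels by about 0.52 m, and the gap keeps widening with range, consistent with the abstract's finding.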

[CV-123] Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

[Quick Read]: This paper tackles three key challenges in building human-like AI companions for human-computer interaction: (1) low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both the quality and quantity of generated content under real-time constraints. The key to the solution is the Proact-VL framework, which shapes multimodal language models (MLMs) into proactive, real-time interactive agents capable of human-like environment perception and response. Together with the accompanying Live Gaming Benchmark, it is validated across three scenarios (solo commentary, co-commentary, and user guidance), showing superior response latency, content quality, and video understanding, and demonstrating practicality for real-time interactive applications.

Link: https://arxiv.org/abs/2603.03447
Authors: Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

[CV-124] Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

[Quick Read]: This paper addresses the inadequate measurement of visual dependence in current multimodal medical visual question answering (VQA) evaluation: accuracy-based protocols may fail to detect whether a model truly relies on the image for its reasoning, allowing models to score well through shortcut exploitation without visual grounding. The key to the solution is a counterfactual evaluation framework that tests models under real, blank, and shuffled image conditions, introducing the Visual Reliance Score (VRS), Image Sensitivity (IS), and Hallucinated Visual Reasoning Rate (HVRR) to quantify image dependence and the soundness of visual reasoning. Experiments show that text-only RLVR rewarded on accuracy alone improves accuracy while weakening visual dependence, in some tasks even performing better with mismatched images, and that image-text training improves accuracy yet still yields many ungrounded visual claims, indicating that current evaluation needs mechanisms that explicitly enforce visual dependence.

Link: https://arxiv.org/abs/2603.03437
Authors: Anas Zafar, Leema Krishna Murali, Ashish Vashist
Affiliations: The University of Texas MD Anderson Cancer Center; Cohere Labs Community; Eisai Inc.; CORD.ai; Indian Institute of Science, Bangalore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 2 figures, 2 tables, medical VQA / multimodal reasoning evaluation

Abstract:Recent work shows that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical VQA benchmarks, suggesting current evaluation protocols may fail to measure causal visual dependence. We introduce a counterfactual evaluation framework using real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, we measure Visual Reliance Score (VRS), Image Sensitivity (IS), and introduce Hallucinated Visual Reasoning Rate (HVRR) to detect cases where models generate visual claims despite producing image-invariant answers. Our findings reveal that RLVR improves accuracy while degrading visual grounding: text-only RLVR achieves negative VRS on PathVQA (-0.09), performing better with mismatched images, while image-text RLVR reduces image sensitivity to 39.8% overall despite improving accuracy. On VQA-RAD, both variants achieve 63% accuracy through different mechanisms: text-only RLVR retains 81% performance with blank images, while image-text RLVR shows only 29% image sensitivity. Models generate visual claims in 68-74% of responses, yet 38-43% are ungrounded (HVRR). These findings demonstrate that accuracy-only rewards enable shortcut exploitation, and progress requires grounding-aware evaluation protocols and training objectives that explicitly enforce visual dependence.

[CV-125] mHC-HSI: Clustering-Guided Hyper-Connection Mamba for Hyperspectral Image Classification

[Quick Read]: This paper targets the limited spatial-spectral feature learning and weak interpretability of conventional residual connections in hyperspectral image (HSI) classification. The proposed solution is a clustering-guided, manifold-constrained hyper-connection Mamba model (mHC-HSI) with three key innovations: first, a clustering-guided Mamba module built on the mHC framework that explicitly and jointly learns the spatial and spectral information in HSI; second, a reimplementation of the residual matrix as soft cluster membership maps that decomposes the complex, heterogeneous HSI into smaller clusters, improving the explainability of the mHC approach; and third, the use of physically meaningful spectral band groups as the "parallel streams" in mHC, enhancing both physical interpretability and performance.

Link: https://arxiv.org/abs/2603.03418
Authors: Yimin Zhu, Zack Dewis, Quinn Ledingham, Saeid Taleghanidoozdoozan, Mabel Heffring, Zhengsen Xu, Motasem Alkayid, Megan Greenwood, Lincoln Linlin Xu
Affiliations: University of Calgary; University of Jordan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, DeepSeek has invented the manifold-constrained hyper-connection (mHC) approach, which has demonstrated significant improvements over the traditional residual connection in deep learning models (Xie et al., 2026). Nevertheless, this approach has not been tailor-designed for improving hyperspectral image (HSI) classification. This paper presents a clustering-guided mHC Mamba model (mHC-HSI) for enhanced HSI classification, with the following contributions. First, to improve spatial-spectral feature learning, we design a novel clustering-guided Mamba module, based on the mHC framework, that explicitly learns both spatial and spectral information in HSI. Second, to decompose the complex and heterogeneous HSI into smaller clusters, we design a new implementation of the residual matrix in mHC, which can be treated as soft cluster membership maps, leading to improved explainability of the mHC approach. Third, to leverage the physical spectral knowledge, we divide the spectral bands into physically-meaningful groups and use them as the “parallel streams” in mHC, leading to a physically-meaningful approach with enhanced interpretability. The proposed approach is tested on benchmark datasets in comparison with the state-of-the-art methods, and the results suggest that the proposed model not only improves the accuracy but also enhances the model explainability. Code is available here: this https URL

[CV-126] Beyond Mixtures and Products for Ensemble Aggregation: A Likelihood Perspective on Generalized Means

[Quick Read]: This paper addresses the core machine learning problem of density aggregation, in particular how to optimally combine multiple predictive distributions in settings such as Deep Ensembles. The key to the solution is introducing and systematically analyzing the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty, +\infty\}$ through the lens of log-likelihood, the standard evaluation criterion, thereby unifying classical approaches such as linear pooling (probability averaging, $r=1$) and geometric pooling (logit averaging, $r=0$). The central finding is that only the regime $r \in [0,1]$ guarantees systematic improvement over the individual distributions, providing a principled justification for linear and geometric pooling, together with explicit counterexamples showing that aggregation rules outside this range may fail.

Link: https://arxiv.org/abs/2603.04204
Authors: Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Damien Garreau, Pierre-Alexandre Mattei
Affiliations: Université Côte d’Azur; Inria; CNRS; I3S/LJAD; Maasai; Nice; Julius-Maximilians-Universität Würzburg; Institute for Computer Science / CAIDAS; Würzburg, Germany
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:

Abstract:Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation remains an open question, with two commonly proposed approaches being linear pooling (probability averaging) and geometric pooling (logit averaging). In this work, we address this question by studying the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty, +\infty\}$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. This provides a unifying aggregation formalism and shows different optimal configurations for different situations. We show that the regime $r \in [0,1]$ is the only range ensuring systematic improvements relative to individual distributions, thereby providing a principled justification for the reliability and widespread practical use of linear ($r=1$) and geometric ($r=0$) pooling. In contrast, we show that aggregation rules with $r \notin [0,1]$ may fail to provide consistent gains, with explicit counterexamples. Finally, we corroborate our theoretical findings with empirical evaluations using Deep Ensembles on image and text classification benchmarks.
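
To make the pooling family concrete, here is a small sketch of the normalized generalized mean of order r for categorical ensemble members; the function name is ours, and the r = ±∞ cases (max/min pooling) are omitted for brevity:

```python
import numpy as np

def generalized_mean_pool(probs, r):
    """Normalized generalized mean of order r over K categorical distributions.

    probs has shape (K, C) with strictly positive entries. r=1 recovers
    linear pooling (probability averaging); the r -> 0 limit is geometric
    pooling (logit averaging).
    """
    probs = np.asarray(probs, dtype=float)
    if r == 0:
        agg = np.exp(np.log(probs).mean(axis=0))  # geometric mean (limit r -> 0)
    else:
        agg = np.mean(probs ** r, axis=0) ** (1.0 / r)
    return agg / agg.sum()  # renormalize to a distribution

ensemble = np.array([[0.7, 0.2, 0.1],
                     [0.4, 0.4, 0.2]])
print(generalized_mean_pool(ensemble, 1.0))  # linear pooling: [0.55, 0.30, 0.15]
print(generalized_mean_pool(ensemble, 0.0))  # renormalized geometric pooling
```

Intermediate orders such as r = 0.5 interpolate between the two classical rules, which is exactly the regime the paper identifies as safe.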

[CV-127] Polyp Segmentation Using Wavelet-Based Cross-Band Integration for Enhanced Boundary Representation

[Quick Read]: This paper addresses inaccurate polyp boundary segmentation in early colorectal cancer screening, which is hampered by low mucosal contrast, uneven illumination, and color similarity between polyps and surrounding tissue. Methods relying solely on RGB information struggle under weak contrast and ambiguous structures. The key to the solution is a wavelet-domain analysis of polyp-background contrast, which reveals that grayscale images consistently preserve higher boundary contrast than RGB images across all frequency bands; motivated by this, the proposed segmentation model fuses grayscale and RGB representations through complementary frequency-consistent interaction, improving boundary precision while preserving structural coherence and clearly outperforming existing methods.

Link: https://arxiv.org/abs/2603.03682
Authors: Haesung Oh, Jaesung Lee
Affiliations: Chung-Ang University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 39th Annual Conference on Neural Information Processing Systems in Europe (EurIPS 2025) Workshop, Copenhagen, Denmark, 2-7 December 2025, MedEurIPS: Medical Imaging Meets EurIPS

Abstract:Accurate polyp segmentation is essential for early colorectal cancer detection, yet achieving reliable boundary localization remains challenging due to low mucosal contrast, uneven illumination, and color similarity between polyps and surrounding tissue. Conventional methods relying solely on RGB information often struggle to delineate precise boundaries due to weak contrast and ambiguous structures between polyps and surrounding mucosa. To establish a quantitative foundation for this limitation, we analyzed polyp-background contrast in the wavelet domain, revealing that grayscale representations consistently preserve higher boundary contrast than RGB images across all frequency bands. This finding suggests that boundary cues are more distinctly represented in the grayscale domain than in the color domain. Motivated by this finding, we propose a segmentation model that integrates grayscale and RGB representations through complementary frequency-consistent interaction, enhancing boundary precision while preserving structural coherence. Extensive experiments on four benchmark datasets demonstrate that the proposed approach achieves superior boundary precision and robustness compared to conventional models.

Artificial Intelligence

[AI-0] A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

[Quick Read]: This paper addresses reliability problems in agentic AI for WebGIS development caused by inherent large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. The key to the solution is a dual-helix governance framework that reframes these challenges as structural governance problems, implemented as a three-track architecture (Knowledge, Behavior, Skills): a knowledge graph substrate externalizes domain facts and enforces executable protocols to stabilize execution, complemented by a self-learning cycle for autonomous knowledge growth. Empirically, a governed agent refactoring the FutureShorelines WebGIS tool reduced cyclomatic complexity by 51% and raised the maintainability index by 7 points, confirming that externalized governance, not model capability alone, drives operational reliability in geospatial engineering.

Link: https://arxiv.org/abs/2603.04390
Authors: Boyuan (Keven) Guan, Wencong Cui, Levente Juhasz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Paper submitted to Transactions in GIS

Abstract:WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. We propose a dual-helix governance framework reframing these challenges as structural governance problems that model capacity alone cannot resolve. We implement the framework as a 3-track architecture (Knowledge, Behavior, Skills) that uses a knowledge graph substrate to stabilize execution by externalizing domain facts and enforcing executable protocols, complemented by a self-learning cycle for autonomous knowledge growth. Applying this to the FutureShorelines WebGIS tool, a governed agent refactored a 2,265-line monolithic codebase into modular ES6 components. Results demonstrated a 51% reduction in cyclomatic complexity and a 7-point increase in maintainability index. A comparative experiment against a zero-shot LLM confirms that externalized governance, not just model capability, drives operational reliability in geospatial engineering. This approach is implemented in the open-source AgentLoom governance toolkit.

[AI-1] Low-Resource Guidance for Controllable Latent Audio Diffusion ICASSP2026

[Quick Read]: This paper tackles fine-grained controllability in generative audio, where existing guidance-based methods incur high inference-time cost, chiefly because each step requires decoder backpropagation. The key to the solution is a guidance mechanism built on selective TFG and Latent-Control Heads (LatCHs): LatCHs operate directly in latent space, avoiding the expensive decoder step, and require minimal training resources (about 7M parameters and roughly 4 hours of training), enabling effective control over intensity, pitch, and beats (and their combinations) while maintaining high generation quality at far lower computational cost.

Link: https://arxiv.org/abs/2603.04366
Authors: Zachary Novack, Zack Zukowski, CJ Carr, Julian Parker, Zach Evans, Josiah Taylor, Taylor Berg-Kirkpatrick, Julian McAuley, Jordi Pons
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICASSP 2026

Abstract:Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost-per-step due to decoder backpropagation, we introduce a guidance-based approach through selective TFG and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step, and requiring minimal training resources (7M parameters and approximately 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end-to-end guidance. Demo examples can be found at this https URL.

[AI-2] Dissecting Quantization Error: A Concentration-Alignment Perspective

[Quick Read]: This paper addresses the accuracy loss incurred when large language models (LLMs) undergo post-training quantization (PTQ). Existing methods rely on function-preserving transforms (rotations, Hadamard transforms, or channel-wise scaling) to reduce quantization error, but a principled explanation has been lacking. Through a signal-to-quantization-noise ratio (SQNR) analysis, the authors show that at a fixed bit width the quantization error decomposes into two factors: the concentration of weights and activations (capturing spread and outliers) and the alignment of their dominant variation directions. This implies that optimizing concentration alone is insufficient; improving the alignment between weight and activation directions matters as well. The proposed lightweight block Concentration-Alignment Transforms (CAT) use a covariance estimate from a small calibration set to jointly optimize concentration and alignment, approximately maximizing SQNR. Experiments show that at 4-bit precision CAT matches or outperforms prior transform-based quantization methods across several LLMs, validating the framework.

Link: https://arxiv.org/abs/2603.04359
Authors: Marco Federici, Boris van Breugel, Paul Whatmough, Markus Nagel
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, function-preserving transforms (e.g. rotations, Hadamard transform, channel-wise scaling) have been successfully applied to reduce post-training quantization error, yet a principled explanation remains elusive. We analyze linear-layer quantization via the signal-to-quantization-noise ratio (SQNR), showing that for uniform integer quantization at a fixed bit width, SQNR decomposes into (i) the concentration of weights and activations (capturing spread and outliers), and (ii) the alignment of their dominant variation directions. This reveals an actionable insight: beyond concentration - the focus of most prior transforms (e.g. rotations or Hadamard) - improving alignment between weight and activation can further reduce quantization error. Motivated by this, we introduce block Concentration-Alignment Transforms (CAT), a lightweight linear transformation that uses a covariance estimate from a small calibration set to jointly improve concentration and alignment, approximately maximizing SQNR. Experiments across several LLMs show that CAT consistently matches or outperforms prior transform-based quantization methods at 4-bit precision, confirming the insights gained in our framework.
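
The "concentration" factor in the abstract can be illustrated with a toy experiment: an orthonormal Hadamard rotation (one of the prior transforms the paper compares against, not CAT itself) spreads a single outlier channel across all channels, shrinking the quantization step and raising SQNR. A hedged sketch on synthetic data:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int8_roundtrip(x):
    """Symmetric per-tensor absmax int8 quantize-dequantize."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def sqnr_db(x, x_hat):
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=256)
x[0] = 50.0                             # one dominant outlier channel

H = hadamard(256)
direct = int8_roundtrip(x)
rotated = H.T @ int8_roundtrip(H @ x)   # quantize in the rotated basis, rotate back

print(f"direct:  {sqnr_db(x, direct):.1f} dB")
print(f"rotated: {sqnr_db(x, rotated):.1f} dB")
```

Because the rotation is orthonormal, it is exactly invertible, so the gain comes purely from better concentration; CAT goes further by also optimizing alignment using calibration-set covariance.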

[AI-3] RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots ICLR2026

[Quick Read]: This paper addresses the lack of a reproducible, large-scale benchmark in robot learning for systematically evaluating generalist robots that perform everyday tasks in human environments. The key to the solution is RoboCasa365, a comprehensive simulation benchmark built on the RoboCasa platform that covers 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstrations and over 1,600 hours of synthetically generated demonstrations, supporting systematic evaluation across settings such as multi-task learning, robot foundation model training, and lifelong learning.

Link: https://arxiv.org/abs/2603.04356
Authors: Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, Yuke Zhu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICLR 2026; First three authors contributed equally

Abstract:Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remains difficult to gauge how close we are to this vision. The field lacks a reproducible, large-scale benchmark for systematic evaluation. To fill this gap, we present RoboCasa365, a comprehensive simulation benchmark for household mobile manipulation. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1600 hours of synthetically generated demonstration data – making it one of the most diverse and large-scale resources for studying generalist policies. RoboCasa365 is designed to support systematic evaluations for different problem settings, including multi-task learning, robot foundation model training, and lifelong learning. We conduct extensive experiments on this benchmark with state-of-the-art methods and analyze the impacts of task diversity, dataset scale, and environment variation on generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and inform strategies for future progress in the field.

[AI-4] Efficient Refusal Ablation in LLM through Optimal Transport

[Quick Read]: This paper addresses how safety alignment in current language models can be circumvented by activation-based jailbreaking. Existing methods remove refusal directions via orthogonal projections but ignore the distributional structure of model activations, limiting attack effectiveness. The key to the solution is a framework grounded in optimal transport theory that maps the entire distribution of harmful activations onto the harmless distribution, combining PCA with closed-form Gaussian optimal transport for efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Notably, layer-selective intervention (applying optimal transport to only 1-2 layers at roughly 40-60% of network depth) substantially outperforms full-network intervention, suggesting refusal mechanisms may be localized rather than distributed; this offers new insight into the geometry of safety representations and shows that current alignment methods are vulnerable to distribution-level attacks.

Link: https://arxiv.org/abs/2603.04355
Authors: Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
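
Per (PCA-decorrelated) dimension, the closed-form Gaussian optimal transport map the abstract relies on reduces to an affine rescaling. A minimal sketch on synthetic stand-in activations; the variable names and the diagonal-covariance simplification are ours, not the paper's:

```python
import numpy as np

def gaussian_ot_map(x, mu_src, std_src, mu_tgt, std_tgt):
    """Closed-form OT map between axis-aligned Gaussians, applied per dimension:
    T(x) = mu_tgt + (std_tgt / std_src) * (x - mu_src).

    In the paper's setting x would be model activations projected onto PCA
    directions; here we use synthetic data as a stand-in.
    """
    return mu_tgt + (std_tgt / std_src) * (x - mu_src)

rng = np.random.default_rng(42)
harmful = rng.normal(loc=2.0, scale=0.5, size=(1000, 4))   # stand-in "harmful" activations
harmless = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))  # stand-in "harmless" activations

moved = gaussian_ot_map(harmful,
                        harmful.mean(0), harmful.std(0),
                        harmless.mean(0), harmless.std(0))
print(moved.mean(0).round(3), moved.std(0).round(3))
```

Because the map uses the empirical moments directly, the transported samples match the target mean and standard deviation exactly; the full method generalizes this with a matrix-valued map over correlated dimensions.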

[AI-5] SpotIt: Verification-based Text-to-SQL Evaluation with Database Constraints

[Quick Read]: This paper addresses the limitations of evaluating Text-to-SQL systems with static test sets, which often fail to expose semantic differences between generated and gold SQL on actual database instances. The key to the solution in SpotIt+ is an active search, based on bounded equivalence verification, for database instances that differentiate the two queries, combined with a constraint-mining pipeline that couples rule-based specification mining with LLM-based validation to produce more realistic differentiating databases. This markedly improves the detection of subtle semantic discrepancies between generated and correct SQL while remaining efficient.

Link: https://arxiv.org/abs/2603.04334
Authors: Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Comments:

Abstract:We present SpotIt+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SpotIt+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.

[AI-6] What Does Flow Matching Bring To TD Learning?

[Quick Read]: This paper investigates why flow-matching critics outperform traditional monolithic critics for value estimation in reinforcement learning, particularly in high update-to-data (UTD) online settings where loss of plasticity degrades performance. The key lies in two mechanisms: first, test-time recovery during integration, where multi-step integration iteratively dampens errors in early value estimates, yielding more robust predictions; second, dense velocity supervision at multiple interpolant values, which induces more plastic feature learning within the network, allowing the critic to track non-stationary TD targets without discarding previously learned features or overfitting to individual targets. Experiments confirm substantial gains in both final performance and sample efficiency over monolithic critics.

Link: https://arxiv.org/abs/2603.04333
Authors: Bhavya Agrawalla, Michal Nauman, Aviral Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more plastic feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2x in final performance and around 5x in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., in high-UTD online RL problems, while remaining stable during learning.

[AI-7] Activation Outliers in Transformer Quantization: Reproduction Statistical Analysis and Deployment Tradeoffs EMNLP2021

[Quick Read]: This paper studies the severe accuracy degradation that structured activation outliers cause in post-training quantization (PTQ) of transformers. In BERT-base fine-tuned on QNLI, global W8A8 quantization drops validation accuracy from the 89.66% FP32 baseline to 54.33%, a 35.33-point decrease. The core finding is that deep-layer activations are strongly heavy-tailed (kurtosis up to 271), with about 55% of activation energy concentrated in the top 1% of channels, indicating that these large activations carry structured signal rather than noise. The key to mitigation is channel-aware precision allocation rather than scalar clipping alone: mixed-precision PTQ restores accuracy close to the FP32 baseline (89.42%), and per-embedding-group (PEG) quantization also improves substantially with the right grouping structure (86.18% with four groups), showing that addressing PTQ failure requires fine-grained, channel-level precision management.

Link: https://arxiv.org/abs/2603.04308
Authors: Pranav Kumar Kaliaperumal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 tables. Reproducible study of transformer PTQ activation outliers based on Bondarenko et al. (EMNLP 2021, Qualcomm AI Research). Code: this https URL

Abstract:Post-training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in work associated with Qualcomm AI Research. This paper provides a reproducible empirical reproduction and systems-level extension of that phenomenon in BERT-base fine-tuned on QNLI. When global W8A8 quantization is applied, validation accuracy drops sharply from 89.66% (FP32) to 54.33%, a decrease of 35.33 points. Statistical analysis of FP32 activations shows strongly heavy-tailed behavior that intensifies with model depth: kurtosis reaches 271 in the final layers and approximately 55% of activation energy is concentrated in the top 1% of channels. We evaluate several mitigation strategies. Mixed precision PTQ restores accuracy close to the FP32 baseline (89.42%). Per-embedding-group (PEG) quantization shows strong sensitivity to grouping structure, improving accuracy from 66.12% with three groups to 86.18% with four groups. In contrast, percentile-based calibration, even at thresholds between 99.0 and 99.99, fails to recover accuracy (about 50.54%), indicating that large activation channels encode structured signal rather than rare noise. Deployment profiling on an RTX 3050 GPU shows minimal differences in latency and memory usage across methods (median latency about 58-59 ms; VRAM usage about 484-486 MB), highlighting the importance of hardware-aware evaluation. Overall, the results show that PTQ failure in transformers is primarily driven by structured channel dominance amplified through residual connections. Effective mitigation therefore requires channel-aware precision allocation rather than scalar clipping alone.
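
The three calibration regimes the abstract compares can be reproduced qualitatively on synthetic heavy-tailed activations; this is a toy reconstruction of the reported ordering (mixed precision above per-tensor absmax above percentile clipping), not the paper's evaluation harness:

```python
import numpy as np

def int8_roundtrip(x, scale):
    """Symmetric int8 quantize-dequantize with a given step size."""
    return np.clip(np.round(x / scale), -127, 127) * scale

def sqnr_db(x, x_hat):
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

rng = np.random.default_rng(1)
acts = rng.normal(size=4096)
acts[:8] *= 60.0                          # a few dominant channels, as in deep BERT layers

# (a) per-tensor absmax: outliers inflate the step size for every channel
absmax = int8_roundtrip(acts, np.abs(acts).max() / 127.0)

# (b) percentile clipping: a finer step, but the dominant channels are destroyed
p99 = np.percentile(np.abs(acts), 99.0)
clipped = int8_roundtrip(acts, p99 / 127.0)

# (c) channel-aware mixed precision: keep the outlier channels in float
mixed = int8_roundtrip(acts, np.abs(acts[8:]).max() / 127.0)
mixed[:8] = acts[:8]

for name, a in [("absmax", absmax), ("percentile", clipped), ("mixed", mixed)]:
    print(f"{name:10s} {sqnr_db(acts, a):6.1f} dB")
```

Percentile clipping scores worst here because the clipped channels hold most of the signal energy, mirroring the paper's observation that large activation channels encode structured signal rather than rare noise.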

[AI-8] IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

[Quick Read]: This paper addresses two practical bottlenecks of Decision Transformer-based offline reinforcement learning: difficulty exploiting low-quality or suboptimal experience, and the lack of explicit planning toward optimal behavior. The proposed Imaginary Planning Distillation (IPD) framework integrates offline planning seamlessly into data generation, supervised training, and online inference: it first learns a world model with uncertainty estimates and a quasi-optimal value function from the offline data, then uses the model with Model Predictive Control (MPC) to generate reliable imagined optimal rollouts that augment suboptimal trajectories, and finally trains a Transformer-based sequential policy on the enriched data with a value-guided objective that distills the optimal policy. A key step is replacing the manually tuned return-to-go with the learned quasi-optimal value function, which markedly improves decision stability and overall performance at inference time.

Link: https://arxiv.org/abs/2603.04289
Authors: Yihao Qin, Yuanfei Wang, Hang Zhou, Peiran Liu, Hao Dong, Yiding Ji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Decision transformer based sequential policies have emerged as a powerful paradigm in offline reinforcement learning (RL), yet their efficacy remains constrained by the quality of static datasets and inherent architectural limitations. Specifically, these models often struggle to effectively integrate suboptimal experiences and fail to explicitly plan for an optimal policy. To bridge this gap, we propose Imaginary Planning Distillation (IPD), a novel framework that seamlessly incorporates offline planning into data generation, supervised training, and online inference. Our framework first learns a world model equipped with uncertainty measures and a quasi-optimal value function from the offline data. These components are utilized to identify suboptimal trajectories and augment them with reliable, imagined optimal rollouts generated via Model Predictive Control (MPC). A Transformer-based sequential policy is then trained on this enriched dataset, complemented by a value-guided objective that promotes the distillation of the optimal policy. By replacing the conventional, manually-tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference. Empirical evaluations on the D4RL benchmark demonstrate that IPD significantly outperforms several state-of-the-art value-based and transformer-based offline RL methods across diverse tasks.

[AI-9] VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments

[Quick Read]: This paper addresses the inability of vision-language model (VLM)-based autonomous aerial robots to recover the absolute metric scale of a scene in GPS-denied or communication-degraded environments where camera metadata and telemetry are unavailable. Experiments show that five state-of-the-art VLMs exhibit pronounced spatial scale hallucinations, with median area errors exceeding 50%. The key to the solution is VANGUARD, a lightweight, deterministic geometric perception skill: it detects small vehicles in the image, extracts their modal pixel length from oriented bounding boxes via kernel density estimation, and converts it to Ground Sample Distance (GSD) using a pre-calibrated reference length. The tool returns both the GSD estimate and a composite confidence score, letting the calling agent decide whether to trust the measurement or fall back to other strategies. On the DOTA v1.5 benchmark, VANGUARD achieves 6.87% median GSD error; integrated with SAM-based segmentation, median area measurement error over 100 images drops to 19.7%, with 2.6x lower category dependence and 4x fewer catastrophic failures than the best VLM baseline, underscoring the importance of equipping agents with deterministic geometric tools for safe autonomous spatial reasoning.

Link: https://arxiv.org/abs/2603.04277
Authors: Yifei Chen, Xupeng Chen, Feng Wang, Niangang Jiao, Jiayin Liu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous aerial robots operating in GPS-denied or communication-degraded environments frequently lose access to camera metadata and telemetry, leaving onboard perception systems unable to recover the absolute metric scale of the scene. As LLM/VLM-based planners are increasingly adopted as high-level agents for embodied systems, their ability to reason about physical dimensions becomes safety-critical – yet our experiments show that five state-of-the-art VLMs suffer from spatial scale hallucinations, with median area estimation errors exceeding 50%. We propose VANGUARD, a lightweight, deterministic Geometric Perception Skill designed as a callable tool that any LLM-based agent can invoke to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors: small vehicles detected via oriented bounding boxes, whose modal pixel length is robustly estimated through kernel density estimation and converted to GSD using a pre-calibrated reference length. The tool returns both a GSD estimate and a composite confidence score, enabling the calling agent to autonomously decide whether to trust the measurement or fall back to alternative strategies. On the DOTA v1.5 benchmark, VANGUARD achieves 6.87% median GSD error on 306 images. Integrated with SAM-based segmentation for downstream area measurement, the pipeline yields 19.7% median error on a 100-entry benchmark – with 2.6x lower category dependence and 4x fewer catastrophic failures than the best VLM baseline – demonstrating that equipping agents with deterministic geometric tools is essential for safe autonomous spatial reasoning.
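
The vehicle-anchored GSD recovery the abstract describes amounts to: find the modal pixel length of detected vehicles with a KDE, then divide a pre-calibrated reference length by it. A simplified sketch with a fixed-bandwidth Gaussian KDE; the bandwidth, reference length, and function name are illustrative assumptions, not values from the paper:

```python
import numpy as np

def modal_gsd(pixel_lengths, reference_length_m, bandwidth=2.0):
    """Estimate Ground Sample Distance (m/pixel) from detected-vehicle lengths.

    A fixed-bandwidth Gaussian KDE over the pixel lengths is evaluated on a
    grid; its mode gives the modal pixel length, and
    GSD = reference_length_m / modal_pixel_length, where reference_length_m
    is a pre-calibrated typical vehicle length.
    """
    lengths = np.asarray(pixel_lengths, dtype=float)
    grid = np.linspace(lengths.min(), lengths.max(), 512)
    # unnormalized KDE: sum of Gaussian bumps centered at each observation
    dens = np.exp(-0.5 * ((grid[:, None] - lengths[None, :]) / bandwidth) ** 2).sum(axis=1)
    modal_px = grid[np.argmax(dens)]
    return reference_length_m / modal_px

# lengths clustered near 45 px, with two truck outliers the KDE mode ignores
lengths = [43, 44, 45, 45, 46, 44, 45, 46, 47, 80, 85]
print(round(modal_gsd(lengths, reference_length_m=4.5), 3))  # ~0.1 m per pixel
```

Using the mode rather than the mean is what makes the estimate robust to trucks and other atypically long detections.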

[AI-10] When AI Fails, What Works? A Data-Driven Taxonomy of Real-World AI Risk Mitigation Strategies

【速读】:该论文旨在解决生成式 AI (Generative AI) 在高风险应用场景中因系统性故障引发的连锁失效问题,这些问题已超越单一模型错误,演变为法律、声誉及财务层面的系统性风险。其解决方案的关键在于构建一个基于真实世界AI事故报告的实证驱动型缓解措施分类体系(taxonomy),通过分析9,705篇媒体报道中的6,893条具体应对行动,扩展MIT现有AI风险缓解分类框架,新增四类关键缓解策略:纠正与限制措施、法律/监管与执法措施、金融经济与市场控制、回避与否认策略,并对23,994个标签进行精细化标注,其中67%为此前未见的缓解模式,显著提升该分类体系对新兴系统性失效场景的覆盖能力,从而强化从“诊断到处方”的闭环指导机制,推动部署后持续监控以预防级联事故和下游影响。

链接: https://arxiv.org/abs/2603.04259
作者: Evgenija Popchanovska,Ana Gjorgjevikj,Maryan Rizinski,Lubomir Chitkushev,Irena Vodenska,Dimitar Trajanov
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly embedded in high-stakes workflows, where failures propagate beyond isolated model errors into systemic breakdowns that can lead to legal exposure, reputational damage, and material financial losses. Building on this shift from model-centric risks to end-to-end system vulnerabilities, we analyze real-world AI incident reporting and mitigation actions to derive an empirically grounded taxonomy that links failure dynamics to actionable interventions. Using a unified corpus of 9,705 media-reported AI incident articles, we extract explicit mitigation actions from 6,893 texts via structured prompting and then systematically classify responses to extend MIT’s AI Risk Mitigation Taxonomy. Our taxonomy introduces four new mitigation categories, including 1) Corrective and Restrictive Actions, 2) Legal/Regulatory and Enforcement Actions, 3) Financial, Economic, and Market Controls, and 4) Avoidance and Denial, capturing response patterns that are becoming increasingly prevalent as AI deployment and regulation evolve. Quantitatively, we label the mitigation dataset with 32 distinct labels, producing 23,994 label assignments; 9,629 of these reflect previously unseen mitigation patterns, yielding a 67% increase of the original subcategory coverage and substantially enhancing the taxonomy’s applicability to emerging systemic failure modes. By structuring incident responses, the paper strengthens “diagnosis-to-prescription” guidance and advances continuous, taxonomy-aligned post-deployment monitoring to prevent cascading incidents and downstream impact.

[AI-11] Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent Feedback

【速读】:该论文旨在解决多层分层推理系统中在长期资源约束和仅终端反馈条件下,如何学习最优任务路由策略的问题。其核心挑战在于递归定义的损失结构与稀疏、策略依赖的反馈机制导致重要性加权估计器方差放大,进而引发学习不稳定。解决方案的关键在于提出一种基于EXP4的方差缩减算法,并融合Lyapunov优化技术,实现无偏损失估计与稳定学习,同时提供相对于最优固定路由策略的后悔界保证,在随机到达和资源约束下展现出近优性能。

链接: https://arxiv.org/abs/2603.04247
作者: Haoran Zhang,Seohyeon Cha,Hasan Burhan Beytur,Kevin S Chan,Gustavo de Veciana,Haris Vikalo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Hierarchical inference systems route tasks across multiple computational layers, where each node may either finalize a prediction locally or offload the task to a node in the next layer for further processing. Learning optimal routing policies in such systems is challenging: inference loss is defined recursively across layers, while feedback on prediction error is revealed only at a terminal oracle layer. This induces a partial, policy-dependent feedback structure in which observability probabilities decay with depth, causing importance-weighted estimators to suffer from amplified variance. We study online routing for multi-layer hierarchical inference under long-term resource constraints and terminal-only feedback. We formalize the recursive loss structure and show that naive importance-weighted contextual bandit methods become unstable as feedback probability decays along the hierarchy. To address this, we develop a variance-reduced EXP4-based algorithm integrated with Lyapunov optimization, yielding unbiased loss estimation and stable learning under sparse and policy-dependent feedback. We provide regret guarantees relative to the best fixed routing policy in hindsight and establish near-optimality under stochastic arrivals and resource constraints. Experiments on large-scale multi-task workloads demonstrate improved stability and performance compared to standard importance-weighted approaches.
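摘要所述"观测概率随层深衰减导致重要性加权估计方差放大"可以用一个极简数值实验说明。以下纯为示意,观测概率 p 与损失序列均为虚构,与论文的 EXP4 变体无关:

```python
import random

def iw_estimate(losses, p, rng):
    """仅以概率 p 观测到终端损失;观测到时除以 p,保证估计无偏。"""
    return [(l / p) if rng.random() < p else 0.0 for l in losses]

def mean_and_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
losses = [1.0] * 20000          # 真实终端损失恒为 1
results = {}
for p in (0.5, 0.1, 0.02):      # 模拟层深增加、观测概率衰减
    results[p] = mean_and_var(iw_estimate(losses, p, rng))
# 各 p 下的均值都接近 1(无偏),但单样本方差约为 (1-p)/p,随 p 衰减而爆炸
```

这正是论文引入方差缩减的动机:无偏性本身不难维持,难的是在深层稀疏反馈下控制方差。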

[AI-12] Agentics 2.0: Logical Transduction Algebra for Agentic Data Workflows

【速读】:该论文旨在解决生成式 AI (Generative AI) 在企业级部署中面临的软件质量属性不足问题,如可靠性、可扩展性和可观测性,而不仅仅是实现合理的文本生成。其解决方案的关键在于提出 Agentics 2.0 框架,通过逻辑传递代数(logical transduction algebra)将大语言模型(LLM)推理调用形式化为带类型的语义变换,称为可传递函数(transducible function),该函数强制执行模式有效性与证据局部性;同时利用代数基础的组合算子构建可组合程序,并以无状态异步调用方式在 Map-Reduce 程序中并行执行,从而在类型安全、可解释性、证据追踪和并行扩展方面实现系统级保障。

链接: https://arxiv.org/abs/2603.04241
作者: Alfio Massimiliano Gliozzo,Junkyu Lee,Nahuel Defosse
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:Agentic AI is rapidly transitioning from research prototypes to enterprise deployments, where requirements extend to meet the software quality attributes of reliability, scalability, and observability beyond plausible text generation. We present Agentics 2.0, a lightweight, Python-native framework for building high-quality, structured, explainable, and type-safe agentic data workflows. At the core of Agentics 2.0, the logical transduction algebra formalizes a large language model inference call as a typed semantic transformation, which we call a transducible function that enforces schema validity and the locality of evidence. The transducible functions compose into larger programs via algebraically grounded operators and execute as stateless asynchronous calls in parallel in asynchronous Map-Reduce programs. The proposed framework provides semantic reliability through strong typing, semantic observability through evidence tracing between slots of the input and output types, and scalability through stateless parallel execution. We instantiate reusable design patterns and evaluate the programs in Agentics 2.0 on challenging benchmarks, including DiscoveryBench for data-driven discovery and Archer for NL-to-SQL semantic parsing, demonstrating state-of-the-art performance.
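"带类型的语义变换(transducible function)强制模式有效性"这一思想可以用 Python dataclass 粗略模拟。以下类型、槽位与校验逻辑均为笔者虚构的示意,并非 Agentics 2.0 的真实 API;真实框架中被包装的是 LLM 推理调用,这里用纯函数代替:

```python
from dataclasses import dataclass, fields

@dataclass
class Question:
    text: str

@dataclass
class Answer:
    text: str
    evidence: str  # 证据局部性:输出必须携带来源槽位

def transduce(fn, out_type):
    """把一次变换包装成带输出类型校验的 transducible function。"""
    def wrapped(inp):
        out = fn(inp)
        if not isinstance(out, out_type):
            raise TypeError(f"expected {out_type.__name__}")
        for f in fields(out_type):  # 模式有效性:每个槽位都必须已填且类型正确
            if not isinstance(getattr(out, f.name), f.type):
                raise TypeError(f"slot {f.name} violates schema")
        return out
    return wrapped

answer = transduce(lambda q: Answer(text=q.text.upper(), evidence=q.text), Answer)
result = answer(Question("hello"))
```

由于每次调用都是无状态的纯映射,这样的函数天然可以在 Map-Reduce 风格的流水线中并行执行。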

[AI-13] PRAM-R: A Perception-Reasoning-Action-Memory Framework with LLM-Guided Modality Routing for Adaptive Autonomous Driving

【速读】:该论文旨在解决自动驾驶中多模态感知(Multimodal Perception)带来的冗余计算开销问题,即在所有传感器持续激活时导致的资源浪费。其核心解决方案是提出PRAM-R框架,该框架采用异步双环设计:快速反应环用于实时感知与控制,慢速反思环则基于大语言模型(LLM)引导的模态路由机制进行自适应的传感器选择与记忆更新。关键创新在于引入LLM驱动的模态选择策略,结合环境上下文与传感器诊断信息动态调整各模态权重,并通过分层记忆模块保障时序一致性与长期适应能力,从而实现高效、鲁棒且自适应的多模态感知系统。

链接: https://arxiv.org/abs/2603.04222
作者: Yi Zhang,Xian Zhang,Saisi Zhao,Yinglei Song,Chengdong Wu,Nenad Petrovic,Alois Knoll
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal perception enables robust autonomous driving but incurs unnecessary computational cost when all sensors remain active. This paper presents PRAM-R, a unified Perception-Reasoning-Action-Memory framework with LLM-Guided Modality Routing for adaptive autonomous driving. PRAM-R adopts an asynchronous dual-loop design: a fast reactive loop for perception and control, and a slow deliberative loop for reasoning-driven modality selection and memory updates. An LLM router selects and weights modalities using environmental context and sensor diagnostics, while a hierarchical memory module preserves temporal consistency and supports long-term adaptation. We conduct a two-stage evaluation: (1) synthetic stress tests for stability analysis and (2) real-world validation on the nuScenes dataset. Synthetic stress tests confirm 87.2% reduction in routing oscillations via hysteresis-based stabilization. Real-world validation on nuScenes shows 6.22% modality reduction with 20% memory recall while maintaining comparable trajectory accuracy to full-modality baselines in complex urban scenarios. Our work demonstrates that LLM-augmented architectures with hierarchical memory achieve efficient, adaptive multimodal perception in autonomous driving.
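摘要提到"基于滞回(hysteresis)的稳定化使路由振荡减少 87.2%",其基本思想是开启阈值高于关闭阈值,使评分在单一阈值附近抖动时模态状态保持不变。以下为示意实现,阈值与评分序列均为虚构:

```python
def hysteresis_router(scores, on_th=0.6, off_th=0.4):
    """带滞回的模态开关:score >= on_th 才激活,score < off_th 才关闭。"""
    active = set()
    history = []
    for step_scores in scores:  # 每步为 {模态名: 路由评分}
        for m, s in step_scores.items():
            if m not in active and s >= on_th:
                active.add(m)
            elif m in active and s < off_th:
                active.discard(m)
        history.append(frozenset(active))
    return history

# 评分在 0.5 附近抖动:单阈值路由会反复开关,滞回路由保持状态不变
noisy = [{"lidar": 0.52}, {"lidar": 0.48}, {"lidar": 0.51}, {"lidar": 0.47}]
states = hysteresis_router(noisy)  # lidar 始终未被激活
```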

[AI-14] ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis INTERSPEECH2026

【速读】:该论文旨在解决低资源个性化语音合成中,使用零样本文本到语音(Zero-shot Text-to-Speech, ZS-TTS)生成数据进行数据增强时,因混合大量合成语音与有限真实录音而导致的说话人相似度下降问题。解决方案的关键在于提出ZeSTA框架——一个轻量级域条件训练机制,通过引入一个简单的域嵌入(domain embedding)来区分真实语音与合成语音,并结合真实数据过采样策略,在不修改基础模型架构的前提下,显著提升了在极端有限目标数据下的适应稳定性,从而在保持语音可懂度和感知质量的同时,有效改善了说话人特征保留能力。

链接: https://arxiv.org/abs/2603.04219
作者: Youngwon Choi,Jinwoo Oh,Hwayeon Kim,Hyeonyu Kim
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 6 pages, submitted to INTERSPEECH 2026

点击查看摘要

Abstract:We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality.

[AI-15] Noise-aware Client Selection for carbon-efficient Federated Learning via Gradient Norm Thresholding

【速读】:该论文旨在解决碳高效联邦学习(Carbon-efficient Federated Learning)中因客户端数据质量未知而导致的模型训练性能下降问题,尤其是在客户端选择策略受可再生能源波动影响时,如何兼顾模型收敛效率与可持续性。其解决方案的关键在于提出一种模块化增强方法,通过引入噪声客户端数据过滤机制,并结合探测轮次(probing rounds)的梯度范数阈值判定策略,有效识别并剔除含有噪声的数据源,从而提升模型鲁棒性和训练稳定性,同时优化碳预算下的训练效率。

链接: https://arxiv.org/abs/2603.04194
作者: Patrick Wilhelm,Inese Yilmaz,Odej Kao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training large-scale Neural Networks requires substantial computational power and energy. Federated Learning enables distributed model training across geospatially distributed data centers, leveraging renewable energy sources to reduce the carbon footprint of AI training. Various client selection strategies have been developed to align the volatility of renewable energy with stable and fair model training in a federated system. However, due to the privacy-preserving nature of Federated Learning, the quality of data on client devices remains unknown, posing challenges for effective model training. In this paper, we introduce a modular approach on top of state-of-the-art client selection strategies for carbon-efficient Federated Learning. Our method enhances robustness by incorporating a noisy client data filtering, improving both model performance and sustainability in scenarios with unknown data quality. Additionally, we explore the impact of carbon budgets on model convergence, balancing efficiency and sustainability. Through extensive evaluations, we demonstrate that modern client selection strategies based on local client loss tend to select clients with noisy data, ultimately degrading model performance. To address this, we propose a gradient norm thresholding mechanism using probing rounds for more effective client selection and noise detection, contributing to the practical deployment of carbon-efficient Federated Learning.
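文中"探测轮次 + 梯度范数阈值"的客户端筛选逻辑大致可示意如下。阈值取"均值 + z 倍标准差"以及各客户端的梯度数据均为笔者虚构,仅用于说明含噪数据往往表现为异常大的梯度范数:

```python
import math

def grad_norm(grads):
    return math.sqrt(sum(g * g for g in grads))

def select_clients(probe_grads, z=1.0):
    """探测轮:收集各客户端梯度范数,剔除超出 均值 + z*标准差 的疑似噪声客户端。"""
    norms = {c: grad_norm(g) for c, g in probe_grads.items()}
    mu = sum(norms.values()) / len(norms)
    sd = math.sqrt(sum((v - mu) ** 2 for v in norms.values()) / len(norms))
    return [c for c, v in norms.items() if v <= mu + z * sd]

probe = {
    "c1": [0.1, 0.2], "c2": [0.15, 0.1], "c3": [0.12, 0.18],
    "c4": [3.0, 4.0],  # 含噪数据通常导致异常大的梯度范数
}
kept = select_clients(probe)  # c4 被过滤
```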

[AI-16] Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长期、个性化用户交互中对复杂且多变偏好理解能力不足的问题,即如何有效评估LLM在真实场景下持续遵循用户偏好的表现。其解决方案的关键在于提出RealPref基准测试框架,该框架包含100个用户档案、1300条个性化偏好、四类偏好表达方式(从显式到隐式),以及长周期交互历史,并设计了三种类型的测试问题(选择题、是非题和开放题)及详细的评分标准用于LLM作为评判者(LLM-as-a-judge)的评估。实验证明,随着上下文长度增加和偏好表达变得更为隐含,LLM性能显著下降,且将偏好理解泛化至未见场景仍是重大挑战,从而为未来开发具备用户感知能力的个性化LLM助手提供了基础。

链接: https://arxiv.org/abs/2603.04191
作者: Qianyun Guo,Yibo Li,Yue Liu,Bryan Hooi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow these preferences in realistic, long-term situations remains underexplored. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions. RealPref features 100 user profiles, 1300 personalized preferences, four types of preference expression (ranging from explicit to implicit), and long-horizon interaction histories. It includes three types of test questions (multiple-choice, true-or-false, and open-ended), with detailed rubrics for LLM-as-a-judge evaluation. Results indicate that LLM performance significantly drops as context length grows and preference expression becomes more implicit, and that generalizing user preference understanding to unseen scenarios poses further challenges. RealPref and these findings provide a foundation for future research to develop user-aware LLM assistants that better adapt to individual needs. The code is available at this https URL.

[AI-17] CAM-LDS: Cyber Attack Manifestations for Automatic Interpretation of System Logs and Security Alerts

【速读】:该论文旨在解决传统日志分析方法在入侵检测与取证调查中自动化程度低的问题,其核心瓶颈在于现有方法依赖领域特定配置(如专家定义的检测规则、手工设计的日志解析器和特征工程),难以实现对日志的语义理解及攻击原因解释。解决方案的关键是引入生成式 AI(Generative AI)技术,利用大语言模型(Large Language Models, LLMs)实现跨领域、跨格式的日志语义解析,并通过构建一个大规模、多源异构且标注详尽的攻击表现日志数据集(Cyber Attack Manifestation Log Data Set, CAM-LDS),为 LLM 提供高质量训练与评估基础,从而提升日志分析的自动化水平和准确性。

链接: https://arxiv.org/abs/2603.04186
作者: Max Landauer,Wolfgang Hotwagner,Thorina Boenke,Florian Skopik,Markus Wurzenberger
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Log data are essential for intrusion detection and forensic investigations. However, manual log analysis is tedious due to high data volumes, heterogeneous event formats, and unstructured messages. Even though many automated methods for log analysis exist, they usually still rely on domain-specific configurations such as expert-defined detection rules, handcrafted log parsers, or manual feature-engineering. Crucially, the level of automation of conventional methods is limited due to their inability to semantically understand logs and explain their underlying causes. In contrast, Large Language Models enable domain- and format-agnostic interpretation of system logs and security alerts. Unfortunately, research on this topic remains challenging, because publicly available and labeled data sets covering a broad range of attack techniques are scarce. To address this gap, we introduce the Cyber Attack Manifestation Log Data Set (CAM-LDS), comprising seven attack scenarios that cover 81 distinct techniques across 13 tactics and collected from 18 distinct sources within a fully open-source and reproducible test environment. We extract log events that directly result from attack executions to facilitate analysis of manifestations concerning command observability, event frequencies, performance metrics, and intrusion detection alerts. We further present an illustrative case study utilizing an LLM to process the CAM-LDS. The results indicate that correct attack techniques are predicted perfectly for approximately one third of attack steps and adequately for another third, highlighting the potential of LLM-based log interpretation and utility of our data set.

[AI-18] Architectural Proprioception in State Space Models: Thermodynamic Training Induces Anticipatory Halt Detection

【速读】:该论文旨在解决神经网络在推理过程中缺乏内在计算效率感知与自适应停止机制的问题,尤其关注如何使模型具备对自身计算状态的元认知能力,从而实现更高效、可解释且成本敏感的推理。解决方案的关键在于提出概率导航架构(Probability Navigation Architecture, PNA),通过引入基于热力学原理的新型损失函数,在训练状态空间模型(State Space Models, SSMs)和Transformer时同时惩罚信息浪费与标准交叉熵误差;其中,SSMs展现出一种普适的停止签名(Universal Stopping Signature, USS),即递归状态熵与停机置信度之间存在强预测性耦合(r = -0.836, p < 0.001),且该耦合可通过热力学压力调控,体现其天然具备计算自我意识的特性,而Transformer则无此现象,表明该机制具有架构依赖性。

链接: https://arxiv.org/abs/2603.04180
作者: Jay Noon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 15 figures

点击查看摘要

Abstract:We introduce the Probability Navigation Architecture (PNA) framework, which treats neural computation as navigation through a probability manifold governed by thermodynamic principles. We train State Space Models (SSMs) and Transformers with a novel thermodynamic loss function that penalizes computational waste alongside standard cross-entropy. Across 19 experimental phases, we discover that thermodynamically-trained SSMs develop architectural proprioception: a strong anticipatory coupling between recurrent state entropy and halt confidence (r = -0.836, p < 0.001) in which the halt signal leads state entropy collapse by exactly two tokens (tau = -2.0). This Universal Stopping Signature (USS) reproduces to four decimal places across random seeds and generalizes to a structurally distinct sorting task. Critically, Transformers trained identically show no such coupling (r = -0.07), demonstrating that the phenomenon is architecture-dependent. Cross-task transfer experiments confirm that SSM halt detection reflects genuine meta-cognition (zero-shot transfer F1: SSMs 64.2% vs. Transformers 69.3%; post-adaptation: SSMs 94.5% vs. Transformers 86.4%), while Transformer halt detection relies on syntactic pattern matching. A 2D hyperparameter sweep over energy penalty (alpha) and halt supervision (beta) reveals that the anticipatory coupling is continuously controllable through training, with thermodynamic pressure serving as the primary induction mechanism and explicit halt supervision as an amplifier. Our results establish that SSMs are thermodynamically native architectures whose fixed-size recurrent states naturally support the Markovian compression that enables computational self-awareness, with implications for cost-aware inference, dynamic token budgets, and confidence-based routing in production systems.
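文中的核心证据,即"停机信号领先状态熵坍缩两个 token(tau = -2.0)",本质上是一个滞后互相关分析。下面用人工构造的序列示意这一度量方式(平滑阶跃形状与参数均为虚构,仅说明如何定位最优滞后):

```python
import numpy as np

def best_lead_lag(halt, entropy, max_lag=5):
    """在 [-max_lag, max_lag] 中寻找使 |相关系数| 最大的滞后;tau < 0 表示 halt 领先 entropy。"""
    best = (0, 0.0)
    for tau in range(-max_lag, max_lag + 1):
        if tau < 0:
            a, b = halt[:tau], entropy[-tau:]
        elif tau > 0:
            a, b = halt[tau:], entropy[:-tau]
        else:
            a, b = halt, entropy
        r = np.corrcoef(a, b)[0, 1]
        if abs(r) > abs(best[1]):
            best = (tau, r)
    return best

t = np.arange(60, dtype=float)
entropy = np.tanh((t - 40) / 3)   # 熵在 t≈40 处坍缩(此处用平滑阶跃示意)
halt = -np.tanh((t - 38) / 3)     # 停机置信度提前 2 个 token 上升,方向相反
tau, r = best_lead_lag(halt, entropy)  # 期望 tau = -2,r 接近 -1
```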

[AI-19] CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)编码代理在生成代码时积累复杂性、重复和架构债务的问题,试图验证LLM代理是否能够可靠地执行重构(refactoring)操作,并识别人类开发者在真实代码库中实际选择的重构方式。其解决方案的关键在于提出CodeTaste基准测试集,该数据集从开源仓库的大规模多文件变更中挖掘重构任务,并结合仓库测试套件与基于数据流推理的定制静态检查来量化评估重构效果。实验表明,当前前沿模型在详细指定重构动作时表现良好,但在仅提供改进区域的情况下难以发现人类选择的重构方案;通过“提议-实现”分解策略并优选最匹配的重构提案,可显著提升与人类重构决策的一致性,从而为对齐编码代理与人类开发实践提供了新的评估目标和偏好信号。

链接: https://arxiv.org/abs/2603.04177
作者: Alex Thillen,Niels Mündler,Veselin Raychev,Martin Vechev
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language model (LLM) coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. In this paper, we investigate if LLM agents (i) can execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. We present CodeTaste, a benchmark of refactoring tasks mined from large-scale multi-file changes in open-source repositories. To score solutions, we combine repository test suites with custom static checks that verify removal of undesired patterns and introduction of desired patterns using dataflow reasoning. Our experimental results indicate a clear gap across frontier models: agents perform well when refactorings are specified in detail, but often fail to discover the human refactoring choices when only presented with a focus area for improvement. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases.

[AI-20] GarmentPile: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning ICRA2026

【速读】:该论文旨在解决现实场景中衣物堆叠状态下难以准确、安全地执行单件衣物检索的问题,这是当前家庭服务机器人在衣物操作任务(如折叠、悬挂、穿戴)中面临的关键瓶颈。其解决方案的核心在于构建一个融合视觉-语言模型(VLM)与视觉可操作性感知的端到端衣物检索流水线:首先利用SAM2分割模型对堆叠衣物进行精细掩码分割,并通过掩码微调机制提升分割质量以增强VLM的语义理解能力;其次,结合双臂协作框架应对大件或易垂坠衣物的抓取挑战,从而确保每次操作仅成功提取一件衣物,为后续高阶任务提供稳定输入。

链接: https://arxiv.org/abs/2603.04158
作者: Mingleyang Li,Yuran Wang,Yue Chen,Tianxing Chen,Jiaqi Liang,Zishun Shen,Haoran Lu,Ruihai Wu,Hao Dong
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICRA2026 Accepted

点击查看摘要

Abstract:Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instruction to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM’s comprehensive awareness of each garment’s state within a garment pile, we employ visual segmentation model (SAM2) to execute object segmentation on the garment pile for aiding VLM-based reasoning with sufficient visual cues. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline are consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: this https URL.

[AI-21] Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

【速读】:该论文旨在解决Group Relative Policy Optimization (GRPO)在大规模语言模型(LLM)推理训练中因大量分组采样导致的计算开销过高问题,同时避免现有选择性数据利用方法因改变原始采样分布而引入估计偏差、损害理论收敛性的缺陷。解决方案的关键在于提出Dynamic Pruning Policy Optimization (DPPO),其核心创新是通过重要性采样(importance sampling)机制对剪枝后的梯度进行无偏修正,并引入数学推导的重缩放因子以保持与全批量基线相同的优化目标;此外,为缓解剪枝带来的数据稀疏问题,设计了基于窗口的密集提示打包策略(Dense Prompt Packing),最大化有效token密度和硬件利用率,从而实现高效且稳定的加速训练。

链接: https://arxiv.org/abs/2603.04135
作者: Haodong Zhu,Yangyang Ren,Yanjing Li,Mingbao Lin,Linlin Yang,Xuhui Liu,Xiantong Zhen,Haiguang Liu,Baochang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs due to its extensive group-based sampling requirement. While recent selective data utilization methods can mitigate this overhead, they could induce estimation bias by altering the underlying sampling distribution, compromising theoretical rigor and convergence behavior. To address this limitation, we propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction. By incorporating mathematically derived rescaling factors, DPPO significantly accelerates GRPO training without altering the optimization objective of the full-batch baseline. Furthermore, to mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization. Extensive experiments demonstrate that DPPO consistently accelerates training across diverse models and benchmarks. For instance, on Qwen3-4B trained on MATH, DPPO achieves 2.37× training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.
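DPPO 的无偏性来自经典的重要性采样恒等式:以概率 q_i 保留样本 i,并把其梯度重缩放 1/q_i,则剪枝后梯度的期望等于全批量梯度。下面用标量"梯度"做一个数值验证(保留概率与数据均为虚构,仅演示该恒等式本身):

```python
import random

def pruned_grad(grads, keep_probs, rng):
    """以 q_i 保留样本 i,保留时梯度乘以 1/q_i;期望上与全批量平均梯度一致。"""
    total, n = 0.0, len(grads)
    for g, q in zip(grads, keep_probs):
        if rng.random() < q:
            total += g / q
    return total / n

grads = [0.5, -1.0, 2.0, 0.25]
qs = [0.9, 0.2, 0.5, 0.7]
full = sum(grads) / len(grads)  # 全批量平均梯度 0.4375
rng = random.Random(0)
est = sum(pruned_grad(grads, qs, rng) for _ in range(200000)) / 200000
# est 应收敛到 full,即剪枝不改变优化目标
```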

[AI-22] Data-Aware Random Feature Kernel for Transformers

【速读】:该论文旨在解决Transformer模型中注意力机制因softmax内核的二次复杂度导致难以扩展的问题,尤其是在预训练模型中查询和键(queries and keys)通常呈现各向异性时,传统基于各向同性分布的随机特征采样方法会引入高蒙特卡洛方差,从而影响性能。解决方案的关键在于通过数据对齐(data alignment)调整softmax内核的几何结构,使得能够推导出一个可 tractable(可处理的)最小方差重要性采样提议分布,并在此基础上提出DARKFormer——一种具有数据感知随机特征核(data-aware random-feature kernel)的Transformer架构。该方法通过学习随机投影协方差矩阵,高效实现对数据对齐内核的重要性采样正随机特征估计器,在资源受限场景下显著提升了kernel-based attention的训练稳定性和性能表现,尤其在微调阶段(finetuning regimes)中缩小了与精确softmax注意力的差距。

链接: https://arxiv.org/abs/2603.04127
作者: Amirhossein Farzam,Hossein Mobahi,Nolan Andrew Miller,Luke Sernau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel with positive random features drawn from an isotropic distribution. In pretrained models, however, queries and keys are typically anisotropic. This induces high Monte Carlo variance in isotropic sampling schemes unless one retrains the model or uses a large feature budget. Importance sampling can address this by adapting the sampling distribution to the input geometry, but complex data-dependent proposal distributions are often intractable. We show that by data aligning the softmax kernel, we obtain an attention mechanism which can both admit a tractable minimal-variance proposal distribution for importance sampling, and exhibits better training stability. Motivated by this finding, we introduce DARKFormer, a Data-Aware Random-feature Kernel transformer that features a data-aligned kernel geometry. DARKFormer learns the random-projection covariance, efficiently realizing an importance-sampled positive random-feature estimator for its data-aligned kernel. Empirically, DARKFormer narrows the performance gap with exact softmax attention, particularly in finetuning regimes where pretrained representations are anisotropic. By combining random-feature efficiency with data-aware kernels, DARKFormer advances kernel-based attention in resource-constrained settings.
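Performer 一脉的正随机特征用 φ(x) = exp(wᵀx − ‖x‖²/2)/√m 的蒙特卡洛平均逼近 softmax 核 exp(qᵀk),各向同性采样即 w ~ N(0, I)。以下示意代码验证该逼近(维度与样本数为示意取值,未包含 DARKFormer 所学的数据对齐协方差):

```python
import numpy as np

def positive_features(x, W):
    """正随机特征 φ(x);W 的每一行是一个采样方向 w ~ N(0, I)。"""
    m = W.shape[0]
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 200000
q = rng.normal(size=d) * 0.2
k = rng.normal(size=d) * 0.2
W = rng.normal(size=(m, d))  # 各向同性高斯采样
exact = np.exp(q @ k)                                    # 精确 softmax 核
approx = positive_features(q, W) @ positive_features(k, W)  # 随机特征逼近
# approx ≈ exact;q、k 越各向异性(范数越大),该各向同性估计的方差越大
```

摘要的出发点正是上面最后一行注释:预训练模型的 query/key 分布各向异性,使各向同性采样的方差难以接受,因而需要数据感知的采样分布。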

[AI-23] SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling

【速读】:该论文旨在解决自动驾驶测试中安全关键场景生成的难题,即如何在对抗性临界性(adversarial criticality)、物理可行性(physical feasibility)和行为真实性(behavioral realism)三者之间取得平衡。现有方法往往难以同时满足这三个目标,导致生成的场景要么不现实,要么无法有效触发系统失效。解决方案的关键在于提出SaFeR框架,其核心创新是基于可行性约束的token重采样策略:首先利用Transformer模型构建一个以自然驾驶分布为先验的真实感模型,并引入一种新颖的差分注意力机制以增强交互建模并抑制注意力噪声;在此基础上,通过在高概率信任区域内诱导对抗行为,同时结合由最大可行区域(Largest Feasible Region, LFR)定义的物理可行性约束,确保生成场景既具备挑战性又不会产生理论上必然发生的碰撞。该方法通过离线强化学习近似LFR,从而实现高效且可靠的场景生成。

链接: https://arxiv.org/abs/2603.04071
作者: Jinlong Cui,Fenghua Liang,Guo Yang,Chengcheng Tang,Jianxun Cui
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety-critical scenario generation is crucial for evaluating autonomous driving systems. However, existing approaches often struggle to balance three conflicting objectives: adversarial criticality, physical feasibility, and behavioral realism. To bridge this gap, we propose SaFeR: safety-critical scenario generation for autonomous driving test via feasibility-constrained token resampling. We first formulate traffic generation as a discrete next token prediction problem, employing a Transformer-based model as a realism prior to capture naturalistic driving distributions. To capture complex interactions while effectively mitigating attention noise, we propose a novel differential attention mechanism within the realism prior. Building on this prior, SaFeR implements a novel resampling strategy that induces adversarial behaviors within a high-probability trust region to maintain naturalism, while enforcing a feasibility constraint derived from the Largest Feasible Region (LFR). By approximating the LFR via offline reinforcement learning, SaFeR effectively prevents the generation of theoretically inevitable collisions. Closed-loop experiments on the Waymo Open Motion Dataset and nuPlan demonstrate that SaFeR significantly outperforms state-of-the-art baselines, achieving a higher solution rate and superior kinematic realism while maintaining strong adversarial effectiveness.
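"在高概率信任区域内重采样 + 可行性掩码"的组合可用如下示意代码表达。信任区域取 top-p,可行掩码与对抗加权均为虚构的简化,真实系统中可行性由离线强化学习近似的 LFR 给出:

```python
import numpy as np

def resample_token(probs, feasible, adv_weight, top_p=0.9, rng=None):
    """在真实感先验的 top-p 信任区域内,剔除不可行 token,再按对抗权重重加权采样。"""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    trust = order[: int(np.searchsorted(cum, top_p)) + 1]  # 信任区域,维持自然驾驶分布
    cand = [t for t in trust if feasible[t]]               # 可行性约束(LFR 掩码的占位)
    w = probs[cand] * adv_weight[cand]                     # 向对抗行为倾斜
    return int(rng.choice(cand, p=w / w.sum()))

probs = np.array([0.5, 0.3, 0.15, 0.05])
feasible = np.array([True, False, True, True])  # token 1 会导致不可避免碰撞,被掩掉
adv = np.array([1.0, 1.0, 4.0, 1.0])            # token 2 更具对抗性
tok = resample_token(probs, feasible, adv, rng=np.random.default_rng(0))
```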

[AI-24] Sim2Sea: Sim-to-Real Policy Transfer for Maritime Vessel Navigation in Congested Waters

【速读】:该论文旨在解决复杂海上环境中自主航行的挑战,特别是由于船舶间交互复杂性和环境不确定性导致的仿真到现实(sim-to-real)迁移难题。现有方法因仿真精度不足、情境感知能力有限及探索策略不安全而在实际部署中表现不佳。解决方案的关键在于提出一个名为Sim2Sea的综合框架,其核心创新包括:(1)开发基于GPU加速的并行仿真器以实现高保真、可扩展的海上场景模拟;(2)设计双流时空策略网络结合速度障碍引导的动作掩码机制,提升对多模态感知和复杂动力学的处理能力,并保障探索过程的安全性与效率;(3)引入定向领域随机化策略有效缩小仿真与现实之间的差距。实验表明,该方法在仿真中收敛更快且轨迹更安全,且纯仿真训练的策略可零样本迁移到一艘17吨无人船的真实密集水域作业中,验证了其在实际自主海上导航中的可靠sim-to-real迁移能力。

链接: https://arxiv.org/abs/2603.04057
作者: Xinyu Cui,Xuanfa Jin,Xue Yan,Yongcheng Zeng,Luoyang Sun,Siying Wei,Ruizhi Zhang,Jian Zhao,Haifeng Zhang,Jun Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous navigation in congested maritime environments is a critical capability for a wide range of real-world applications. However, it remains an unresolved challenge due to complex vessel interactions and significant environmental uncertainties. Existing methods often fail in practical deployment due to a substantial sim-to-real gap, which stems from imprecise simulation, inadequate situational awareness, and unsafe exploration strategies. To address these, we propose Sim2Sea, a comprehensive framework designed to bridge simulation and real-world execution. Sim2Sea advances in three key aspects. First, we develop a GPU-accelerated parallel simulator for scalable and accurate maritime scenario simulation. Second, we design a dual-stream spatiotemporal policy that handles complex dynamics and multi-modal perception, augmented with a velocity-obstacle-guided action masking mechanism to ensure safe and efficient exploration. Finally, a targeted domain randomization scheme helps bridge the sim-to-real gap. Simulation results demonstrate that our method achieves faster convergence and safer trajectories than established baselines. In addition, our policy trained purely in simulation successfully transfers zero-shot to a 17-ton unmanned vessel operating in real-world congested waters. These results validate the effectiveness of Sim2Sea in achieving reliable sim-to-real transfer for practical autonomous maritime navigation.
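速度障碍(Velocity Obstacle)引导的动作掩码,其基本判据是:若某候选速度使相对速度落入指向障碍船的碰撞锥内,则屏蔽该动作。下面是教科书式的二维简化示意,几何与参数均非论文实现:

```python
import math

def in_collision_cone(v_own, v_obs, rel_pos, radius):
    """判断相对速度是否落入碰撞锥:与视线方向的夹角小于由合并半径决定的半锥角。"""
    rvx, rvy = v_own[0] - v_obs[0], v_own[1] - v_obs[1]
    dist = math.hypot(*rel_pos)
    if dist <= radius:
        return True
    half_angle = math.asin(radius / dist)
    rel_speed = math.hypot(rvx, rvy)
    if rel_speed < 1e-9:
        return False
    # 相对速度方向与指向障碍物方向的夹角
    cos_a = (rvx * rel_pos[0] + rvy * rel_pos[1]) / (rel_speed * dist)
    return math.acos(max(-1.0, min(1.0, cos_a))) < half_angle

def mask_actions(actions, v_obs, rel_pos, radius=5.0):
    """返回每个候选速度动作的可行标记(True 表示安全,可供策略探索)。"""
    return [not in_collision_cone(a, v_obs, rel_pos, radius) for a in actions]

actions = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0)]   # 直行 / 左转 / 倒退
mask = mask_actions(actions, v_obs=(0.0, 0.0), rel_pos=(20.0, 0.0))
# 直行指向静止障碍船,被屏蔽;其余动作安全
```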

[AI-25] Inference-Time Toxicity Mitigation in Protein Language Models

【速读】:该论文旨在解决蛋白质语言模型(Protein Language Models, PLMs)在从头蛋白设计中可能引发的毒性风险问题,尤其是当模型通过领域自适应(domain adaptation)针对特定分类群(taxonomic groups)微调时,即便毒性并非训练目标,也可能诱发有毒蛋白的生成。解决方案的关键在于提出一种推理阶段的控制机制——对数几率差值放大(Logit Diff Amplification, LDA),该方法通过放大基准模型与毒性微调模型之间 logits 的差异来调整生成概率,无需重新训练模型;实验证明,LDA 在四个分类群中均能显著降低预测毒性率(以 ToxDL2 衡量),同时保持生成序列的生物学合理性(通过 Fréchet ESM Distance 和预测折叠性 pLDDT 评估),优于基于激活值引导的方法,后者常导致序列质量下降。

链接: https://arxiv.org/abs/2603.04045
作者: Manuel Fernández Burda,Santiago Aranguri,Iván Arcuschin Moreno,Enzo Ferrante
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference-time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability (unlike activation-based steering methods that tend to degrade sequence properties). Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.
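LDA 的推理时干预可以概括为 adjusted = logits_base + α·(logits_base − logits_tox),即把基准模型与毒性微调模型之间的 logit 差沿"远离毒性"方向放大。符号约定与 α 取值是笔者对摘要的一种示意化解读,并非论文公式本身:

```python
import numpy as np

def lda_adjust(logits_base, logits_tox, alpha=1.5):
    """放大基准与毒性微调模型的 logit 差,压低毒性模型偏好的 token。"""
    return logits_base + alpha * (logits_base - logits_tox)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

base = np.array([2.0, 1.0, 0.5])
tox = np.array([1.0, 3.0, 0.5])   # 假设毒性微调模型偏好 token 1
p_base = softmax(base)
p_lda = softmax(lda_adjust(base, tox))
# token 1 的概率相对 p_base 被显著压低,其余 token 相对放大
```

这种干预不需要重训练,只需在每步解码时额外做一次前向,这也是摘要称其为"practical safety knob"的原因。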

[AI-26] Self-adapting Robotic Agents through Online Continual Reinforcement Learning with World Model Feedback IROS2026

【速读】:该论文旨在解决学习型机器人控制器在离线训练后参数固定、难以应对部署过程中未预见变化的问题。解决方案的关键在于提出一种基于世界模型预测残差的在线持续强化学习(Continual Reinforcement Learning)框架,利用DreamerV3算法实现对分布外事件的自动检测与微调触发,同时通过任务级性能信号和内部训练指标监控适应过程,从而无需外部监督即可评估收敛性,使机器人具备类似生物体的自我反思与自主改进能力。

链接: https://arxiv.org/abs/2603.04029
作者: Fabian Domberg,Georg Schildbach
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: submitted to IROS 2026

点击查看摘要

Abstract:As learning-based robotic controllers are typically trained offline and deployed with fixed parameters, their ability to cope with unforeseen changes during operation is limited. Biologically inspired, this work presents a framework for online Continual Reinforcement Learning that enables automated adaptation during deployment. Building on DreamerV3, a model-based Reinforcement Learning algorithm, the proposed method leverages world model prediction residuals to detect out-of-distribution events and automatically trigger finetuning. Adaptation progress is monitored using both task-level performance signals and internal training metrics, allowing convergence to be assessed without external supervision and domain knowledge. The approach is validated on a variety of contemporary continuous control problems, including a quadruped robot in high-fidelity simulation, and a real-world model vehicle. Relevant metrics and their interpretation are presented and discussed, as well as resulting trade-offs described. The results sketch out how autonomous robotic agents could once move beyond static training regimes toward adaptive systems capable of self-reflection and -improvement during operation, just like their biological counterparts.
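
论文利用世界模型预测残差检测分布外事件并触发微调,其触发逻辑可以用如下极简示意表达(假设性实现;阈值、窗口大小与残差数值均为虚构示例):

```python
def should_finetune(residuals, threshold=0.3, window=5):
    # 世界模型预测残差的滑动均值超过阈值 => 视为分布外事件,触发在线微调
    recent = residuals[-window:]
    return sum(recent) / len(recent) > threshold

nominal = [0.10, 0.12, 0.09, 0.11, 0.10]         # 分布内:残差平稳
shifted = nominal + [0.8, 0.9, 0.85, 0.9, 0.88]  # 环境突变:残差骤增
```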

[AI-27] A Multi-Dimensional Quality Scoring Framework for Decentralized LLM Inference with Proof of Quality

【速读】:该论文旨在解决去中心化大语言模型(Large Language Model, LLM)推理网络中输出质量评估的难题,尤其是在异构计算资源和潜在恶意评估者存在的情况下,如何设计轻量且激励相容的质量评分机制。其关键解决方案是提出一种多维质量评分框架,将输出质量解耦为多个模块化维度,包括模型与成本先验、结构质量、语义质量、查询-输出对齐度以及一致性/不确定性,并通过任务日志数据系统性验证各维度的可靠性。研究发现,部分看似合理的维度在不同任务中可能表现不稳定甚至与参考质量负相关,因此通过剔除不可靠维度并重新归一化权重,构建出校准后的复合评分指标,该指标在性能上可媲美或超越单一最优评估器及共识基线。最终,该复合评分作为即插即用的质量信号集成至PoQ机制中,在对抗性评估者攻击下展现出与鲁棒聚合和自适应信任加权相结合的互补优势。

链接: https://arxiv.org/abs/2603.04028
作者: Arther Tian,Alex Ding,Frank Chen,Simon Wu,Aaron Chan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Decentralized large language model (LLM) inference networks can pool heterogeneous compute to scale serving, but they require lightweight and incentive-compatible mechanisms to assess output quality. Prior work introduced cost-aware Proof of Quality (PoQ) and adaptive robust PoQ to allocate rewards under evaluator heterogeneity and adversarial behavior. In this paper, we focus on the quality signal itself and propose a multi-dimensional quality scoring framework that decomposes output quality into modular dimensions, including model and cost priors, structure quality, semantic quality, query-output alignment, and agreement/uncertainty. Using logged outputs from QA and summarization tasks, we systematically audit dimension reliability and show that seemingly reasonable dimensions can be task-dependent and even negatively correlated with reference quality without calibration. While the default composite underperforms a strong single semantic evaluator, ablations reveal that removing unreliable dimensions and re-normalizing weights yields a calibrated composite that matches or exceeds the best single- evaluator and consensus baselines. Finally, we integrate the composite score as a drop-in quality signal in PoQ and demonstrate complementary benefits with robust aggregation and adaptive trust weighting under adversarial evaluator attacks.
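
摘要中"剔除不可靠维度并重新归一化权重"的复合评分可以用如下极简 Python 示意(假设性实现;维度名称、得分与权重均为虚构示例):

```python
def composite_score(scores, weights, unreliable=()):
    # 多维质量评分的加权合成:剔除不可靠维度后对剩余权重重新归一化
    kept = {d: w for d, w in weights.items() if d not in unreliable}
    total = sum(kept.values())
    return sum(scores[d] * w / total for d, w in kept.items())

scores = {"semantic": 0.9, "structure": 0.7, "alignment": 0.8, "agreement": 0.3}
weights = {"semantic": 0.4, "structure": 0.2, "alignment": 0.2, "agreement": 0.2}

raw = composite_score(scores, weights)
calibrated = composite_score(scores, weights, unreliable=("agreement",))
```

在该虚构示例中,剔除与参考质量负相关的 "agreement" 维度后,复合得分从 0.72 提升到 0.825,对应论文中"校准后的复合指标优于默认合成"的现象。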

[AI-28] STEM Faculty Perspectives on Generative AI in Higher Education AAAI2026

【速读】:该论文试图解决的问题是:在高等教育中,生成式 AI (Generative AI) 已经广泛渗透到教学实践中,但其主要由学生驱动使用,导致教师面临如何有效、负责任地整合此类技术的挑战。当前,尽管部分教师已将其用于内容生成、评估支持和课程设计等教学场景,但更多教师仍持谨慎态度,担忧其对学习效果、评估有效性及学术诚信的影响。为制定有效的教学策略与机构政策,亟需深入理解教师视角。解决方案的关键在于重新思考评估方式、教学法以及机构治理机制,而不仅仅是技术层面的采纳——这要求高校从制度层面对 GenAI 的应用进行系统性规划,以实现技术与教育目标的深度融合。

链接: https://arxiv.org/abs/2603.04001
作者: Akila de Silva,Isabel Hyo Jung Song,Hui Yang,Shah Rukh Humayoun
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2026 Spring Symposium - Will AI Light Up Human Creativity or Replace It?: Toward Well-Being AI for co-evolving human and machine intelligence

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) tools are increasingly present in higher education, yet their adoption has been largely student-driven, requiring instructors to respond to technologies already embedded in classroom practices. While some faculty have embraced GenAI for pedagogical purposes such as content generation, assessment support, and curriculum design, others approach these tools with caution, citing concerns about student learning, assessment validity, and academic integrity. Understanding faculty perspectives is therefore essential for informing effective pedagogical strategies and institutional policies. In this paper, we present findings from a focus group study with 29 STEM faculty members at a large public university in the United States. We examine how faculty integrate GenAI into their courses, the benefits and challenges they perceive for student learning, and the institutional support they identify as necessary for effective and responsible adoption. Our findings highlight key patterns in how STEM faculty engage with GenAI, reflecting both active adoption and cautious use. Faculty described a range of pedagogical applications alongside concerns about student learning, assessment, and academic integrity. Overall, the results suggest that effective integration of GenAI in higher education requires rethinking assessment, pedagogy, and institutional governance in addition to technical adoption.

[AI-29] Spectral Surgery: Training-Free Refinement of LoRA via Gradient-Guided Singular Value Reweighting

【速读】:该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)在实际应用中存在参数利用效率低下的问题:尽管LoRA通过将任务更新限制在低秩子空间以提升下游性能,但训练后的LoRA更新往往呈现“谱效率低下”现象——即大部分任务相关信号集中在少数奇异方向上,而其余方向则为中性甚至有害,导致资源浪费。解决方案的关键在于提出一种无需再训练的后处理优化方法——Spectral Surgery,其核心是基于奇异值分解(SVD)对LoRA更新进行结构化分解,利用小规模校准集上的梯度估计各奇异分量的敏感性,并在保持学习到的方向不变的前提下,通过施加幅值约束重新加权奇异值,从而实现仅调整约1000个标量系数即可显著提升LoRA性能的高效微调。

链接: https://arxiv.org/abs/2603.03995
作者: Zailong Tian,Yanzhe Chen,Zhuoheng Han,Lizi Liao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) improves downstream performance by restricting task updates to a low-rank parameter subspace, yet how this limited capacity is allocated within a trained adapter remains unclear. Through a geometric and empirical study across multiple tasks and backbones, we find that trained LoRA updates often exhibit an inefficient spectrum: task effects concentrate in a small subset of singular directions, while many remaining components are neutral or detrimental, motivating post-hoc refinement within the learned subspace. We propose Spectral Surgery, a training-free refinement that decomposes a LoRA update with SVD, estimates per-component sensitivity using gradients on a small calibration set, and reweights singular values under a magnitude constraint while keeping the learned directions fixed. Across Llama-3.1-8B and Qwen3-8B on four benchmarks, Spectral Surgery yields consistent gains (up to +4.4 points on CommonsenseQA and +2.4 pass@1 on HumanEval) by adjusting only \approx 1,000 scalar coefficients. These results demonstrate that SVD-structured, low-cost parameter editing can serve as a practical route to improving trained LoRA adapters in a purely post-hoc manner.
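
Spectral Surgery 的核心操作(SVD 分解、固定方向、在幅值约束下重加权奇异值)可以用 NumPy 极简示意如下(假设性实现;sensitivity 为虚构的每分量敏感度得分,论文中由校准集梯度估计得到):

```python
import numpy as np

def spectral_surgery(delta_w, sensitivity, max_scale=1.5):
    # 对 LoRA 更新 delta_w 做 SVD,按敏感度重加权奇异值;
    # 方向(U、V)保持不变,缩放幅度受 max_scale 约束
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    scale = np.clip(1.0 + sensitivity, 0.0, max_scale)  # 幅值约束
    return u @ np.diag(s * scale) @ vt

rng = np.random.default_rng(0)
delta_w = rng.standard_normal((8, 8)) * 0.01
sensitivity = np.linspace(0.4, -0.4, 8)  # 虚构假设:头部方向有益、尾部方向有害
refined = spectral_surgery(delta_w, sensitivity)
```

当敏感度全为零时该操作退化为恒等变换,说明调整只发生在已学习的奇异子空间内部。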

[AI-30] Measuring AI R&D Automation

【速读】:该论文试图解决生成式 AI (Generative AI) 研发自动化(AIRDA)的量化评估问题,即当前缺乏能够准确反映其在现实世界中应用程度及其对AI能力进步与安全治理影响的实证数据。现有研究主要依赖能力基准测试,但无法捕捉AIRDA是否加速了能力提升超过安全进展,或是否导致监管能力滞后于研发速度等关键风险。解决方案的关键在于提出一套多维指标体系,涵盖AI研发支出中的资本占比、研究人员时间分配以及AI子系统被滥用的事件(AI subversion incidents),从而帮助决策者监测AIRDA的发展趋势、评估其潜在后果,并制定相应的安全对策。

链接: https://arxiv.org/abs/2603.03992
作者: Alan Chan,Ranay Padarath,Joe Kwon,Hilary Greaves,Markus Anderljung
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The automation of AI R&D (AIRDA) could have significant implications, but its extent and ultimate effects remain uncertain. We need empirical data to resolve these uncertainties, but existing data (primarily capability benchmarks) may not reflect real-world automation or capture its broader consequences, such as whether AIRDA accelerates capabilities more than safety progress or whether our ability to oversee AI R&D can keep pace with its acceleration. To address these gaps, this work proposes metrics to track the extent of AIRDA and its effects on AI progress and oversight. The metrics span dimensions such as capital share of AI R&D spending, researcher time allocation, and AI subversion incidents, and could help decision makers understand the potential consequences of AIRDA, implement appropriate safety measures, and maintain awareness of the pace of AI development. We recommend that companies and third parties (e.g. non-profit research organisations) start to track these metrics, and that governments support these efforts.

[AI-31] Right in Time: Reactive Reasoning in Regulated Traffic Spaces

【速读】:该论文旨在解决在共享交通空间中,自主代理(如无人机)如何通过精确的概率推理实现在线、实时的安全与合规性保障问题。传统方法受限于计算复杂度,仅适用于预飞行检查,难以应对动态环境中的不确定性。解决方案的关键在于提出一种结合概率任务设计(ProMis)与反应式电路(RC)的反应式任务设计框架,利用异构数据流中的“变化频率”将推理公式分解为可缓存的独立子任务,从而仅对受新传感器数据影响的部分进行重评估,实现了在混合域上的精确概率推理,显著提升了计算效率,使智能交通系统能够在运行过程中主动保证安全与法律合规性。

链接: https://arxiv.org/abs/2603.03977
作者: Simon Kohaut,Benedict Flade,Julian Eggert,Kristian Kersting,Devendra Singh Dhami
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Exact inference in probabilistic First-Order Logic offers a promising yet computationally costly approach for regulating the behavior of autonomous agents in shared traffic spaces. While prior methods have combined logical and probabilistic data into decision-making frameworks, their application is often limited to pre-flight checks due to the complexity of reasoning across vast numbers of possible universes. In this work, we propose a reactive mission design framework that jointly considers uncertain environmental data and declarative, logical traffic regulations. By synthesizing Probabilistic Mission Design (ProMis) with reactive reasoning facilitated by Reactive Circuits (RC), we enable online, exact probabilistic inference over hybrid domains. Our approach leverages the Frequency of Change inherent in heterogeneous data streams to subdivide inference formulas into memoized, isolated tasks, ensuring that only the specific components affected by new sensor data are re-evaluated. In experiments involving both real-world vessel data and simulated drone traffic in dense urban scenarios, we demonstrate that our approach provides orders of magnitude in speedup over ProMis without reactive paradigms. This allows intelligent transportation systems, such as Unmanned Aircraft Systems (UAS), to actively assert safety and legal compliance during operations rather than relying solely on preparation procedures.
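
论文中"按变化频率把推理公式拆分为可缓存的独立子任务、仅重算受新数据影响的部分"这一思路,可以用如下极简缓存示意(假设性实现,与论文的 Reactive Circuits 实现无关;数据流名称与数值均为虚构):

```python
class ReactiveTask:
    # 依据数据流的"变化频率"缓存子任务:仅当依赖的数据流更新时才重新求值
    def __init__(self, fn, deps):
        self.fn, self.deps = fn, deps
        self.cache, self.versions, self.evals = None, None, 0

    def value(self, streams):
        # streams: {数据流名称: (版本号, 数据)}
        vers = tuple(streams[d][0] for d in self.deps)
        if vers != self.versions:
            self.versions = vers
            self.cache = self.fn({d: streams[d][1] for d in self.deps})
            self.evals += 1
        return self.cache

streams = {"map": (1, 0.9), "weather": (1, 0.5)}
task = ReactiveTask(lambda d: d["weather"] > 0.3, deps=["weather"])
ok = task.value(streams)
streams["map"] = (2, 0.8)      # 地图更新,但该子任务不依赖地图:命中缓存
task.value(streams)
streams["weather"] = (2, 0.1)  # 天气更新:需要重算
final = task.value(streams)
```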

[AI-32] Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI

【速读】:该论文试图解决生成式 AI(Generative AI)在高风险领域中将不确定性转化为看似权威的判断,从而削弱民主认知主体性(democratic epistemic agency)的问题。其解决方案的关键在于提出一种受布劳威尔(Brouwer)启发的可断言性约束(assertibility constraint):在高风险场景下,系统仅当能提供公开可检验且可争议的授权凭证(certificate of entitlement)时方可断言或否定命题,否则必须返回“未确定”(Undetermined)。这一约束构建了三状态接口语义(Asserted、Denied、Undetermined),通过凭证作为边界对象连接内部授权与公共立场,并生成随时间稳定但可随公共记录更新的时间索引授权轨迹。方案通过决策层阈值与argmax输出的门控机制实现,利用内部见证(如声学边界或分离裕度)和带理由编码的弃权协议(output contract with reason-coded abstentions)确保输出对挑战性依据负责,而非仅依赖置信度,从而在自动化话语进入公共论证时维护认知主体性。

链接: https://arxiv.org/abs/2603.03971
作者: Michael Jülich
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: Preprint. 63 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Generative AI can convert uncertainty into authoritative-seeming verdicts, displacing the justificatory work on which democratic epistemic agency depends. As a corrective, I propose a Brouwer-inspired assertibility constraint for responsible AI: in high-stakes domains, systems may assert or deny claims only if they can provide a publicly inspectable and contestable certificate of entitlement; otherwise they must return “Undetermined”. This constraint yields a three-status interface semantics (Asserted, Denied, Undetermined) that cleanly separates internal entitlement from public standing while connecting them via the certificate as a boundary object. It also produces a time-indexed entitlement profile that is stable under numerical refinement yet revisable as the public record changes. I operationalize the constraint through decision-layer gating of threshold and argmax outputs, using internal witnesses (e.g., sound bounds or separation margins) and an output contract with reason-coded abstentions. A design lemma shows that any total, certificate-sound binary interface already decides the deployed predicate on its declared scope, so “Undetermined” is not a tunable reject option but a mandatory status whenever no forcing witness is available. By making outputs answerable to challengeable warrants rather than confidence alone, the paper aims to preserve epistemic agency where automated speech enters public justification.
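
三状态接口语义(Asserted / Denied / Undetermined)的门控逻辑可以用如下极简示意表达(假设性实现;以分离裕度 margin 充当"授权凭证"、各阈值均为虚构参数):

```python
def gated_output(score, margin, high=0.8, low=0.2, min_margin=0.1):
    # 仅当存在足够的分离裕度(即可公开检验的授权凭证)时才断言或否定;
    # 否则必须返回 Undetermined —— 这是强制状态,而非可调的拒绝选项
    if margin < min_margin:
        return "Undetermined"
    if score >= high:
        return "Asserted"
    if score <= low:
        return "Denied"
    return "Undetermined"
```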

[AI-33] Generative AI in Managerial Decision-Making: Redefining Boundaries through Ambiguity Resolution and Sycophancy Analysis

【速读】:该论文试图解决生成式 AI (Generative AI) 在模糊商业情境中提供战略建议的可靠性问题,这是当前管理决策领域的一个关键知识空白。解决方案的关键在于构建一个四维商业模糊性分类体系,并通过“人在回路”实验设计,在战略、战术和运营场景中系统评估模型对模糊性的识别能力、解析流程对响应质量的提升作用,以及其在面对错误指令时的谄媚行为模式。研究发现,尽管模型在检测内部矛盾和语境模糊方面表现优异,但在结构化语言细微差别上存在局限;而通过结构化的模糊性解析流程可显著提升所有类型决策的质量,同时揭示出不同架构模型在谄媚行为上的差异,从而为将生成式 AI 视为认知辅助工具(cognitive scaffold)提供了实证依据,并强调需由人类进行监督以保障其作为战略伙伴的可靠性。

链接: https://arxiv.org/abs/2603.03970
作者: Sule Ozturk Birim,Fabrizio Marozzo,Yigit Kazancoglu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative artificial intelligence is increasingly being integrated into complex business workflows, fundamentally shifting the boundaries of managerial decision-making. However, the reliability of its strategic advice in ambiguous business contexts remains a critical knowledge gap. This study addresses this by comparing various models on ambiguity detection, evaluating how a systematic resolution process enhances response quality, and investigating their sycophantic behavior when presented with flawed directives. Using a novel four-dimensional business ambiguity taxonomy, we conducted a human-in-the-loop experiment across strategic, tactical, and operational scenarios. The resulting decisions were assessed with an “LLM-as-a-judge” framework on criteria including agreement, actionability, justification quality, and constraint adherence. Results reveal distinct performance capabilities. While models excel in detecting internal contradictions and contextual ambiguities, they struggle with structural linguistic nuances. Ambiguity resolution consistently increased response quality across all decision types, while sycophantic behavior analysis revealed distinct patterns depending on the model architecture. This study contributes to the bounded rationality literature by positioning GAI as a cognitive scaffold that can detect and resolve ambiguities managers might overlook, but whose own artificial limitations necessitate human management to ensure its reliability as a strategic partner.

[AI-34] FWaveFormer: Temporal-Frequency Collaborative Multi-level Wavelet Transformer for Dynamic Link Prediction

【速读】:该论文旨在解决现有基于Transformer的动态链接预测方法在捕捉复杂多尺度时间动态性方面性能受限的问题。其解决方案的关键在于提出一种名为TFWaveFormer的新颖Transformer架构,该架构通过融合时频分析与多分辨率小波分解来增强对动态图中时间模式的建模能力;具体包括三个核心组件:(i) 时频协调机制,联合建模时间域与频域表示;(ii) 可学习的多分辨率小波分解模块,利用并行卷积自适应提取多尺度时间特征,替代传统迭代小波变换;(iii) 混合Transformer模块,有效融合局部小波特征与全局时间依赖关系。实验证明,该方法在多个基准数据集上显著优于现有Transformer及混合模型,验证了时频分析与小波分解结合的有效性。

链接: https://arxiv.org/abs/2603.03963
作者: Hantong Feng,Yonggang Wu,Duxin Chen,Wenwu Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic link prediction plays a crucial role in diverse applications including social network analysis, communication forecasting, and financial modeling. While recent Transformer-based approaches have demonstrated promising results in temporal graph learning, their performance remains limited when capturing complex multi-scale temporal dynamics. In this paper, we propose TFWaveFormer, a novel Transformer architecture that integrates temporal-frequency analysis with multi-resolution wavelet decomposition to enhance dynamic link prediction. Our framework comprises three key components: (i) a temporal-frequency coordination mechanism that jointly models temporal and spectral representations, (ii) a learnable multi-resolution wavelet decomposition module that adaptively extracts multi-scale temporal patterns through parallel convolutions, replacing traditional iterative wavelet transforms, and (iii) a hybrid Transformer module that effectively fuses local wavelet features with global temporal dependencies. Extensive experiments on benchmark datasets demonstrate that TFWaveFormer achieves state-of-the-art performance, outperforming existing Transformer-based and hybrid models by significant margins across multiple metrics. The superior performance of TFWaveFormer validates the effectiveness of combining temporal-frequency analysis with wavelet decomposition in capturing complex temporal dynamics for dynamic link prediction tasks.
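
"用并行卷积一次性提取多尺度时间特征、替代迭代式小波变换"的思想可以用如下极简 NumPy 示意(假设性实现,此处以滑动平均核近似可学习卷积;核尺寸为虚构超参数):

```python
import numpy as np

def multi_resolution_features(x, kernel_sizes=(2, 4, 8)):
    # 对同一序列并行施加不同尺度的卷积核,一次前向得到多分辨率特征
    feats = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k  # 实际模型中为可学习权重
        feats.append(np.convolve(x, kernel, mode="same"))
    return np.stack(feats)

x = np.sin(np.linspace(0, 4 * np.pi, 64))
feats = multi_resolution_features(x)
```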

[AI-35] GIPO: Gaussian Importance Sampling Policy Optimization

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在后训练阶段面临的数据效率低下问题,尤其是在交互数据稀缺且迅速过时的多模态智能体(multimodal agents)场景中。为应对这一挑战,作者提出了一种基于截断重要性采样(truncated importance sampling)的策略优化目标GIPO(Gaussian Importance sampling Policy Optimization),其核心创新在于用基于对数比率的高斯信任权重(Gaussian trust weight)替代传统的硬裁剪(hard clipping),从而在软性抑制极端重要性比的同时保持非零梯度。理论分析表明,GIPO引入了一个隐式的、可调的更新幅度约束,同时通过集中边界保证了有限样本估计下的鲁棒性和稳定性;实验结果显示,GIPO在不同回放缓冲区大小下均优于现有基于裁剪的基线方法,在偏差-方差权衡、训练稳定性和样本效率方面表现更优。

链接: https://arxiv.org/abs/2603.03955
作者: Chengxuan Lu,Zhenquan Zhang,Shukuan Wang,Qunzhi Lin,Baigui Sun,Yang Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction data are scarce and quickly become outdated. To address this challenge, GIPO (Gaussian Importance sampling Policy Optimization) is proposed as a policy optimization objective based on truncated importance sampling, replacing hard clipping with a log-ratio-based Gaussian trust weight to softly damp extreme importance ratios while maintaining non-zero gradients. Theoretical analysis shows that GIPO introduces an implicit, tunable constraint on the update magnitude, while concentration bounds guarantee robustness and stability under finite-sample estimation. Experimental results show that GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes, from near on-policy to highly stale data, while exhibiting superior bias–variance trade-off, high training stability and improved sample efficiency.
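
GIPO 用基于 log-ratio 的高斯信任权重替代硬裁剪,其核心形式可以极简示意如下(假设性实现;sigma 为虚构超参数,具体目标函数以论文为准):

```python
import math

def gaussian_trust_weight(log_ratio, sigma=0.5):
    # on-policy(log_ratio ≈ 0)时权重接近 1;
    # 极端重要性比被软性压低,但权重与梯度均不归零
    return math.exp(-(log_ratio ** 2) / (2 * sigma ** 2))

def weighted_pg_term(ratio, advantage, sigma=0.5):
    # 用高斯权重调制重要性加权的策略梯度项,替代 PPO 式硬裁剪(示意)
    return gaussian_trust_weight(math.log(ratio), sigma) * ratio * advantage
```

与硬裁剪在区间外梯度直接归零不同,高斯权重随偏离程度平滑衰减,这正是摘要中"软性抑制极端重要性比同时保持非零梯度"的含义。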

[AI-36] Selecting Offline Reinforcement Learning Algorithms for Stochastic Network Control

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)在真实随机动态无线网络环境中的表现问题,特别是针对信道衰落、噪声和业务移动性等固有随机因素下算法鲁棒性不足的挑战。解决方案的关键在于系统评估了三类离线RL方法:基于贝尔曼方程的保守Q学习(Conservative Q-Learning)、基于序列建模的决策变换器(Decision Transformers)以及混合型的批判者引导决策变换器(Critic-Guided Decision Transformers),并结合开放获取的随机电信环境(mobile-env)进行实证分析。结果表明,保守Q学习在不同随机源下均展现出更强的策略鲁棒性,成为生命周期驱动AI管理框架中的可靠默认选择;而序列方法在高回报轨迹充足时可超越贝尔曼方法,为实际网络控制场景(如O-RAN和未来6G功能)提供了算法选型依据。

链接: https://arxiv.org/abs/2603.03932
作者: Nicolas Helson,Pegah Alizadeh,Anastasios Giovanidis
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
备注: Long version 12 pages, double column including Appendix. Short version accepted at NOMS2026-IPSN, Rome, Italy

点击查看摘要

Abstract:Offline Reinforcement Learning (RL) is a promising approach for next-generation wireless networks, where online exploration is unsafe and large amounts of operational data can be reused across the model lifecycle. However, the behavior of offline RL algorithms under genuinely stochastic dynamics – inherent to wireless systems due to fading, noise, and traffic mobility – remains insufficiently understood. We address this gap by evaluating Bellman-based (Conservative Q-Learning), sequence-based (Decision Transformers), and hybrid (Critic-Guided Decision Transformers) offline RL methods in an open-access stochastic telecom environment (mobile-env). Our results show that Conservative Q-Learning consistently produces more robust policies across different sources of stochasticity, making it a reliable default choice in lifecycle-driven AI management frameworks. Sequence-based methods remain competitive and can outperform Bellman-based approaches when sufficient high-return trajectories are available. These findings provide practical guidance for offline RL algorithm selection in AI-driven network control pipelines, such as O-RAN and future 6G functions, where robustness and data availability are key operational constraints.
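
文中表现最稳健的保守 Q 学习(CQL),其标志性的保守正则项可以用 NumPy 极简示意(假设性实现;Q 值矩阵为虚构示例,仅展示惩罚项的形状):

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    # CQL 的保守项:logsumexp(所有动作的 Q) - 数据集动作的 Q,
    # 用于压低分布外动作的价值估计
    logsumexp = np.log(np.exp(q_values).sum(axis=1))
    data_q = q_values[np.arange(len(q_values)), data_actions]
    return float((logsumexp - data_q).mean())

q = np.array([[1.0, 0.0], [0.0, 1.0]])
penalty = cql_penalty(q, np.array([0, 1]))
```

由于 logsumexp 始终不小于数据动作的 Q 值,该惩罚项非负,正是这种系统性的保守性被认为带来了随机环境下的鲁棒性。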

[AI-37] BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning CVPR2026

【速读】:该论文旨在解决模型合并(Model Merging, MM)在测试时分布偏移(test-time distribution shift)下可靠性不足的问题,即现有方法在假设测试数据与训练数据分布一致的前提下,容易产生偏差预测并导致泛化性能下降。其解决方案的关键在于提出一种无监督的、具备偏差感知能力的模型合并框架BD-Merging:首先引入联合证据头(joint evidential head),在统一标签空间中建模不确定性以捕捉跨任务语义依赖;其次基于此构建邻域差异得分(Adjacency Discrepancy Score, ADS),量化邻近样本间的证据一致性;最后利用ADS引导的差异感知对比学习机制,优化合并表示并训练一个去偏路由器(debiased router),实现按样本自适应分配任务或层特定权重,从而有效缓解分布偏移带来的负面影响。

链接: https://arxiv.org/abs/2603.03920
作者: Yuhan Xie,Chen Lyu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Model Merging (MM) has emerged as a scalable paradigm for multi-task learning (MTL), enabling multiple task-specific models to be integrated without revisiting the original training data. Despite recent progress, the reliability of MM under test-time distribution shift remains insufficiently understood. Most existing MM methods typically assume that test data are clean and distributionally aligned with both the training and auxiliary sources. However, this assumption rarely holds in practice, often resulting in biased predictions with degraded generalization. To address this issue, we present BD-Merging, a bias-aware unsupervised model merging framework that explicitly models uncertainty to achieve adaptive reliability under distribution shift. First, BD-Merging introduces a joint evidential head that learns uncertainty over a unified label space, capturing cross-task semantic dependencies in MM. Second, building upon this evidential foundation, we propose an Adjacency Discrepancy Score (ADS) that quantifies evidential alignment among neighboring samples. Third, guided by ADS, a discrepancy-aware contrastive learning mechanism refines the merged representation by aligning consistent samples and separating conflicting ones. Combined with general unsupervised learning, this process trains a debiased router that adaptively allocates task-specific or layer-specific weights on a per-sample basis, effectively mitigating the adverse effects of distribution shift. Extensive experiments across diverse tasks demonstrate that BD-Merging achieves superior effectiveness and robustness compared to state-of-the-art MM baselines.

[AI-38] PatchDecomp: Interpretable Patch-Based Time Series Forecasting

【速读】:该论文旨在解决时间序列预测模型在高精度与可解释性之间难以平衡的问题。当前许多神经网络模型虽然预测准确率高,但其内部机制复杂,缺乏对预测结果的透明解释能力。解决方案的关键在于提出PatchDecomp方法,该方法将输入时间序列划分为若干子序列(patch),并通过聚合每个patch的贡献来生成最终预测,从而实现对每个patch(包括外生变量)的清晰归因,既保持了与最新方法相当的预测性能,又提供了定量和定性的解释能力。

链接: https://arxiv.org/abs/2603.03902
作者: Hiroki Tomioka,Genta Yoshimura
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series forecasting, which predicts future values from past observations, plays a central role in many domains and has driven the development of highly accurate neural network models. However, the complexity of these models often limits human understanding of the rationale behind their predictions. We propose PatchDecomp, a neural network-based time series forecasting method that achieves both high accuracy and interpretability. PatchDecomp divides input time series into subsequences (patches) and generates predictions by aggregating the contributions of each patch. This enables clear attribution of each patch, including those from exogenous variables, to the final prediction. Experiments on multiple benchmark datasets demonstrate that PatchDecomp provides predictive performance comparable to recent forecasting methods. Furthermore, we show that the model’s explanations not only influence predicted values quantitatively but also offer qualitative interpretability through visualization of patch-wise contributions.
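
"切分 patch、聚合各 patch 贡献得到预测"的结构可以用如下极简线性示意(假设性实现;每 patch 取均值乘以虚构权重作为标量贡献,实际模型使用神经网络):

```python
import numpy as np

def patch_decomp_forecast(series, patch_len, weights):
    # 把输入序列切成 patch,对每个 patch 计算一个贡献,求和即为预测;
    # 每个 patch 的贡献可单独检视,从而提供可解释性
    patches = series.reshape(-1, patch_len)
    contributions = patches.mean(axis=1) * weights
    return contributions.sum(), contributions

series = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
pred, contrib = patch_decomp_forecast(series, patch_len=2,
                                      weights=np.array([0.1, 0.3, 0.6]))
```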

[AI-39] Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators

【速读】:该论文旨在解决现有内存计算(In-memory Computing, IMC)硬件加速器优化框架普遍针对单一神经网络工作负载进行设计,导致硬件高度专用化且跨模型和应用场景泛化能力差的问题。其解决方案的关键在于提出一种基于优化进化算法的软硬件协同优化框架,通过显式建模多工作负载间的权衡关系而非仅优化单个模型,从而显著缩小专用设计与通用设计之间的性能差距。该方法在RRAM和SRAM两种IMC架构上均表现出强鲁棒性和适应性,相较于基线方法,在小规模(4个工作负载)和大规模(9个工作负载)优化场景下分别实现高达76.2%和95.5%的能量-延迟-面积乘积(Energy-Delay-Area Product, EDAP)降低。

链接: https://arxiv.org/abs/2603.03880
作者: Olga Krestinskaya,Mohammed E. Fouda,Ahmed Eltawil,Khaled N. Salama
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
备注: Accepted to IEEE Access

点击查看摘要

Abstract:Software-hardware co-design is essential for optimizing in-memory computing (IMC) hardware accelerators for neural networks. However, most existing optimization frameworks target a single workload, leading to highly specialized hardware designs that do not generalize well across models and applications. In contrast, practical deployment scenarios require a single IMC platform that can efficiently support multiple neural network workloads. This work presents a joint hardware-workload co-optimization framework based on an optimized evolutionary algorithm for designing generalized IMC accelerator architectures. By explicitly capturing cross-workload trade-offs rather than optimizing for a single model, the proposed approach significantly reduces the performance gap between workload-specific and generalized IMC designs. The framework is evaluated on both RRAM- and SRAM-based IMC architectures, demonstrating strong robustness and adaptability across diverse design scenarios. Compared to baseline methods, the optimized designs achieve energy-delay-area product (EDAP) reductions of up to 76.2% and 95.5% when optimizing across a small set (4 workloads) and a large set (9 workloads), respectively. The source code of the framework is available at this https URL.
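
论文的优化目标 EDAP 及"跨工作负载联合适应度"可以极简示意如下(假设性实现;evaluate 为虚构的设计评估函数,真实框架中由 IMC 仿真给出能耗、时延与面积):

```python
def edap(energy, delay, area):
    # 能量-时延-面积乘积:IMC 加速器设计空间探索的联合评价指标
    return energy * delay * area

def multi_workload_fitness(evaluate, workloads):
    # 对所有工作负载的 EDAP 求和作为进化算法的适应度,
    # 显式建模跨负载权衡,而非只针对单一模型优化
    return sum(edap(*evaluate(w)) for w in workloads)

# 虚构评估:能耗随负载规模线性增长,时延与面积固定
fitness = multi_workload_fitness(lambda w: (w, 2.0, 3.0), workloads=[1.0, 2.0])
```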

[AI-40] Structure-Aware Distributed Backdoor Attacks in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning)中后门攻击的隐蔽性问题,特别是现有研究普遍假设相同扰动在不同模型架构下具有相似效果,而忽略了模型结构对扰动传播与聚合的影响。其解决方案的关键在于提出两个结构感知指标——结构响应评分(Structural Responsiveness Score, SRS)和结构兼容系数(Structural Compatibility Coefficient, SCC),并基于此构建了一个结构感知分形扰动注入框架(TFI)。实验表明,模型架构显著影响扰动的传播与保留能力,其中多路径特征融合结构能有效放大和维持分形扰动,而SCC与攻击成功率高度相关,可预测扰动存活概率。这一发现揭示了后门行为不仅取决于扰动设计或中毒强度,更依赖于模型结构与聚合机制之间的交互作用,为结构感知的防御策略提供了新思路。

链接: https://arxiv.org/abs/2603.03865
作者: Wang Jian,Shen Hong,Ke Wei,Liu Xue Hua
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 17pages,12 figures

点击查看摘要

Abstract:While federated learning protects data privacy, it also makes the model update process vulnerable to long-term stealthy perturbations. Existing studies on backdoor attacks in federated learning mainly focus on trigger design or poisoning strategies, typically assuming that identical perturbations behave similarly across different model architectures. This assumption overlooks the impact of model structure on perturbation effectiveness. From a structure-aware perspective, this paper analyzes the coupling relationship between model architectures and backdoor perturbations. We introduce two metrics, Structural Responsiveness Score (SRS) and Structural Compatibility Coefficient (SCC), to measure a model’s sensitivity to perturbations and its preference for fractal perturbations. Based on these metrics, we develop a structure-aware fractal perturbation injection framework (TFI) to study the role of architectural properties in the backdoor injection process. Experimental results show that model architecture significantly influences the propagation and aggregation of perturbations. Networks with multi-path feature fusion can amplify and retain fractal perturbations even under low poisoning ratios, while models with low structural compatibility constrain their effectiveness. Further analysis reveals a strong correlation between SCC and attack success rate, suggesting that SCC can predict perturbation survivability. These findings highlight that backdoor behaviors in federated learning depend not only on perturbation design or poisoning intensity but also on the interaction between model architecture and aggregation mechanisms, offering new insights for structure-aware defense design.

[AI-41] Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation

【速读】:该论文旨在解决交互式推荐系统(Interactive Recommender Systems, IRS)中因状态估计失真导致的公平性与准确性冲突问题。现有公平性感知方法常假设观测到的用户状态能忠实反映真实偏好,但实际中隐式反馈受流行度驱动噪声和曝光偏差污染,形成误导强化学习(Reinforcement Learning, RL)代理的高熵畸变状态。解决方案的关键在于将公平推荐重构为潜在状态净化问题,并引入基于扩散模型的去噪状态表示模块(Denoising State Representation Module, DSRM),从高熵、噪声化的交互历史中恢复低熵的潜在偏好流形;在此纯净状态基础上,构建分层强化学习(Hierarchical Reinforcement Learning, HRL)机制,通过高层策略调控长期公平轨迹、低层策略在动态约束下优化短期参与度,从而打破“富者愈富”的反馈循环,实现推荐效用与曝光公平性的帕累托最优平衡。

链接: https://arxiv.org/abs/2603.03820
作者: Yun Lu,Xiaoyu Shi,Hong Xie,Xiangyu Zhao,Mingsheng Shang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interactive recommender systems (IRS) are increasingly optimized with Reinforcement Learning (RL) to capture the sequential nature of user-system dynamics. However, existing fairness-aware methods often suffer from a fundamental oversight: they assume the observed user state is a faithful representation of true preferences. In reality, implicit feedback is contaminated by popularity-driven noise and exposure bias, creating a distorted state that misleads the RL agent. We argue that the persistent conflict between accuracy and fairness is not merely a reward-shaping issue, but a state estimation failure. In this work, we propose \textbfDSRM-HRL, a framework that reformulates fairness-aware recommendation as a latent state purification problem followed by decoupled hierarchical decision-making. We introduce a Denoising State Representation Module (DSRM) based on diffusion models to recover the low-entropy latent preference manifold from high-entropy, noisy interaction histories. Built upon this purified state, a Hierarchical Reinforcement Learning (HRL) agent is employed to decouple conflicting objectives: a high-level policy regulates long-term fairness trajectories, while a low-level policy optimizes short-term engagement under these dynamic constraints. Extensive experiments on high-fidelity simulators (KuaiRec, KuaiRand) demonstrate that DSRM-HRL effectively breaks the “rich-get-richer” feedback loop, achieving a superior Pareto frontier between recommendation utility and exposure equity.

[AI-42] Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

TL;DR: This paper studies continual learning in robot policy learning, i.e., acquiring new skills over time without catastrophic forgetting. Its key finding is that large-scale pretrained Vision-Language-Action (VLA) models are intrinsically far more resistant to forgetting: pretraining makes simple Experience Replay (ER) with a small replay buffer achieve near-zero forgetting, and VLA models retain relevant knowledge from prior tasks even when performance degrades while learning new ones, so seemingly forgotten skills can be quickly recovered via finetuning. This suggests that large-scale pretraining fundamentally changes the dynamics of continual learning, making simple replay-based continual learning viable.

Link: https://arxiv.org/abs/2603.03818
Authors: Huihan Liu,Changyeon Kim,Bo Liu,Minghuan Liu,Yuke Zhu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we found that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we found that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at this https URL
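The Simple Experience Replay (ER) mechanism highlighted above can be illustrated with a minimal sketch: a small reservoir-sampled buffer of past-task samples mixed into new-task training batches. The class and sample names below are hypothetical illustrations, not the paper's implementation.

```python
import random

class ExperienceReplayBuffer:
    """Fixed-size buffer of past-task samples, mixed into new-task batches."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.rng = random.Random(seed)
        self.seen = 0

    def add(self, sample):
        # Reservoir sampling keeps a uniform subsample of everything seen.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample

    def mixed_batch(self, new_samples, replay_fraction=0.5):
        # Append replayed old-task samples after the new-task samples.
        k = min(int(len(new_samples) * replay_fraction), len(self.data))
        return list(new_samples) + self.rng.sample(self.data, k)

buffer = ExperienceReplayBuffer(capacity=4)
for s in ["t1_a", "t1_b", "t1_c", "t1_d", "t1_e", "t1_f"]:
    buffer.add(s)
batch = buffer.mixed_batch(["t2_x", "t2_y"], replay_fraction=1.0)
```

The paper's observation is that, with a pretrained VLA, even a small `capacity` suffices to keep forgetting near zero.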

[AI-43] Relational In-Context Learning via Synthetic Pre-training with Structural Prior

TL;DR: This paper tackles the absence of foundation models for relational databases (RDBs) comparable to those in text or vision; the core obstacle is that high-quality RDB data is private, scarce, and structurally heterogeneous, making internet-scale pretraining infeasible. The key is RDB-PFN, the first relational foundation model trained purely on synthetic data: a novel Relational Prior Generator creates an unbounded stream of diverse synthetic RDBs from structural causal models (SCMs), and pretraining on over 2 million single-table and relational tasks lets the model adapt to new databases instantly via genuine in-context learning, with strong few-shot performance on 19 real-world relational prediction tasks.

Link: https://arxiv.org/abs/2603.03805
Authors: Yanbo Wang,Jiaxuan You,Chuan Shi,Muhan Zhang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments verify RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming graph-based and single-table foundation-model baselines (given the same DFS-linearized inputs), while using a lightweight architecture and fast inference. The code is available at this https URL
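The PFN-style idea of pretraining on synthetic data drawn from structural causal models can be sketched as below: a random linear SCM over a DAG generates one synthetic table. This is a generic single-table illustration under assumed Gaussian mechanisms, not RDB-PFN's actual relational prior generator.

```python
import numpy as np

def sample_scm_table(n_rows, n_cols, rng):
    """Sample a table whose columns follow a random linear SCM over a DAG.
    Column j depends only on columns with smaller index (a topological order)."""
    weights = np.triu(rng.normal(size=(n_cols, n_cols)), k=1)  # DAG edges i -> j for i < j
    weights *= rng.random((n_cols, n_cols)) < 0.5              # sparsify edges
    table = np.zeros((n_rows, n_cols))
    for j in range(n_cols):
        noise = rng.normal(size=n_rows)                        # exogenous noise
        table[:, j] = table[:, :j] @ weights[:j, j] + noise
    return table

rng = np.random.default_rng(0)
table = sample_scm_table(n_rows=100, n_cols=5, rng=rng)
```

A relational prior, as the abstract describes, would additionally sample schemas and foreign-key links between many such tables.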

[AI-44] Zero-Knowledge Proof (ZKP) Authentication for Offline CBDC Payment System Using IoT Devices

TL;DR: This paper addresses secure, privacy-preserving offline central bank digital currency (CBDC) payments that satisfy anti-money-laundering and counter-terrorism-financing (AML/CFT) compliance on resource-constrained IoT devices. The key is a privacy-preserving offline CBDC model that integrates secure elements (SEs), zero-knowledge proofs (ZKPs), and intermittent synchronization: lightweight zero-knowledge cryptography and a hybrid architecture combining online and offline payment modes provide double-spending prevention, low computational overhead, digital identity management, and user privacy, while supporting automated transactions without internet connectivity, improving financial inclusion and payment efficiency in remote areas.

Link: https://arxiv.org/abs/2603.03804
Authors: Santanu Mondal,T. Chithralekha
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract

Abstract:Central Bank Digital Currencies (CBDCs) are becoming a new digital financial tool aimed at financial inclusion, increased monetary stability, and improved efficiency of payment systems, as they are issued by central banks. One of the most important aspects is that a CBDC must offer secure offline payment methods, allowing users to retain cash-like access without violating Anti-Money Laundering and Counter-Terrorism Financing (AML/CFT) rules. Offline CBDC ecosystems will provide financial inclusion, empower underserved communities, and ensure equitable access to digital payments, even in connectivity-poor remote locations. With the rapid growth of Internet of Things (IoT) devices in everyday life, these devices are capable of performing secure digital transactions. Integrating offline CBDC payment with IoT devices enables seamless, automated payment without internet connectivity. However, IoT devices face particular challenges due to their resource-constrained nature, which makes it difficult to include features such as double-spending prevention, privacy preservation, low-computation operation, and digital identity management. This work proposes a privacy-preserving offline CBDC model with integrated secure elements (SEs), zero-knowledge proofs (ZKPs), and intermittent synchronisation to conduct offline payments on IoT hardware. The proposed model builds on recent improvements in offline CBDC prototypes, regulations, and cryptographic design choices, such as a hybrid architecture that combines online and offline payment on IoT devices using secure hardware with a lightweight zero-knowledge-proof cryptographic algorithm.

[AI-45] A Rubric-Supervised Critic from Sparse Real-World Outcomes

TL;DR: This paper addresses the gap between coding agents that excel on academic benchmarks and their weaker effectiveness in real human-in-the-loop settings: academic evaluation relies on verifiable rewards (e.g., passing unit tests), whereas real-world success signals are sparse, delayed, and noisy. The key idea is to learn a critic model from human-agent interaction traces via Critic Rubrics, a rubric-based supervision framework of 24 behavioral features that can be derived from interaction logs alone, with no extra annotation. A semi-supervised objective jointly predicts these behavioral features and sparse human feedback, yielding an efficient critic usable as a reward model for RL-based training or for inference-time scoring. Experiments show clear gains: best-of-N reranking on SWE-bench (Best@8 +15.9), early stopping with 83% fewer attempts, and better training-data curation.

Link: https://arxiv.org/abs/2603.03800
Authors: Xingyao Wang,Valerie Chen,Heng Ji,Graham Neubig
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a “critic” model from sparse and noisy interaction data, which can then be used both as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 over the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
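Best-of-N reranking with a rubric-based critic, as evaluated above, reduces to scoring each trajectory and keeping the argmax. The rubric features and weights below are hypothetical placeholders; the paper's critic is a learned model over 24 behavioral features, not a hand-set linear scorer.

```python
def best_of_n(trajectories, critic):
    """Pick the trajectory the critic scores highest (Best-of-N reranking)."""
    return max(trajectories, key=critic)

# Hypothetical critic: a weighted sum over rubric features extracted from a trace.
RUBRIC_WEIGHTS = {"tests_ran": 1.0, "edited_files": 0.5, "reverted_changes": -1.0}

def rubric_critic(trace):
    return sum(RUBRIC_WEIGHTS[k] * v for k, v in trace["features"].items())

candidates = [
    {"id": "a", "features": {"tests_ran": 0, "edited_files": 2, "reverted_changes": 1}},
    {"id": "b", "features": {"tests_ran": 1, "edited_files": 1, "reverted_changes": 0}},
]
best = best_of_n(candidates, rubric_critic)
```

Early stopping follows the same pattern: stop sampling once a candidate's critic score clears a threshold.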

[AI-46] Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

TL;DR: This paper addresses a dilemma facing world models in agentic systems: hand-engineered explicit simulators offer consistency and reproducibility but are hard to adapt to new environments, while implicit neural models are flexible but difficult to constrain, verify, and debug over long horizons. To strike a principled balance between reliability and flexibility, the paper synthesizes explicit, executable discrete-event world models directly from natural-language specifications. The key is the DEVS (Discrete Event System Specification) formalism plus a staged LLM generation pipeline that separates structural inference of component interactions from component-level event and timing logic; in the absence of a unique ground truth, generated simulators are validated via structured event traces checked against specification-derived temporal and semantic constraints, yielding behavioral consistency over long-horizon rollouts, verifiability from observable behavior, and efficient on-demand synthesis during online execution.

Link: https://arxiv.org/abs/2603.03784
Authors: Zheyu Chen,Zhuohuan Li,Chuanhao Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 34 pages, 5 figures

Click to view abstract

Abstract:World models are essential for planning and evaluation in agentic systems, yet existing approaches lie at two extremes: hand-engineered simulators that offer consistency and reproducibility but are costly to adapt, and implicit neural models that are flexible but difficult to constrain, verify, and debug over long horizons. We seek a principled middle ground that combines the reliability of explicit simulators with the flexibility of learned models, allowing world models to be adapted during online execution. By targeting a broad class of environments whose dynamics are governed by the ordering, timing, and causality of discrete events, such as queueing and service operations, embodied task planning, and message-mediated multi-agent coordination, we advocate explicit, executable discrete-event world models synthesized directly from natural-language specifications. Our approach adopts the DEVS formalism and introduces a staged LLM-based generation pipeline that separates structural inference of component interactions from component-level event and timing logic. To evaluate generated models without a unique ground truth, simulators emit structured event traces that are validated against specification-derived temporal and semantic constraints, enabling reproducible verification and localized diagnostics. Together, these contributions produce world models that are consistent over long-horizon rollouts, verifiable from observable behavior, and efficient to synthesize on demand during online execution.
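A DEVS atomic model, the formalism this pipeline targets, is defined by a state, a time-advance function ta(s), an output function, and an internal transition. Below is a minimal executable sketch (a periodic job generator plus a hypothetical trace collector); it illustrates the formalism only, not the paper's generated simulators.

```python
class AtomicDEVS:
    """Minimal DEVS atomic model: a generator that emits 'job' every `period`."""
    def __init__(self, period):
        self.period = period
        self.state = "active"

    def time_advance(self):         # ta(s): time until the next internal event
        return self.period if self.state == "active" else float("inf")

    def output(self):               # lambda(s): output emitted just before the transition
        return "job"

    def internal_transition(self):  # delta_int(s): state after the internal event
        self.state = "active"

def simulate(model, until):
    """Drive one atomic model, collecting a timestamped event trace."""
    t, trace = 0.0, []
    while True:
        t += model.time_advance()
        if t > until:
            return trace
        trace.append((t, model.output()))
        model.internal_transition()

trace = simulate(AtomicDEVS(period=2.0), until=7.0)
```

Traces like this one are exactly the artifacts the paper validates against specification-derived temporal and semantic constraints.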

[AI-47] LifeBench: A Benchmark for Long-Horizon Multi-Source Memory

TL;DR: This paper addresses the fact that existing memory benchmarks focus on declarative memory (semantic and episodic types) while neglecting equally important non-declarative memory, such as habitual and procedural memory, which must be inferred from diverse digital traces and operates over long, temporally extended contexts. To fill this gap, Lifebench builds a densely connected, long-horizon event simulation that requires agents to integrate declarative and non-declarative memory reasoning. The solution rests on two pillars: real-world priors (anonymized social surveys, map APIs, and holiday-integrated calendars) ensure data quality and behavioral rationality, and an event-generation scheme structured by the partonomic hierarchy from cognitive science enables efficient parallel generation while preserving global coherence, improving both scalability and fidelity.

Link: https://arxiv.org/abs/2603.03781
Authors: Zihao Cheng,Weixin Wang,Yu Zhao,Ziyang Ren,Jiaxuan Chen,Ruiyang Xu,Shuai Huang,Yang Chen,Guowei Li,Mengshi Wang,Yi Xie,Ren Zhu,Zeren Jiang,Keda Lu,Yihong Li,Xiaoliang Wang,Liwei Liu,Cam-Tu Nguyen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 28 pages in total (8-page main text), 15 figures and tables

Click to view abstract

Abstract:Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world actions are also governed by non-declarative memory, including habitual and procedural types, and need to be inferred from diverse digital traces. To bridge this gap, we introduce Lifebench, which features densely connected, long-horizon event simulation. It pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning across diverse and temporally extended contexts. Building such a benchmark presents two key challenges: ensuring data quality and scalability. We maintain data quality by employing real-world priors, including anonymized social surveys, map APIs, and holiday-integrated calendars, thus enforcing fidelity, diversity and behavioral rationality within the dataset. Towards scalability, we draw inspiration from cognitive science and structure events according to their partonomic hierarchy; enabling efficient parallel generation while maintaining global coherence. Performance results show that top-tier, state-of-the-art memory systems reach just 55.2% accuracy, highlighting the inherent difficulty of long-horizon retrieval and multi-source integration within our proposed benchmark. The dataset and data synthesis code are available at this https URL.

[AI-48] Towards Effective Orchestration of AI x DB Workloads

TL;DR: This paper examines the systemic challenges of integrating AI models directly into database engines (the AIxDB architecture), including managing joint query processing and model execution, optimizing end-to-end performance, coordinating execution under resource contention, and enforcing strong security and access-control guarantees. The key is to re-examine core database components (such as transaction management and access control) to support AI lifecycle management, mitigate data drift, and protect sensitive data from unauthorized AI operations, while improving the serving performance of AIxDB queries through careful query optimization, scheduling, and distributed execution over heterogeneous hardware.

Link: https://arxiv.org/abs/2603.03772
Authors: Naili Xing,Haotian Gao,Zhanhao Zhao,Shaofeng Cai,Zhaojing Luo,Yuncheng Wu,Zhongle Xie,Meihui Zhang,Beng Chin Ooi
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:AI-driven analytics are increasingly crucial to data-centric decision-making. The practice of exporting data to machine learning runtimes incurs high overhead, limits robustness to data drift, and expands the attack surface, especially in multi-tenant, heterogeneous data systems. Integrating AI directly into database engines, while offering clear benefits, introduces challenges in managing joint query processing and model execution, optimizing end-to-end performance, coordinating execution under resource contention, and enforcing strong security and access-control guarantees. This paper discusses the challenges of joint DB-AI, or AIxDB, data management and query processing within AI-powered data systems. It presents various challenges that need to be addressed carefully, such as query optimization, execution scheduling, and distributed execution over heterogeneous hardware. Database components such as transaction management and access control need to be re-examined to support AI lifecycle management, mitigate data drift, and protect sensitive data from unauthorized AI operations. We present a design and preliminary results to demonstrate what may be key to the performance for serving AIxDB queries.

[AI-49] Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport

TL;DR: This paper addresses the difficulty of coupling long-horizon coordination decisions with physical execution in multi-agent human-robot collaboration (HRC), in particular sustaining System 2-style deliberation together with low-latency continuous control under contact, feasibility, and safety constraints. The key is a hierarchical cognition-to-control (C2C) architecture with three layers: (i) a vision-language model (VLM) based grounding layer that maintains scene referents and infers embodiment-aware affordances and constraints; (ii) a deliberative skill/coordination layer, the System 2 core, modeled via decentralized multi-agent reinforcement learning (MARL) as a Markov potential game with a shared potential, which optimizes long-horizon skill choices and sequences; and (iii) a whole-body control layer that executes the selected skills at high frequency while ensuring kinematic/dynamic feasibility and contact stability. A residual policy internalizes partner dynamics without explicit role assignment; experiments on collaborative manipulation tasks show higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.

Link: https://arxiv.org/abs/2603.03768
Authors: Hao Zhang,Ding Zhao,H. Eric Tseng
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and leave under-specified how sustained System 2-style deliberation can be integrated with reliable, low-latency continuous control. This gap is acute in multi-agent HRC, where long-horizon coordination decisions and physical execution must co-evolve under contact, feasibility, and safety constraints. We address this limitation with cognition-to-control (C2C), a three-layer hierarchy that makes the deliberation-to-control pathway explicit: (i) a VLM-based grounding layer that maintains persistent scene referents and infers embodiment-aware affordances/constraints; (ii) a deliberative skill/coordination layer-the System 2 core-that optimizes long-horizon skill choices and sequences under human-robot coupling via decentralized MARL cast as a Markov potential game with a shared potential encoding task progress; and (iii) a whole-body control layer that executes the selected skills at high frequency while enforcing kinematic/dynamic feasibility and contact stability. The deliberative layer is realized as a residual policy relative to a nominal controller, internalizing partner dynamics without explicit role assignment. Experiments on collaborative manipulation tasks show higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.

[AI-50] Agentic Peer-to-Peer Networks: From Content Distribution to Capability and Action Sharing

TL;DR: This paper addresses the networking foundations needed when client-side autonomous agents (CSAAs) running on local edge devices collaborate by delegating subtasks directly to one another, forming agentic peer-to-peer (P2P) networks, and in particular how to safely and efficiently discover and execute capabilities that are heterogeneous, state-dependent, and potentially unsafe. The key is a plane-based reference architecture that decouples connectivity/identity, semantic discovery, and execution; signed, soft-state capability descriptors support intent- and constraint-aware discovery; and a tiered verification spectrum, from reputation signals (Tier 1), through lightweight canary challenge-response (Tier 2), to evidence packages such as signed tool receipts or attestation (Tier 3), secures collaboration in adversarial settings. Experiments show substantially higher end-to-end workflow success with near-constant discovery latency and modest control-plane overhead.

Link: https://arxiv.org/abs/2603.03753
Authors: Taotao Wang,Lizhao You,Jingwen Tong,Chonghe Zhao,Shengli Zhang
Affiliation: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Click to view abstract

Abstract:The ongoing shift of AI models from centralized cloud APIs to local AI agents on edge devices is enabling Client-Side Autonomous Agents (CSAAs) – persistent personal agents that can plan, access local context, and invoke tools on behalf of users. As these agents begin to collaborate by delegating subtasks directly between clients, they naturally form Agentic Peer-to-Peer (P2P) Networks. Unlike classic file-sharing overlays where the exchanged object is static, hash-indexed content (e.g., files in BitTorrent), agentic overlays exchange capabilities and actions that are heterogeneous, state-dependent, and potentially unsafe if delegated to untrusted peers. This article outlines the networking foundations needed to make such collaboration practical. We propose a plane-based reference architecture that decouples connectivity/identity, semantic discovery, and execution. Besides, we introduce signed, soft-state capability descriptors to support intent- and constraint-aware discovery. To cope with adversarial settings, we further present a tiered verification spectrum: Tier 1 relies on reputation signals, Tier 2 applies lightweight canary challenge-response with fallback selection, and Tier 3 requires evidence packages such as signed tool receipts/traces (and, when applicable, attestation). Using a discrete-event simulator that models registry-based discovery, Sybil-style index poisoning, and capability drift, we show that tiered verification substantially improves end-to-end workflow success while keeping discovery latency near-constant and control-plane overhead modest.
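The Tier-2 canary challenge-response can be sketched with a standard HMAC over a random nonce; the shared key and its provisioning below are assumptions for illustration, not the article's protocol details.

```python
import hmac, hashlib, os

def issue_canary(peer_key):
    """Tier-2 check: send a random nonce; an honest peer returns HMAC(key, nonce)."""
    nonce = os.urandom(16)
    expected = hmac.new(peer_key, nonce, hashlib.sha256).digest()
    return nonce, expected

def peer_respond(peer_key, nonce):
    # What a peer computes when challenged.
    return hmac.new(peer_key, nonce, hashlib.sha256).digest()

# Hypothetical key established during capability exchange.
shared_key = b"provisioned-during-capability-exchange"
nonce, expected = issue_canary(shared_key)
honest = hmac.compare_digest(peer_respond(shared_key, nonce), expected)
impostor = hmac.compare_digest(peer_respond(b"wrong-key", nonce), expected)
```

On a failed canary, the article's Tier-2 design falls back to selecting an alternative peer.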

[AI-51] Interaction-Aware Whole-Body Control for Compliant Object Transport

TL;DR: This paper addresses the unreliability of tracking-centric whole-body control (WBC) for collaborative object transport in unstructured environments, where strong, time-varying contact forces arise, especially in close-contact human-robot support tasks. The key is a bio-inspired interaction-oriented whole-body control (IO-WBC) that acts as an artificial cerebellum, translating skill-level commands into stable, physically consistent whole-body behavior under contact; it structurally separates upper-body interaction execution from lower-body support control so the robot can maintain balance while shaping force exchange in a tightly coupled robot-object system. A trajectory-optimized reference generator (RG) is combined with a reinforcement learning (RL) policy trained in simulation under randomized payload mass and external disturbances, and deployed via asymmetric teacher-student distillation so that at runtime the policy relies only on proprioceptive histories.

Link: https://arxiv.org/abs/2603.03751
Authors: Hao Zhang,Yves Tseng,Ding Zhao,H. Eric Tseng
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Cooperative object transport in unstructured environments remains challenging for assistive humanoids because strong, time-varying interaction forces can make tracking-centric whole-body control unreliable, especially in close-contact support tasks. This paper proposes a bio-inspired, interaction-oriented whole-body control (IO-WBC) that functions as an artificial cerebellum - an adaptive motor agent that translates upstream (skill-level) commands into stable, physically consistent whole-body behavior under contact. This work structurally separates upper-body interaction execution from lower-body support control, enabling the robot to maintain balance while shaping force exchange in a tightly coupled robot-object system. A trajectory-optimized reference generator (RG) provides a kinematic prior, while a reinforcement learning (RL) policy governs body responses under heavy-load interactions and disturbances. The policy is trained in simulation with randomized payload mass/inertia and external perturbations, and deployed via asymmetric teacher-student distillation so that the student relies only on proprioceptive histories at runtime. Extensive experiments demonstrate that IO-WBC maintains stable whole-body behavior and physical interaction even when precise velocity tracking becomes infeasible, enabling compliant object transport across a wide range of scenarios.

[AI-52] JANUS: Structured Bidirectional Generation for Guaranteed Constraints and Analytical Uncertainty

TL;DR: This paper addresses the quadrilemma of high-stakes synthetic data generation: simultaneously achieving fidelity to the original distribution, control over complex logical constraints, reliable uncertainty estimation, and computational efficiency. Existing methods such as CTGAN and TabDDPM offer high fidelity but rely on inefficient rejection sampling for continuous range constraints, while structural causal models offer logical control but struggle with high-dimensional fidelity and complex noise inversion. The key innovation of JANUS is Reverse-Topological Back-filling, which propagates constraints backwards through a DAG of Bayesian decision trees to achieve 100% constraint satisfaction without rejection sampling, paired with an analytical uncertainty decomposition derived from Dirichlet priors that is 128x faster than Monte Carlo estimation. Across 15 datasets and 523 constrained scenarios, JANUS attains state-of-the-art fidelity (detection score 0.497) and exactly handles complex inter-column constraints (e.g., Salary_offered = Salary_requested).

Link: https://arxiv.org/abs/2603.03748
Authors: Taha Racicot
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures, 14 tables

Click to view abstract

Abstract:High-stakes synthetic data generation faces a fundamental Quadrilemma: achieving Fidelity to the original distribution, Control over complex logical constraints, Reliability in uncertainty estimation, and Efficiency in computational cost – simultaneously. State-of-the-art Deep Generative Models (CTGAN, TabDDPM) excel at fidelity but rely on inefficient rejection sampling for continuous range constraints. Conversely, Structural Causal Models offer logical control but struggle with high-dimensional fidelity and complex noise inversion. We introduce JANUS (Joint Ancestral Network for Uncertainty and Synthesis), a framework that unifies these capabilities using a DAG of Bayesian Decision Trees. Our key innovation is Reverse-Topological Back-filling, an algorithm that propagates constraints backwards through the causal graph, achieving 100% constraint satisfaction on feasible constraint sets without rejection sampling. This is paired with an Analytical Uncertainty Decomposition derived from Dirichlet priors, enabling 128x faster uncertainty estimation than Monte Carlo methods. Across 15 datasets and 523 constrained scenarios, JANUS achieves state-of-the-art fidelity (Detection Score 0.497), eliminates mode collapse on imbalanced data, and provides exact handling of complex inter-column constraints (e.g., Salary_offered = Salary_requested) where baselines fail entirely.
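The idea of propagating constraints backwards through a causal graph, as in Reverse-Topological Back-filling, can be illustrated with simple interval tightening over a toy two-column DAG. The column names and the `value(child) >= value(parent)` constraint are hypothetical; JANUS operates on Bayesian decision trees, not raw intervals.

```python
def backfill_intervals(dag, intervals, order):
    """Propagate interval constraints backwards through a DAG so that every
    ancestor's sampling range already respects its descendants' constraints.
    An edge `dag[child] = parent` encodes: value(child) >= value(parent)."""
    for child in reversed(order):                     # reverse topological order
        parent = dag.get(child)
        if parent is None:
            continue
        lo_c, hi_c = intervals[child]
        lo_p, hi_p = intervals[parent]
        intervals[parent] = (lo_p, min(hi_p, hi_c))   # parent cannot exceed child's max
        intervals[child] = (max(lo_c, lo_p), hi_c)    # child cannot undercut parent's min
    return intervals

# Hypothetical columns: salary_offered must be >= salary_requested.
order = ["salary_requested", "salary_offered"]
dag = {"salary_offered": "salary_requested"}
intervals = {"salary_requested": (30, 120), "salary_offered": (40, 100)}
result = backfill_intervals(dag, intervals, order)
```

Because every ancestor's range is tightened before sampling, no rejection step is needed afterwards, which is the point the abstract makes.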

[AI-53] RAGNav: A Retrieval-Augmented Topological Reasoning Framework for Multi-Goal Visual-Language Navigation

TL;DR: This paper addresses spatial hallucinations and planning drift in multi-goal vision-language navigation (Multi-Goal VLN), which arise when handling associations among multiple goal entities without explicit spatial structure. The key is the RAGNav framework built around a Dual-Basis Memory that fuses a low-level topological map preserving physical connectivity with a high-level semantic forest for hierarchical environment abstraction, bridging semantic reasoning and physical structure. On top of this, an anchor-guided conditional retrieval and a topological neighbor score propagation mechanism screen candidate targets, suppress semantic noise, and exploit the topology to strengthen inter-target reachability reasoning and sequential planning efficiency, yielding clear gains on complex multi-goal navigation tasks.

Link: https://arxiv.org/abs/2603.03745
Authors: Ling Luo,Qiangian Bai
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Vision-Language Navigation (VLN) is evolving from single-point pathfinding toward the more challenging Multi-Goal VLN. This task requires agents to accurately identify multiple entities while collaboratively reasoning over their spatial-physical constraints and sequential execution order. However, generic Retrieval-Augmented Generation (RAG) paradigms often suffer from spatial hallucinations and planning drift when handling multi-object associations due to the lack of explicit spatial structure. To address these challenges, we propose RAGNav, a framework that bridges the gap between semantic reasoning and physical structure. The core of RAGNav is a Dual-Basis Memory system, which integrates a low-level topological map for maintaining physical connectivity with a high-level semantic forest for hierarchical environment abstraction. Building on this representation, the framework introduces an anchor-guided conditional retrieval and a topological neighbor score propagation mechanism. This approach facilitates the rapid screening of candidate targets and the elimination of semantic noise, while performing semantic calibration by leveraging the physical associations inherent in the topological map. This mechanism significantly enhances the capability of inter-target reachability reasoning and the efficiency of sequential planning. Experimental results demonstrate that RAGNav achieves state-of-the-art (SOTA) performance in complex multi-goal navigation tasks.
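Topological neighbor score propagation, in spirit, smooths retrieval scores over the map's connectivity so that physically adjacent nodes reinforce each other while isolated spurious matches fade. A minimal sketch with an assumed mixing weight and toy graph (not RAGNav's actual update rule):

```python
def propagate_scores(scores, adjacency, alpha=0.5, steps=2):
    """Smooth retrieval scores over a topological graph: each node mixes its
    own score with the mean score of its physical neighbors."""
    current = dict(scores)
    for _ in range(steps):
        updated = {}
        for node, s in current.items():
            nbrs = adjacency.get(node, [])
            nbr_mean = sum(current[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
            updated[node] = (1 - alpha) * s + alpha * nbr_mean
        current = updated
    return current

# Toy map: node "b" sits between high-scoring "a" and "c", so it gets boosted;
# the isolated node "d" (a hypothetical semantic-noise match) decays.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
scores = {"a": 1.0, "b": 0.0, "c": 1.0, "d": 0.9}
smoothed = propagate_scores(scores, adjacency)
```

This is the sense in which physical associations in the topological map "calibrate" semantically retrieved candidates.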

[AI-54] HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

TL;DR: This paper addresses poor generalization and robustness in human-robot collaboration (HRC) caused by the combinatorial diversity of human behaviors and contexts, and in particular the rationality gap (RG) between robots and humans: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The key of Heterogeneous-Agent Lyapunov Policy Optimization (HALyPO) is to establish formal stability directly in policy-parameter space: by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric and rectifying decentralized gradients via optimal quadratic projections, it ensures monotonic contraction of the rationality gap, enabling effective exploration of open-ended interaction spaces and markedly better generalization and robustness in collaborative scenarios.

Link: https://arxiv.org/abs/2603.03741
Authors: Hao Zhang,Yaru Niu,Yikai Wang,Ding Zhao,H. Eric Tseng
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process-a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
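The per-step Lyapunov decrease condition with a quadratic gradient correction can be sketched in first-order form: if the raw decentralized gradient does not contract the disagreement metric V fast enough, add the minimal-norm correction along V's gradient. This linearized toy version only illustrates the projection idea, not HALyPO's certified construction.

```python
import numpy as np

def lyapunov_project(grad, disagreement_grad, decrease_rate, disagreement):
    """Rectify a decentralized policy gradient so the update still shrinks the
    parameter-space disagreement V by at least decrease_rate * V per step.
    Assumes a first-order model in which progress along `disagreement_grad`
    reduces V proportionally."""
    required = decrease_rate * disagreement
    achieved = float(disagreement_grad @ grad)
    if achieved >= required:
        return grad                         # already contracting fast enough
    # Minimal-norm correction along disagreement_grad (a quadratic projection).
    scale = (required - achieved) / float(disagreement_grad @ disagreement_grad)
    return grad + scale * disagreement_grad

g = np.array([1.0, 0.0])                    # raw best-response gradient
d = np.array([0.0, 2.0])                    # direction that reduces disagreement
update = lyapunov_project(g, d, decrease_rate=0.1, disagreement=4.0)
```

The correction is the smallest change (in Euclidean norm) that restores the required decrease, which is why the paper frames it as an optimal quadratic projection.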

[AI-55] Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information ICLR2026

TL;DR: This paper addresses the privacy and security risk that deep models may illicitly learn from unauthorized sensitive training data, i.e., how to generate effective unlearnable examples that impede a malicious model's generalization. The key is to analyze and improve unlearnable examples from the perspective of mutual-information reduction: the paper proves that, under mild conditions, minimizing the conditional covariance of intra-class poisoned features reduces the mutual information between distributions, and accordingly proposes Mutual Information Unlearnable Examples (MI-UE), which reduces covariance by maximizing the cosine similarity among intra-class features, significantly improving unlearnability and remaining superior even under defense mechanisms.

Link: https://arxiv.org/abs/2603.03725
Authors: Yifan Zhu,Yibo Miao,Yinpeng Dong,Xiao-Shan Gao
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 32 pages, ICLR 2026

Click to view abstract

Abstract:The volume of freely scraped data on the Internet has driven the tremendous success of deep learning. Along with this comes the growing concern about data privacy and security. Numerous methods for generating unlearnable examples have been proposed to prevent data from being illicitly learned by unauthorized deep models by impeding generalization. However, the existing approaches primarily rely on empirical heuristics, making it challenging to enhance unlearnable examples with solid explanations. In this paper, we analyze and improve unlearnable examples from a novel perspective: mutual information reduction. We demonstrate that effective unlearnable examples always decrease mutual information between clean features and poisoned features, and when the network gets deeper, the unlearnability goes better together with lower mutual information. Further, we prove from a covariance reduction perspective that minimizing the conditional covariance of intra-class poisoned features reduces the mutual information between distributions. Based on the theoretical results, we propose a novel unlearnable method called Mutual Information Unlearnable Examples (MI-UE) that reduces covariance by maximizing the cosine similarity among intra-class features, thus impeding the generalization effectively. Extensive experiments demonstrate that our approach significantly outperforms the previous methods, even under defense mechanisms.
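The core objective of MI-UE, maximizing cosine similarity among intra-class features to shrink their covariance, can be written as a simple loss. The sketch below uses NumPy and synthetic features to show that tightly aligned intra-class features drive the loss toward -1; it is a generic illustration, not the paper's training objective in full.

```python
import numpy as np

def intra_class_cosine_loss(features, labels):
    """Negative mean pairwise cosine similarity within each class: minimizing it
    pulls same-class features together, shrinking their conditional covariance."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    total, pairs = 0.0, 0
    for c in np.unique(labels):
        f = normed[labels == c]
        sims = f @ f.T                 # pairwise cosine similarities
        n = len(f)
        total += sims.sum() - n        # drop self-similarity (diagonal of ones)
        pairs += n * (n - 1)
    return -total / pairs

rng = np.random.default_rng(0)
# Nearly identical intra-class features vs. scattered random ones.
aligned = np.tile(rng.normal(size=(1, 8)), (4, 1)) + 0.01 * rng.normal(size=(4, 8))
scattered = rng.normal(size=(4, 8))
labels = np.zeros(4, dtype=int)
loss_aligned = intra_class_cosine_loss(aligned, labels)
loss_scattered = intra_class_cosine_loss(scattered, labels)
```

In the paper, this similarity maximization is applied to poisoned features, which by the covariance argument lowers the mutual information between clean and poisoned feature distributions.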

[AI-56] Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning

TL;DR: This paper addresses robot planning in partially observable environments, where unknown or unseen objects require reasoning under uncertainty for task and motion planning (TAMP); during execution, naive planners typically ignore unexpectedly observed task-irrelevant objects, hurting efficiency. The key of CoCo-TAMP is a hierarchical state-estimation framework that shapes the belief over task-relevant objects with common-sense knowledge guided by large language models (LLMs): spatial priors (certain objects are more likely to be found in specific locations) and co-location priors (similar objects tend to co-occur, while dissimilar objects rarely do), substantially improving the efficiency of long-horizon planning. Experiments show that, compared with a baseline using neither type of common-sense knowledge, CoCo-TAMP reduces planning and execution time by 62.7% on average in simulation and by 72.6% in real-world demonstrations.

Link: https://arxiv.org/abs/2603.03704
Authors: Yoonwoo Kim,Raghav Arora,Roberto Martín-Martín,Peter Stone,Ben Abbatematteo,Yoonchang Sung
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7% in planning and execution time in simulation, and 72.6% in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.
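Shaping a belief over object locations with an LLM-provided prior reduces, in the simplest case, to a Bayesian update. The prior values and observation likelihoods below are hypothetical illustrations of the spatial common-sense prior, not CoCo-TAMP's estimator:

```python
def update_belief(prior, likelihoods):
    """Bayesian belief over an object's location, starting from an LLM-provided
    common-sense prior and folding in observation likelihoods per location."""
    posterior = {loc: prior[loc] * likelihoods.get(loc, 1.0) for loc in prior}
    z = sum(posterior.values())
    return {loc: p / z for loc, p in posterior.items()}

# Hypothetical LLM prior: a mug is most likely in the kitchen.
prior = {"kitchen": 0.6, "office": 0.3, "bathroom": 0.1}
# The robot looked in the kitchen and saw nothing: low likelihood there.
belief = update_belief(prior, {"kitchen": 0.1})
```

After the failed kitchen observation, the belief mass shifts to the next-most-plausible location, which is exactly the behavior that shortens long-horizon search.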

[AI-57] AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment

TL;DR: This paper addresses the difficulty of exploring high-dimensional combinatorial spaces in automated chemical formulation design, in particular the context-window limits of LLM long-horizon reasoning and the mode collapse induced by path-dependent exploration. The key contributions of the closed-loop neuro-symbolic framework AI4S-SDS are: (1) Sparse State Storage with Dynamic Path Reconstruction, which decouples reasoning history from context length and enables arbitrarily deep exploration under a fixed token budget; (2) a global-local search strategy, in which a memory-driven planning module adaptively reconfigures the search root and a sibling-aware expansion mechanism promotes node-level orthogonal exploration, improving coverage; and (3) a differentiable physics engine that jointly optimizes continuous mixing ratios via a hybrid normalized loss with sparsity regularization, unifying symbolic reasoning with physical feasibility under thermodynamic constraints.

Link: https://arxiv.org/abs/2603.03686
Authors: Jiangyu Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Automated design of chemical formulations is a cornerstone of materials science, yet it requires navigating a high-dimensional combinatorial space involving discrete compositional choices and continuous geometric constraints. Existing Large Language Model (LLM) agents face significant challenges in this setting, including context window limitations during long-horizon reasoning and path-dependent exploration that may lead to mode collapse. To address these issues, we introduce AI4S-SDS, a closed-loop neuro-symbolic framework that integrates multi-agent collaboration with a tailored Monte Carlo Tree Search (MCTS) engine. We propose a Sparse State Storage mechanism with Dynamic Path Reconstruction, which decouples reasoning history from context length and enables arbitrarily deep exploration under fixed token budgets. To reduce local convergence and improve coverage, we implement a Global–Local Search Strategy: a memory-driven planning module adaptively reconfigures the search root based on historical feedback, while a Sibling-Aware Expansion mechanism promotes orthogonal exploration at the node level. Furthermore, we bridge symbolic reasoning and physical feasibility through a Differentiable Physics Engine, employing a hybrid normalized loss with sparsity-inducing regularization to optimize continuous mixing ratios under thermodynamic constraints. Empirical results show that AI4S-SDS achieves full validity under the adopted HSP-based physical constraints and substantially improves exploration diversity compared to baseline agents. In preliminary lithography experiments, the framework identifies a novel photoresist developer formulation that demonstrates competitive or superior performance relative to a commercial benchmark, highlighting the potential of diversity-driven neuro-symbolic search for scientific discovery.

[AI-58] MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation

【Quick Read】: This paper addresses the lack of long-term adaptation in large language model (LLM) agents in non-stationary environments: even with feedback, traditional approaches such as in-context learning and external memory struggle to internalize strategy improvement. The key to the solution is MAGE, a meta-reinforcement learning (meta-RL) framework that uses a multi-episode training regime to integrate interaction histories and reflections into the context window and takes the final-episode reward as the optimization objective, incentivizing the agent to refine its strategy from past experience. Combined with population-based training and an agent-specific advantage normalization technique, MAGE enriches agent diversity and stabilizes learning, equipping LLM agents with strategic exploration and exploitation abilities and strong generalization to unseen opponents.

Link: https://arxiv.org/abs/2603.03680
Authors: Lu Yang,Zelai Xu,Minyang Xie,Jiaxuan Gao,Zhao Shok,Yu Wang,Yi Wu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While In-Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long-term improvement. Meta-Reinforcement Learning (meta-RL) provides an alternative by embedding the learning process directly within the model. However, existing meta-RL approaches for LLMs focus primarily on exploration in single-agent settings, neglecting the strategic exploitation necessary for multi-agent environments. We propose MAGE, a meta-RL framework that empowers LLM agents for strategic exploration and exploitation. MAGE utilizes a multi-episode training regime where interaction histories and reflections are integrated into the context window. By using the final episode reward as the objective, MAGE incentivizes the agent to refine its strategy based on past experiences. We further combine population-based training with an agent-specific advantage normalization technique to enrich agent diversity and ensure stable learning. Experiment results show that MAGE outperforms existing baselines in both exploration and exploitation tasks. Furthermore, MAGE exhibits strong generalization to unseen opponents, suggesting it has internalized the ability for strategic exploration and exploitation. Code is available at this https URL.

[AI-59] Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation

【Quick Read】: This paper addresses the computational complexity of Shapley-value data valuation, which is #P-hard due to the exponential coalition space, and the structural property of modern predictors that is usually ignored: for a given test instance, only a small subset of training points influences the prediction. The key to the solution is exploiting model-induced locality: support sets determined by the model's computational pathway (e.g., neighbors in KNN, leaves in trees, receptive fields in GNNs) allow Shapley computation to be recast as a structured problem over support-induced subsets rather than exhaustive coalition enumeration. The authors further prove that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets and, guided by this, propose LSMR (Local Shapley via Model Reuse), which trains each influential subset exactly once via support mapping and pivot scheduling to achieve optimal efficiency. For larger support sets they develop LSMR-A, a reuse-aware unbiased Monte Carlo estimator whose runtime depends on the number of distinct sampled subsets rather than total draws, substantially reducing retraining cost while preserving high valuation fidelity.

Link: https://arxiv.org/abs/2603.03672
Authors: Xuan Yang,Hsi-Wen Chen,Ming-Syan Chen,Jian Pei
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Computer Science and Game Theory (cs.GT)
Comments:

Click to view abstract

Abstract:The Shapley value provides a principled foundation for data valuation, but exact computation is #P-hard due to the exponential coalition space. Existing accelerations remain global and ignore a structural property of modern predictors: for a given test instance, only a small subset of training points influences the prediction. We formalize this model-induced locality through support sets defined by the model’s computational pathway (e.g., neighbors in KNN, leaves in trees, receptive fields in GNNs), showing that Shapley computation can be projected onto these supports without loss when locality is exact. This reframes Shapley evaluation as a structured data processing problem over overlapping support-induced subset families rather than exhaustive coalition enumeration. We prove that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets, establishing an information-theoretic lower bound on retraining operations. Guided by this result, we propose LSMR (Local Shapley via Model Reuse), an optimal subset-centric algorithm that trains each influential subset exactly once via support mapping and pivot scheduling. For larger supports, we develop LSMR-A, a reuse-aware Monte Carlo estimator that remains unbiased with exponential concentration, with runtime determined by the number of distinct sampled subsets rather than total draws. Experiments across multiple model families demonstrate substantial retraining reductions and speedups while preserving high valuation fidelity.
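The locality idea can be illustrated with exact Shapley computation restricted to a small support set: once only three points influence a prediction, enumerating coalitions of those three suffices. The toy utility function below is a stand-in for retraining-and-scoring the model on each coalition; it is not the paper's LSMR algorithm.

```python
# Exact Shapley values over a small support set by coalition enumeration.
from itertools import combinations
from math import factorial

def exact_shapley(support, utility):
    """Classic Shapley formula, feasible because |support| is small."""
    n = len(support)
    values = {p: 0.0 for p in support}
    for p in support:
        others = [q for q in support if q != p]
        for k in range(n):
            for coal in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                values[p] += w * (utility(set(coal) | {p}) - utility(set(coal)))
    return values

# Toy utility: score grows sublinearly with the coalition's total "weight".
weights = {"a": 3.0, "b": 1.0, "c": 1.0}
utility = lambda S: sum(weights[p] for p in S) ** 0.5
phi = exact_shapley(["a", "b", "c"], utility)
```

The usual Shapley axioms are easy to check on the output: efficiency (values sum to the full-coalition utility) and symmetry (identical points b and c get equal value).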

[AI-60] Graph Negative Feedback Bias Correction Framework for Adaptive Heterophily Modeling

【Quick Read】: This paper addresses the performance degradation of graph neural networks (GNNs) on heterophilic graphs, which stems from conventional GNNs' reliance on the homophily assumption: predictions are biased by label autocorrelation. The key to the solution is Graph Negative Feedback Bias Correction (GNFBC), a framework independent of any specific aggregation strategy. Its core mechanism is a negative feedback loss that penalizes the sensitivity of predictions to label autocorrelation, together with the use of graph-agnostic model outputs as a feedback term: guided by Dirichlet energy, it leverages independent node-feature information to counteract correlation-induced bias and improve GNN generalization on heterophilic graphs.

Link: https://arxiv.org/abs/2603.03662
Authors: Jiaqi Lv,Qingfeng Du,Yu Zhang,Yongqi Han,Sheng Li
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have emerged as a powerful framework for processing graph-structured data. However, conventional GNNs and their variants are inherently limited by the homophily assumption, leading to degradation in performance on heterophilic graphs. Although substantial efforts have been made to mitigate this issue, they remain constrained by the message-passing paradigm, which is inherently rooted in homophily. In this paper, a detailed analysis of how the underlying label autocorrelation of the homophily assumption introduces bias into GNNs is presented. We innovatively leverage a negative feedback mechanism to correct the bias and propose Graph Negative Feedback Bias Correction (GNFBC), a simple yet effective framework that is independent of any specific aggregation strategy. Specifically, we introduce a negative feedback loss that penalizes the sensitivity of predictions to label autocorrelation. Furthermore, we incorporate the output of graph-agnostic models as a feedback term, leveraging independent node feature information to counteract correlation-induced bias guided by Dirichlet energy. GNFBC can be seamlessly integrated into existing GNN architectures, improving overall performance with comparable computational and memory overhead.

[AI-61] Mozi: Governed Autonomy for Drug Discovery LLM Agents

【Quick Read】: This paper addresses two key challenges in deploying tool-augmented large language models (LLMs) in high-stakes domains such as drug discovery: the lack of effective governance over tool use, which leads autonomous agents to produce irreproducible decision trajectories in dependency-heavy pipelines, and poor long-horizon reliability, where early hallucinations compound along the execution chain into downstream failures. The core of the solution is Mozi's dual-layer architecture: the first layer (Control Plane) enforces a governed supervisor-worker hierarchy via role isolation, constrained action spaces, and reflection-based replanning; the second layer (Workflow Plane) models drug-discovery pipelines as stateful, composable skill graphs with strict data contracts and strategic human-in-the-loop (HITL) checkpoints to safeguard scientific validity. Following the principle of "free-form reasoning for safe tasks, structured execution for long-horizon pipelines," the design preserves flexibility while ensuring end-to-end robustness and auditability, suppressing error accumulation and turning the LLM from a fragile conversationalist into a reliable, governed research collaborator.

Link: https://arxiv.org/abs/2603.03655
Authors: He Cao,Siyu Liu,Fan Zhang,Zijing Liu,Hao Li,Bin Feng,Shengyuan Bai,Leqing Chen,Kai Xie,Yu Li
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Tool-augmented large language model (LLM) agents promise to unify scientific reasoning with computation, yet their deployment in high-stakes domains like drug discovery is bottlenecked by two critical barriers: unconstrained tool-use governance and poor long-horizon reliability. In dependency-heavy pharmaceutical pipelines, autonomous agents often drift into irreproducible trajectories, where early-stage hallucinations multiplicatively compound into downstream failures. To overcome this, we present Mozi, a dual-layer architecture that bridges the flexibility of generative AI with the deterministic rigor of computational biology. Layer A (Control Plane) establishes a governed supervisor-worker hierarchy that enforces role-based tool isolation, limits execution to constrained action spaces, and drives reflection-based replanning. Layer B (Workflow Plane) operationalizes canonical drug discovery stages, from Target Identification to Lead Optimization, as stateful, composable skill graphs. This layer integrates strict data contracts and strategic human-in-the-loop (HITL) checkpoints to safeguard scientific validity at high-uncertainty decision boundaries. Operating on the design principle of "free-form reasoning for safe tasks, structured execution for long-horizon pipelines," Mozi provides built-in robustness mechanisms and trace-level auditability to mitigate error accumulation. We evaluate Mozi on PharmaBench, a curated benchmark for biomedical agents, demonstrating superior orchestration accuracy over existing baselines. Furthermore, through end-to-end therapeutic case studies, we demonstrate Mozi's ability to navigate massive chemical spaces, enforce stringent toxicity filters, and generate highly competitive in silico candidates, effectively transforming the LLM from a fragile conversationalist into a reliable, governed co-scientist.

[AI-62] Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study ACSAC

【Quick Read】: This paper addresses the new security challenges that large language models (LLMs) introduce into critical systems such as healthcare, in particular the difficulty of quantifying and managing risks from potential cyber kill chains that combine adversarial-model attacks, prompt injection, and conventional cyber attacks. The key to the solution is a structured, goal-driven risk assessment approach that contextualizes threats with detailed attack vectors, preconditions, and attack paths via attack trees, enabling actionable risk prioritization for LLM-integrated systems and advancing secure-by-design practice for LLM-based systems.

Link: https://arxiv.org/abs/2603.03633
Authors: Neha Nagaraja,Hayretdin Bahsi
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To appear in the HealthSec Workshop at the 2025 IEEE Annual Computer Security Applications Conference (ACSAC)

Click to view abstract

Abstract:While incorporating LLMs into systems offers significant benefits in critical application areas such as healthcare, new security challenges emerge due to the potential cyber kill chain cycles that combine adversarial model, prompt injection and conventional cyber attacks. Threat modeling methods enable the system designers to identify potential cyber threats and the relevant mitigations during the early stages of development. Although the cyber security community has extensive experience in applying these methods to software-based systems, the elicited threats are usually abstract and vague, limiting their effectiveness for conducting proper likelihood and impact assessments for risk prioritization, especially in complex systems with novel attacks surfaces, such as those involving LLMs. In this study, we propose a structured, goal driven risk assessment approach that contextualizes the threats with detailed attack vectors, preconditions, and attack paths through the use of attack trees. We demonstrate the proposed approach on a case study with an LLM agent-based healthcare system. This study harmonizes the state-of-the-art attacks to LLMs with conventional ones and presents possible attack paths applicable to similar systems. By providing a structured risk assessment, this study makes a significant contribution to the literature and advances the secure-by-design practices in LLM-based systems.

[AI-63] Role-Aware Conditional Inference for Spatiotemporal Ecosystem Carbon Flux Prediction

【Quick Read】: This paper addresses poor model generalization in predicting terrestrial ecosystem carbon fluxes (e.g., CO2, GPP, and CH4) caused by spatiotemporal heterogeneity. Existing learning-based methods treat environmental covariates as a homogeneous input space, implicitly assuming a global response function, which makes them brittle under the complex conditions of diverse ecosystems. The key to the solution is Role-Aware Conditional Inference (RACI), a process-informed conditional inference framework that separates slowly varying regime conditioners from fast dynamic drivers via hierarchical temporal encoding, and introduces role-aware spatial retrieval that supplies functionally similar and geographically local context for each functional role. By explicitly modeling distinct functional roles, the model adapts its predictions without training separate local models or relying on fixed spatial structures, substantially improving accuracy and spatial generalization in long-term prediction across ecosystem types, carbon fluxes, and data sources.

Link: https://arxiv.org/abs/2603.03531
Authors: Yiming Sun,Runlong Yu,Rongchao Dong,Shuo Chen,Licheng Liu,Youmi Oh,Qianlai Zhuang,Yiqun Xie,Xiaowei Jia
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate prediction of terrestrial ecosystem carbon fluxes (e.g., CO_2, GPP, and CH_4) is essential for understanding the global carbon cycle and managing its impacts. However, prediction remains challenging due to strong spatiotemporal heterogeneity: ecosystem flux responses are constrained by slowly varying regime conditions, while short-term fluctuations are driven by high-frequency dynamic forcings. Most existing learning-based approaches treat environmental covariates as a homogeneous input space, implicitly assuming a global response function, which leads to brittle generalization across heterogeneous ecosystems. In this work, we propose Role-Aware Conditional Inference (RACI), a process-informed learning framework that formulates ecosystem flux prediction as a conditional inference problem. RACI employs hierarchical temporal encoding to disentangle slow regime conditioners from fast dynamic drivers, and incorporates role-aware spatial retrieval that supplies functionally similar and geographically local context for each role. By explicitly modeling these distinct functional roles, RACI enables a model to adapt its predictions across diverse environmental regimes without training separate local models or relying on fixed spatial structures. We evaluate RACI across multiple ecosystem types (wetlands and agricultural systems), carbon fluxes (CO_2, GPP, CH_4), and data sources, including both process-based simulations and observational measurements. Across all settings, RACI consistently outperforms competitive spatiotemporal baselines, demonstrating improved accuracy and spatial generalization under pronounced environmental heterogeneity.

[AI-64] Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning

【Quick Read】: This paper seeks to explain the differences in few-shot transfer performance of self-supervised learning (SSL) representations and the interference between tasks in multitask learning, i.e., why some SSL representations achieve strong few-shot generalization while also supporting many tasks with little mutual interference. The key to the solution is a single geometric quantity, the directional CDNV (decision-axis variance), which measures feature variability along class-separating directions. The analysis shows that when this variance is small, downstream classification enjoys low generalization error (established via sharp non-asymptotic multiclass generalization bounds), and the decision axes of different tasks become nearly orthogonal, reducing multitask interference. Experiments show that directional CDNV collapses during pretraining even when the classical CDNV remains large, and its trajectory closely tracks actual few-shot error, supporting its use as a criterion for SSL representation quality.

Link: https://arxiv.org/abs/2603.03530
Authors: Achleshwar Luthra,Yash Salunkhe,Tomer Galanti
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Frozen self-supervised representations often transfer well with only a few labels across many semantic tasks. We argue that a single geometric quantity, \emph{directional CDNV} (decision-axis variance), sits at the core of two favorable behaviors: strong few-shot transfer within a task, and low interference across many tasks. We show that both emerge when variability \emph{along} class-separating directions is small. First, we prove sharp non-asymptotic multiclass generalization bounds for downstream classification whose leading term is the directional CDNV. The bounds include finite-shot corrections that cleanly separate intrinsic decision-axis variability from centroid-estimation error. Second, we link decision-axis collapse to multitask geometry: for independent balanced labelings, small directional CDNV across tasks forces the corresponding decision axes to be nearly orthogonal, helping a single representation support many tasks with minimal interference. Empirically, across SSL objectives, directional CDNV collapses during pretraining even when classical CDNV remains large, and our bounds closely track few-shot error at practical shot sizes. Additionally, on synthetic multitask data, we verify that SSL learns representations whose induced decision axes are nearly orthogonal. The code and project page of the paper are available at this https URL.
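As an illustrative reading (not necessarily the paper's exact definition), the directional quantity can be sketched as within-class variance projected onto the unit vector between class means, normalized by the squared distance between the means; a classical CDNV analogue uses the total variance instead. The synthetic data below makes the distinction concrete: variance is tiny along the class axis but large orthogonal to it.

```python
import numpy as np

def directional_cdnv(X1, X2):
    """Within-class variance along the class-separating axis, normalized."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    axis = (mu1 - mu2) / np.linalg.norm(mu1 - mu2)  # unit decision axis
    var_along = 0.5 * (np.var(X1 @ axis) + np.var(X2 @ axis))
    return var_along / np.sum((mu1 - mu2) ** 2)

def classical_cdnv(X1, X2):
    """Isotropic analogue: total within-class variance, normalized."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    var_total = 0.5 * (np.var(X1, axis=0).sum() + np.var(X2, axis=0).sum())
    return var_total / np.sum((mu1 - mu2) ** 2)

rng = np.random.default_rng(0)
# Tiny variance along the class axis (dim 0), large variance orthogonal to it:
# classical CDNV stays large while directional CDNV is nearly collapsed.
scale = np.array([0.01, 1.0])
X1 = np.array([1.0, 0.0]) + rng.normal(size=(1000, 2)) * scale
X2 = np.array([-1.0, 0.0]) + rng.normal(size=(1000, 2)) * scale
d_cdnv = directional_cdnv(X1, X2)
c_cdnv = classical_cdnv(X1, X2)
```

This mirrors the paper's empirical observation: the directional quantity can collapse even when the classical one remains large.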

[AI-65] mlx-snn: Spiking Neural Networks on Apple Silicon via MLX

【Quick Read】: This paper addresses the lack of native Apple Silicon support in current spiking neural network (SNN) research: mainstream SNN libraries (snnTorch, Norse, SpikingJelly, and Lava) target PyTorch or custom backends and cannot exploit Apple Silicon's unified memory architecture and efficient computation, making SNN research on Apple hardware inefficient. The key to the solution is mlx-snn, an SNN library built entirely on Apple's MLX framework, which leverages MLX's unified memory management, lazy evaluation, and composable function transforms to support multiple neuron models (e.g., LIF, Izhikevich), surrogate gradient functions, and spike encoding methods, with a complete backpropagation-through-time training pipeline. It achieves 2.0-2.5x faster training with 3-10x lower GPU memory usage, reaching up to 97.28% accuracy on MNIST classification.

Link: https://arxiv.org/abs/2603.03529
Authors: Jiahao Qin
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 11 pages, 3 figures

Click to view abstract

Abstract:We introduce mlx-snn, the first spiking neural network (SNN) library built natively on Apple’s MLX framework. As SNN research grows rapidly, all major libraries – snnTorch, Norse, SpikingJelly, Lava – target PyTorch or custom backends, leaving Apple Silicon users without a native option. mlx-snn provides six neuron models (LIF, IF, Izhikevich, Adaptive LIF, Synaptic, Alpha), four surrogate gradient functions, four spike encoding methods (including an EEG-specific encoder), and a complete backpropagation-through-time training pipeline. The library leverages MLX’s unified memory architecture, lazy evaluation, and composable function transforms (this http URL, this http URL) to enable efficient SNN research on Apple Silicon hardware. We validate mlx-snn on MNIST digit classification across five hyperparameter configurations and three backends, achieving up to 97.28% accuracy with 2.0–2.5 times faster training and 3–10 times lower GPU memory than snnTorch on the same M3 Max hardware. mlx-snn is open-source under the MIT license and available on PyPI. this https URL
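The leaky integrate-and-fire (LIF) dynamics at the core of such libraries can be sketched in a few lines. This is a generic NumPy illustration of the textbook update (decay, integrate, spike, reset-by-subtraction), not mlx-snn's actual MLX-based API; in training, the non-differentiable spike function's derivative would be replaced by a surrogate gradient, as the library's surrogate functions do.

```python
import numpy as np

def lif_step(v, x, beta=0.9, threshold=1.0):
    """One LIF step: leak the membrane, integrate input, spike, soft-reset."""
    v = beta * v + x                      # leaky integration
    spike = (v >= threshold).astype(float)  # Heaviside (surrogate grad in training)
    v = v - spike * threshold             # reset by subtraction
    return v, spike

# Simulate 3 neurons for 5 timesteps under constant input currents.
v = np.zeros(3)
spikes = []
for t in range(5):
    v, s = lif_step(v, np.array([0.3, 0.6, 1.2]))
    spikes.append(s)
spikes = np.array(spikes)  # shape (timesteps, neurons)
```

Stronger input currents produce higher firing rates, which is the rate-coding behavior spike encoders exploit.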

[AI-66] Test-Time Meta-Adaptation with Self-Synthesis ICLR2026

【Quick Read】: This paper addresses the lack of inference-time self-adaptation and self-improvement in large language models (LLMs) when facing diverse downstream tasks. To achieve efficient test-time adaptation, the authors propose MASS, a meta-learning framework in which the LLM generates problem-specific synthetic training data and performs targeted, goal-directed self-updates on it. The key innovation is a bilevel optimization strategy: the inner loop adapts parameters on self-generated data, while the outer loop backpropagates scalable meta-gradients, using post-update downstream task performance as the reward signal to optimize the data-generation process, training the adaptive capability end-to-end. Experiments on mathematical reasoning show that MASS learns to synthesize per-instance curricula, yielding efficient and effective test-time adaptation.

Link: https://arxiv.org/abs/2603.03524
Authors: Zeyneb N. Kaya,Nick Rui
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, 1 table. Accepted to ICLR 2026 LIT and Data-FM Workshops

Click to view abstract

Abstract:As strong general reasoners, large language models (LLMs) encounter diverse domains and tasks, where the ability to adapt and self-improve at test time is valuable. We introduce MASS, a meta-learning framework that enables LLMs to self-adapt by generating problem-specific synthetic training data and performing targeted self-updates optimized for downstream performance at inference time. We train this behavior end-to-end via bilevel optimization: an inner loop adapts on self-generated examples while an outer loop meta-learns data-attribution signals and rewards post-update task performance. The synthetic data is optimized with scalable meta-gradients, backpropagating the downstream loss through the inner updates to reward useful generations. Experiments on mathematical reasoning show that MASS learns to synthesize per-instance curricula that yield effective, data-efficient test-time adaptation.

[AI-67] The Controllability Trap: A Governance Framework for Military AI Agents ICLR2026

【Quick Read】: This paper addresses novel control failures raised by agentic AI systems in military settings, arising from capabilities such as goal interpretation, world modeling, planning, tool use, long-horizon operation, and autonomous coordination, which existing safety frameworks do not cover. The key to the solution is the Agentic Military AI Governance Framework (AMAGF), built on three pillars: preventive governance (reducing failure likelihood), detective governance (real-time detection of control degradation), and corrective governance (restoring or safely degrading operations). Its core mechanism, the Control Quality Score (CQS), quantifies the level of human control in real time and triggers graduated responses accordingly, shifting control from a binary concept to a model that is continuously measured and managed.

Link: https://arxiv.org/abs/2603.03515
Authors: Subramanyam Sahoo
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026 Workshop on Agents in the Wild. 20 pages, 3 figures

Click to view abstract

Abstract:Agentic AI systems - capable of goal interpretation, world modeling, planning, tool use, long-horizon operation, and autonomous coordination - introduce distinct control failures not addressed by existing safety frameworks. We identify six agentic governance failures tied to these capabilities and show how they erode meaningful human control in military settings. We propose the Agentic Military AI Governance Framework (AMAGF), a measurable architecture structured around three pillars: Preventive Governance (reducing failure likelihood), Detective Governance (real-time detection of control degradation), and Corrective Governance (restoring or safely degrading operations). Its core mechanism, the Control Quality Score (CQS), is a composite real-time metric quantifying human control and enabling graduated responses as control weakens. For each failure type, we define concrete mechanisms, assign responsibilities across five institutional actors, and formalize evaluation metrics. A worked operational scenario illustrates implementation, and we situate the framework within established agent safety literature. We argue that governance must move from a binary conception of control to a continuous model in which control quality is actively measured and managed throughout the operational lifecycle.
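A composite score with graduated responses, as the CQS describes, can be sketched as a weighted sum over control indicators mapped to tiered actions. The indicator names, weights, and thresholds below are illustrative inventions for exposition, not AMAGF's actual definitions.

```python
# Hypothetical Control Quality Score: weighted composite of control
# indicators in [0, 1], with graduated responses as control degrades.

def control_quality_score(indicators, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(weights[k] * indicators[k] for k in weights)

def graduated_response(cqs):
    if cqs >= 0.8:
        return "normal operation"
    if cqs >= 0.5:
        return "heightened monitoring"
    if cqs >= 0.3:
        return "require human confirmation"
    return "safe degradation / halt"

weights = {"operator_oversight": 0.4, "plan_transparency": 0.3,
           "override_latency": 0.3}
score = control_quality_score(
    {"operator_oversight": 0.9, "plan_transparency": 0.6,
     "override_latency": 0.4},
    weights,
)
```

The point of the continuous score is that responses scale with how far control has degraded, rather than flipping between "in control" and "out of control".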

[AI-68] Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks

【Quick Read】: This paper addresses the time pressure teachers face in mathematics curriculum design, specifically whether AI tools can accurately identify the cognitive demand level of mathematical tasks to support individualized instruction while preserving cognitive rigor. The study evaluates eleven AI tools (six general-purpose and five education-specific) on classifying mathematics tasks into four levels of cognitive demand, finding an average accuracy of only 63% and a systematic bias in all tools toward middle categories (procedures with/without connections) for tasks at the extremes of cognitive demand (memorization and doing mathematics). Further error analysis shows that misclassifications stem mainly from over-reliance on surface textual features rather than underlying cognitive processes, not from ignoring relevant dimensions; this points to improved prompt engineering and stronger reasoning over multi-dimensional task features as directions for making AI tools more useful and reliable in teacher planning workflows.

Link: https://arxiv.org/abs/2603.03512
Authors: Danielle S. Fox,Brenda L. Robles,Elizabeth DiPietro Brovey,Christian D. Schunn
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 30 pages, 2 tables, 5 appendices

Click to view abstract

Abstract:Teachers face increasing demands on their time, particularly in adapting mathematics curricula to meet individual student needs while maintaining cognitive rigor. This study evaluates whether AI tools can accurately classify the cognitive demand of mathematical tasks, which is important for creating or adapting tasks that support student learning. We tested eleven AI tools: six general-purpose (ChatGPT, Claude, DeepSeek, Gemini, Grok, Perplexity) and five education-specific (Brisk, Coteach AI, Khanmigo, Magic School, this http URL), on their ability to categorize mathematics tasks across four levels of cognitive demand using a research-based framework. The goal was to approximate the performance teachers will achieve with straightforward prompts. On average, AI tools accurately classified cognitive demand in only 63% of cases. Education-specific tools were not more accurate than general-purpose tools, and no tool exceeded 83% accuracy. All tools struggled with tasks at the extremes of cognitive demand (Memorization and Doing Mathematics), exhibiting a systematic bias toward middle-category levels (Procedures with/without Connections). The tools often gave plausible-sounding explanations likely to be persuasive to novice teachers. Error analysis of AI tools’ misclassification of the broad level of cognitive demand (high vs. low) revealed that tools consistently overweighted surface textual features over underlying cognitive processes. Further, AI tools showed weaknesses in reasoning about factors that make tasks higher vs. lower cognitive demand. Errors stemmed not from ignoring relevant dimensions, but from incorrectly reasoning about multiple task aspects. These findings carry implications for AI integration into teacher planning workflows and highlight the need for improved prompt engineering and tool development for educational applications.

[AI-69] Optimal trajectory-guided stochastic co-optimization for e-fuel system design and real-time operation

【Quick Read】: This paper addresses the difficulty of jointly optimizing the design and operation of e-fuel production systems under renewable uncertainty, where the vast combinatorial design-operation space makes traditional mathematical programming impractical. The key to the solution is MasCOR, a machine-learning-assisted co-optimization framework that learns from global operational trajectories: by encoding system design and renewable trends into a single agent, it generalizes dynamic operation across configurations and scenarios, greatly simplifying design-operation co-optimization under uncertainty. The approach achieves near-optimal performance relative to state-of-the-art reinforcement learning baselines at far lower computational cost than mathematical programming, enabling rapid parallel evaluation of designs within the co-optimization loop.

Link: https://arxiv.org/abs/2603.03484
Authors: Jeongdong Kim,Minsu Kim,Jonggeol Na,Junghwan Kim
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 6 figures. Supplementary Information included

Click to view abstract

Abstract:E-fuels are promising long-term energy carriers supporting the net-zero transition. However, the large combinatorial design-operation spaces under renewable uncertainty make the use of mathematical programming impractical for co-optimizing e-fuel production systems. Here, we present MasCOR, a machine-learning-assisted co-optimization framework that learns from global operational trajectories. By encoding system design and renewable trends, a single MasCOR agent generalizes dynamic operation across diverse configurations and scenarios, substantially simplifying design-operation co-optimization under uncertainty. Benchmark comparisons against state-of-the-art reinforcement learning baselines demonstrate near-optimal performance, while computational costs are substantially lower than those of mathematical programming, enabling rapid parallel evaluation of designs within the co-optimization loop. This framework enables rapid screening of feasible design spaces together with corresponding operational policies. When applied to four potential European sites targeting e-methanol production, MasCOR shows that most locations benefit from reducing system load below 50 MW to achieve carbon-neutral methanol production, with production costs of 1.0-1.2 USD per kg. In contrast, Dunkirk (France), with limited renewable availability and high grid prices, favors system loads above 200 MW and expanded storage to exploit dynamic grid exchange and hydrogen sales to the market. These results underscore the value of the MasCOR framework for site-specific guidance from system design to real-time operation.

[AI-70] Parallel Test-Time Scaling with Multi-Sequence Verifiers

【Quick Read】: This paper addresses two bottlenecks of parallel test-time scaling: accurately selecting the correct answer from a pool of candidate solutions, and the high inference latency of generating many full solutions. The core of the solution is the Multi-Sequence Verifier (MSV), the first verifier designed to jointly process all candidate solutions and model their interactions, yielding better verifier calibration. By exploiting contextual information across candidates, MSV substantially improves best-of-N selection; a streaming MSV variant further enables a novel early-stopping framework that reaches the same target accuracy with roughly half the latency required by verifiers that score each solution in isolation.

Link: https://arxiv.org/abs/2603.03417
Authors: Yegon Kim,Seungyoo Lee,Chaeyun Jang,Hyungi Lee,Juho Lee
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Parallel test-time scaling, which generates multiple candidate solutions for a single problem, is a powerful technique for improving large language model performance. However, it is hindered by two key bottlenecks: accurately selecting the correct solution from the candidate pool, and the high inference latency from generating many full solutions. We argue that both challenges are fundamentally linked to verifier calibration. A well-calibrated verifier not only improves answer selection, but also enables early-stopping strategies to reduce latency. However, existing verifiers are limited as they score each candidate in isolation, overlooking rich contextual information across the set of candidates. To address this, we introduce the Multi-Sequence Verifier (MSV), the first verifier designed to jointly process all candidate solutions and model their interactions. MSV achieves improved calibration, which directly enhances best-of-N selection performance. We further introduce a streaming MSV variant that empowers a novel early-stopping framework. Our novel framework fully leverages parallel decoding, which contrasts with the existing multi-sequence early exit works that decode sequences one by one and thus incur significant latency. In this novel setting, MSV can achieve the same target accuracy with around half the latency that would be required with its counterpart that scores each solution in isolation.
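The interplay between calibrated verifier scores, best-of-N selection, and early stopping can be sketched as follows. The scores below are made-up placeholders; MSV itself is a learned model that produces such scores jointly over all candidates, which this sketch does not attempt to reproduce.

```python
# Best-of-N selection with early stopping: consume verifier scores batch by
# batch and stop generating once a candidate's calibrated probability of
# correctness clears a confidence threshold.

def best_of_n(scores):
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])

def early_stop_select(score_stream, threshold=0.9, max_n=16):
    """Return (selected index, number of candidates actually generated)."""
    seen = []
    for batch in score_stream:
        seen.extend(batch)
        if max(seen) >= threshold or len(seen) >= max_n:
            break
    return best_of_n(seen), len(seen)

# Hypothetical calibrated scores arriving in batches of two candidates.
stream = [[0.42, 0.55], [0.31, 0.93], [0.88, 0.97]]
idx, n_generated = early_stop_select(stream)
```

With well-calibrated scores the threshold is trustworthy, so stopping after the second batch sacrifices little accuracy; with a miscalibrated verifier the same rule would stop too early or too late, which is why the paper ties both bottlenecks to calibration.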

[AI-71] PRIVATEEDIT: A Privacy-Preserving Pipeline for Face-Centric Generative Image Editing

【Quick Read】: This paper addresses the biometric privacy risks raised by generative image editing in face-centric applications, such as biometric data misuse and missing user consent when users upload high-fidelity facial images to third-party models. The key to the solution is PRIVATEEDIT, a privacy-preserving local processing pipeline whose core is on-device segmentation and masking that separates identity-sensitive regions (such as faces) from editable content, enabling high-quality editing without exposing or transmitting biometric data. The approach requires no modification or retraining of third-party generative models, is compatible with commercial APIs, and introduces a tunable masking mechanism that lets users dynamically balance privacy protection and output fidelity according to trust level and use case.

Link: https://arxiv.org/abs/2603.03412
Authors: Dipesh Tamboli,Vineet Punyamoorty,Atharv Pawar,Vaneet Aggarwal
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE Transactions on Artificial Intelligence, Feb 2026

Click to view abstract

Abstract:Recent advances in generative image editing have enabled transformative applications, from professional head shot generation to avatar stylization. However, these systems often require uploading high-fidelity facial images to third-party models, raising concerns around biometric privacy, data misuse, and user consent. We propose a privacy-preserving pipeline that supports high-quality editing while keeping users in control over their biometric data in face-centric use cases. Our approach separates identity-sensitive regions from editable image context using on-device segmentation and masking, enabling secure, user-controlled editing without modifying third-party generative models. Unlike traditional cloud-based tools, PRIVATEEDIT enforces privacy by default: biometric data is never exposed or transmitted. This design requires no access to or retraining of third-party models, making it compatible with a wide range of commercial APIs. By treating privacy as a core design constraint, our system supports responsible generative AI centered on user autonomy and trust. The pipeline includes a tunable masking mechanism that lets users control how much facial information is concealed, allowing them to balance privacy and output fidelity based on trust level or use case. We demonstrate its applicability in professional and creative workflows and provide a user interface for selective anonymization. By advocating privacy-by-design in generative AI, our work offers both technical feasibility and normative guidance for protecting digital identity. The source code is available at this https URL.

[AI-72] On Google's SynthID-Text LLM Watermarking System: Theoretical Analysis and Empirical Validation

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成文本的可追溯性问题,即如何在不影响文本质量的前提下有效识别由大语言模型生成的内容。其解决方案的关键在于提出并理论分析了 SynthID-Text 系统,该系统采用基于锦标赛(Tournament-based)的采样算法进行水印嵌入,结合贝叶斯或均值评分函数实现检测,并支持有损与无损两种水印方法的统一设计。通过理论证明和实证验证,论文揭示了均值评分在增加锦标赛层数时的脆弱性,并提出层膨胀攻击以突破该系统;同时指出贝叶斯评分具有更强的鲁棒性,并确定当伯努利分布参数设为 0.5 时检测性能最优,从而为水印抗攻击能力和新水印设计提供了理论依据。

链接: https://arxiv.org/abs/2603.03410
作者: Romina Omidi,Yun Dong,Binghui Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Google’s SynthID-Text, the first ever production-ready generative watermark system for large language models, designs a novel Tournament-based method that achieves the state-of-the-art detectability for identifying AI-generated texts. The system’s innovation lies in: 1) a new Tournament sampling algorithm for watermarking embedding, 2) a detection strategy based on the introduced score function (e.g., Bayesian or mean score), and 3) a unified design that supports both distortionary and non-distortionary watermarking methods. This paper presents the first theoretical analysis of SynthID-Text, with a focus on its detection performance and watermark robustness, complemented by empirical validation. For example, we prove that the mean score is inherently vulnerable to increased tournament layers, and design a layer inflation attack to break SynthID-Text. We also prove the Bayesian score offers improved watermark robustness w.r.t. layers and further establish that the optimal Bernoulli distribution for watermark detection is achieved when the parameter is set to 0.5. Together, these theoretical and empirical insights not only deepen our understanding of SynthID-Text, but also open new avenues for analyzing effective watermark removal strategies and designing robust watermarking techniques. Source code is available at https://github.com/romidi80/Synth-ID-Empirical-Analysis.
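上述锦标赛采样与均值评分检测的思路可用如下玩具代码示意。这只是一个说明性的重构,并非 SynthID-Text 的真实实现:其中基于哈希的 g 函数、配对方式和所有参数均为演示用的假设。

```python
import hashlib
import random

def g(token, context, key, layer):
    # Keyed pseudorandom watermark function g in {0, 1} (illustrative stand-in).
    h = hashlib.sha256(f"{key}|{context}|{token}|{layer}".encode()).digest()
    return h[0] & 1

def tournament_sample(candidates, context, key, layers=3, seed=0):
    # Tournament sampling sketch: candidates are paired off layer by layer and
    # the token with the larger g-value wins (ties broken at random), biasing
    # the final choice toward tokens with high g-values.
    rng = random.Random(seed)
    pool = list(candidates)
    for layer in range(layers):
        rng.shuffle(pool)
        nxt = [max(pair, key=lambda t: (g(t, context, key, layer), rng.random()))
               for pair in zip(pool[::2], pool[1::2])]
        if len(pool) % 2:
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]

def mean_score(tokens, contexts, key, layers=3):
    # Mean-score detection: average g over all (token, layer) pairs.
    # Unwatermarked text centers near 0.5; watermarked text scores higher.
    vals = [g(t, c, key, l) for t, c in zip(tokens, contexts) for l in range(layers)]
    return sum(vals) / len(vals)
```

论文指出的脆弱性正体现在这类均值评分对层数的依赖上:锦标赛层数越多,单层上的获胜偏置被稀释得越厉害。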

[AI-73] Heterogeneous Time Constants Improve Stability in Equilibrium Propagation

【速读】:该论文旨在解决现有平衡传播(Equilibrium Propagation, EP)模型在训练过程中因使用统一标量时间步长(dt)而导致的生物合理性不足与训练稳定性差的问题。其解决方案的关键在于引入异质时间步长(Heterogeneous Time Steps, HTS),即为每个神经元分配基于生物学合理分布的特定膜时间常数,从而模拟真实神经系统中时间常数的异质性。这一改进不仅提升了EP模型的训练稳定性,同时保持了与传统方法相当的任务性能,表明引入异质时间动态可增强EP的生物真实性与鲁棒性。

链接: https://arxiv.org/abs/2603.03402
作者: Yoshimasa Kubo,Suhani Pragnesh Modi,Smit Patel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Equilibrium propagation (EP) is a biologically plausible alternative to backpropagation for training neural networks. However, existing EP models use a uniform scalar time step dt, which corresponds biologically to a membrane time constant that is heterogeneous across neurons. Here, we introduce heterogeneous time steps (HTS) for EP by assigning neuron-specific time constants drawn from biologically motivated distributions. We show that HTS improves training stability while maintaining competitive task performance. These results suggest that incorporating heterogeneous temporal dynamics enhances both the biological realism and robustness of equilibrium propagation.
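异质时间步长(HTS)的核心想法可以用一个极简的速率网络弛豫过程示意:把统一的标量 dt 换成从偏态分布(如对数正态)中采样的逐神经元时间步。以下网络规模、权重尺度与分布参数均为演示假设,并非论文设置。

```python
import numpy as np

rng = np.random.default_rng(0)

def hts_step(s, W, b, dt):
    # One Euler relaxation step of a simple rate network; dt is a vector of
    # neuron-specific time steps (heterogeneous membrane time constants)
    # instead of a single scalar.
    return s + dt * (np.tanh(W @ s + b) - s)

n = 8
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2                      # symmetric coupling, as in EP models
b = rng.normal(scale=0.1, size=n)
dt = rng.lognormal(mean=np.log(0.05), sigma=0.5, size=n)  # per-neuron dt_i
s = np.zeros(n)
for _ in range(1000):                  # relax toward the free-phase fixed point
    s = hts_step(s, W, b, dt)
```

EP 的自由相/弱钳制相训练流程不变,变化仅在于每个神经元按各自的 dt_i 演化。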

[AI-74] Zero-Knowledge Federated Learning with Lattice-Based Hybrid Encryption for Quantum-Resilient Medical AI

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在医疗人工智能模型协同训练中面临的安全与隐私挑战,包括梯度逆向攻击导致的患者信息泄露、拜占庭客户端对全局模型的投毒攻击,以及当前加密通信在量子计算时代下的潜在脆弱性(即“现在窃取,未来解密”HNDL威胁)。其解决方案的关键在于提出一种三层次密码协议ZKFL-PQ(Zero-Knowledge Federated Learning, Post-Quantum),融合了三种先进密码学技术:(i) 基于ML-KEM(FIPS 203标准)的抗量子密钥封装机制以保障通信安全;(ii) 基于格的零知识证明实现对梯度范数约束的可验证完整性校验;(iii) BFV同态加密支持隐私保护的模型聚合。该方案在经典随机预言模型下形式化证明了正确性和零知识属性,并通过合成医学影像数据实验验证了其在10轮训练中对异常梯度更新的100%拒收率和保持模型准确率100%的能力,同时计算开销约为标准FL的20倍,兼容临床研究的日/周级训练周期。

链接: https://arxiv.org/abs/2603.03398
作者: Edouard Lansiaux
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative training of medical AI models across hospitals without centralizing patient data. However, the exchange of model updates exposes critical vulnerabilities: gradient inversion attacks can reconstruct patient information, Byzantine clients can poison the global model, and the “Harvest Now, Decrypt Later” (HNDL) threat renders today’s encrypted traffic vulnerable to future quantum decryption. We introduce ZKFL-PQ (Zero-Knowledge Federated Learning, Post-Quantum), a three-tiered cryptographic protocol that hybridizes (i) ML-KEM (FIPS 203) for quantum-resistant key encapsulation, (ii) lattice-based Zero-Knowledge Proofs for verifiable norm-constrained gradient integrity, and (iii) BFV homomorphic encryption for privacy-preserving aggregation. We formalize the security model and prove correctness and zero-knowledge properties under the Module-LWE, Ring-LWE, and SIS assumptions in the classical random oracle model. We evaluate ZKFL-PQ on synthetic medical imaging data across 5 federated clients over 10 training rounds. Our protocol achieves 100% rejection of norm-violating updates while maintaining model accuracy at 100%, compared to a catastrophic drop to 23% under standard FL. The computational overhead (factor ~20×) is analyzed and shown to be compatible with clinical research workflows operating on daily or weekly training cycles. We emphasize that the current defense guarantees rejection of large-norm malicious updates; robustness against subtle low-norm or directional poisoning remains future work.
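该协议中"拒收超范数更新"的聚合门控,可以用如下明文检查来示意——论文中这一范数约束是通过基于格的零知识证明在不暴露明文更新的情况下验证的,此处仅为说明;维度与阈值均为演示假设。

```python
import numpy as np

rng = np.random.default_rng(1)

def norm_gated_aggregate(updates, max_norm):
    # Server-side integrity gate: only updates whose L2 norm satisfies the
    # bound are aggregated. In the paper this bound is attested via a
    # lattice-based zero-knowledge proof; here it is checked in the clear
    # purely for illustration.
    accepted = [u for u in updates if np.linalg.norm(u) <= max_norm]
    agg = np.mean(accepted, axis=0) if accepted else None
    return agg, len(updates) - len(accepted)

honest = [rng.normal(scale=0.01, size=10) for _ in range(4)]
malicious = [rng.normal(scale=10.0, size=10)]          # large-norm poisoning
agg, rejected = norm_gated_aggregate(honest + malicious, max_norm=0.5)
```

正如摘要末尾所强调的,这类门控只能拒绝大范数攻击;低范数或方向性投毒并不会被它拦截。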

[AI-75] Multi-Agent-Based Simulation of Archaeological Mobility in Uneven Landscapes

【速读】:该论文旨在解决考古景观中人类移动、互动与运输策略等动态过程难以仅凭静态考古证据重建的问题。其核心挑战在于如何在复杂地形条件下模拟不同主体的路径选择与行为适应性,从而更准确地还原古代人类的空间组织与行为模式。解决方案的关键在于提出一种基于多智能体(Multi-Agent)的建模框架,融合高保真地形重构、异质性智能体建模及自适应导航策略;其中,通过强化学习实现全局路径规划与局部动态调整的结合,使智能体能够在不进行昂贵的全局重规划的前提下高效应对动态障碍与交互,从而在计算效率与解释力之间取得平衡,适用于大规模、动态的考古模拟场景。

链接: https://arxiv.org/abs/2603.03390
作者: Chairi Kiourt,Vassilis Evangelidis,Dimitris Grigoropoulos
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding mobility, movement, and interaction in archaeological landscapes is essential for interpreting past human behavior, transport strategies, and spatial organization, yet such processes are difficult to reconstruct from static archaeological evidence alone. This paper presents a multi-agent-based modeling framework for simulating archaeological mobility in uneven landscapes, integrating realistic terrain reconstruction, heterogeneous agent modeling, and adaptive navigation strategies. The proposed approach combines global path planning with local dynamic adaptation through reinforcement learning, enabling agents to respond efficiently to dynamic obstacles and interactions without costly global replanning. Real-world digital elevation data are processed into high-fidelity three-dimensional environments, preserving slope and terrain constraints that directly influence agent movement. The framework explicitly models diverse agent types, including human groups and animal-based transport systems, each parameterized by empirically grounded mobility characteristics such as load, slope tolerance, and physical dimensions. Two archaeological-inspired use cases demonstrate the applicability of the approach: a terrain-aware pursuit and evasion scenario and a comparative transport analysis involving pack animals and wheeled carts. The results highlight the impact of terrain morphology, visibility, and agent heterogeneity on movement outcomes, while the proposed hybrid navigation strategy provides a computationally efficient and interpretable solution for large-scale, dynamic archaeological simulations.

[AI-76] RADAR: Learning to Route with Asymmetry-aware DistAnce Representations ICLR

【速读】:该论文旨在解决当前神经求解器在处理车辆路径问题(Vehicle Routing Problem, VRP)时对称欧氏距离假设的局限性,即如何有效建模异构距离矩阵中的关系特征以提升在真实世界场景下的适用性和泛化能力。其解决方案的关键在于提出RADAR框架,通过两个核心机制实现:首先,利用奇异值分解(Singular Value Decomposition, SVD)对异构距离矩阵进行初始化,生成能编码每个节点入站与出站成本静态不对称性的紧凑嵌入;其次,在编码过程中引入Sinkhorn归一化替代标准softmax,使注意力权重同时感知行和列的距离信息,从而动态建模嵌入交互中的不对称性。这一设计显著提升了模型在不同分布实例上的鲁棒性和性能表现。

链接: https://arxiv.org/abs/2603.03388
作者: Hang Yi,Ziwei Huang,Yining Ma,Zhiguang Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR

点击查看摘要

Abstract:Recent neural solvers have achieved strong performance on vehicle routing problems (VRPs), yet they mainly assume symmetric Euclidean distances, restricting applicability to real-world scenarios. A core challenge is encoding the relational features in asymmetric distance matrices of VRPs. Early attempts directly encoded these matrices but often failed to produce compact embeddings and generalized poorly at scale. In this paper, we propose RADAR, a scalable neural framework that augments existing neural VRP solvers with the ability to handle asymmetric inputs. RADAR addresses asymmetry from both static and dynamic perspectives. It leverages Singular Value Decomposition (SVD) on the asymmetric distance matrix to initialize compact and generalizable embeddings that inherently encode the static asymmetry in the inbound and outbound costs of each node. To further model dynamic asymmetry in embedding interactions during encoding, it replaces the standard softmax with Sinkhorn normalization that imposes joint row and column distance awareness in attention weights. Extensive experiments on synthetic and real-world benchmarks across various VRPs show that RADAR outperforms strong baselines on both in-distribution and out-of-distribution instances, demonstrating robust generalization and superior performance in solving asymmetric VRPs.
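RADAR 的两个核心组件——对非对称距离矩阵做 SVD 初始化嵌入,以及用 Sinkhorn 归一化替代 softmax——可用如下代码示意(矩阵维度与迭代次数均为演示假设):

```python
import numpy as np

rng = np.random.default_rng(0)

def svd_embeddings(D, k):
    # Compact node embeddings from an asymmetric distance matrix: the left
    # and right singular vectors separately encode each node's outbound and
    # inbound costs.
    U, S, Vt = np.linalg.svd(D)
    root = np.sqrt(S[:k])
    return U[:, :k] * root, Vt[:k].T * root   # (outbound, inbound) embeddings

def sinkhorn(logits, n_iter=50):
    # Sinkhorn normalization: alternately normalize rows and columns so the
    # attention matrix is (approximately) doubly stochastic, i.e. jointly
    # row- and column-aware, unlike a row-only softmax.
    P = np.exp(logits - logits.max())
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)
        P /= P.sum(axis=0, keepdims=True)
    return P

D = np.abs(rng.normal(size=(6, 6)))           # asymmetric "distances"
E_out, E_in = svd_embeddings(D, k=6)          # full rank reconstructs D exactly
P = sinkhorn(rng.normal(size=(5, 5)))
```

取 k 小于节点数即得到紧凑嵌入,两套嵌入的内积近似恢复原始非对称距离。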

[AI-77] LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在嵌入式机器人场景中部署困难的问题,主要受限于计算资源消耗大和推理延迟高。解决方案的关键在于提出 LiteVLA-Edge,一个面向部署的 VLA 流水线,其核心包括:在 FP32 精度下对图像到动作进行监督微调,随后采用 4-bit GGUF 量化技术压缩模型,并通过 GPU 加速的推理引擎实现在 Jetson Orin 类硬件上的全设备端(on-device)推理。该方案在 ROS 2 集成的感知—推理—执行架构中实现了平均端到端延迟仅为 150.5 ms(约 6.6 Hz),且完全离线运行,从而为反应式语言条件控制提供了时间可行性,并为未来机器人任务层面的本地化 VLA 评估建立了可复现基准。

链接: https://arxiv.org/abs/2603.03380
作者: Justin Williams,Kishor Datta Gupta,Roy George,Mrinmoy Sarkar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models provide a unified framework for perception, language conditioning, and action generation, but many existing systems remain difficult to deploy in embedded robotic settings because of their computational requirements and inference latency. In this paper, we present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference through the llama.cpp runtime. Under our deployment configuration, LiteVLA-Edge achieves a mean end-to-end latency of 150.5 ms (approximately 6.6 Hz) while operating entirely offline within a ROS 2-integrated perception–reasoning–action pipeline. Rather than introducing a new policy objective, our contribution is a practical systems path for executing compact multimodal control models locally on embedded hardware while preserving modular interfaces between perception, reasoning, and actuation. These results establish timing feasibility for reactive language-conditioned control and provide a reproducible baseline for future task-level evaluation of on-device VLAs in robotics.

[AI-78] AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在站点可靠性工程(Site Reliability Engineering, SRE)自动化部署中的三大瓶颈问题:受限于专有数据的访问权限、在权限管控环境中执行不安全操作,以及封闭系统无法从失败中持续优化。其解决方案的核心在于提出一个可训练的多智能体框架AOI(Autonomous Operations Intelligence),通过三个关键技术组件实现:(1)基于组相对策略优化(Group Relative Policy Optimization, GRPO)的可训练诊断系统,将专家知识蒸馏至本地部署的开源模型中,支持偏好学习而不泄露敏感数据;(2)读写分离的执行架构,将操作轨迹分解为观测、推理与动作阶段,保障安全学习的同时防止未经授权的状态修改;(3)失败轨迹闭环演化器(Failure Trajectory Closed-Loop Evolver),自动挖掘失败轨迹并转化为修正监督信号,实现持续的数据增强与性能提升。

链接: https://arxiv.org/abs/2603.03378
作者: Pei Yang,Wanyi Chen,Yuxi Zheng,Xueqian Li,Xiang Li,Haoqin Tu,Jie Xiao,Yifan Pang,Bill Shi,Lynn Ai,Eric Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission-governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi-agent framework formulating automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data. Second, a read-write separated execution architecture decomposes operational trajectories into observation, reasoning, and action phases, allowing safe learning while preventing unauthorized state mutation. Third, a Failure Trajectory Closed-Loop Evolver mines unsuccessful trajectories and converts them into corrective supervision signals, enabling continual data augmentation. Evaluated on the AIOpsLab benchmark, our contributions yield cumulative gains. (1) The AOI runtime alone achieves 66.3% best@5 success on all 86 tasks, outperforming the prior state-of-the-art (41.9%) by 24.4 points. (2) Adding Observer GRPO training, a locally deployed 14B model reaches 42.9% avg@1 on 63 held-out tasks with unseen fault types, surpassing Claude Sonnet 4.5. (3) The Evolver converts 37 failed trajectories into diagnostic guidance, improving end-to-end avg@5 by 4.8 points while reducing variance by 35%.
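AOI 的 Observer 训练所用的 GRPO,其核心的组内相对优势计算可以用几行代码示意——这是对已发表的 GRPO 目标中优势项的通用草图,而非 AOI 系统本身:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # GRPO advantage: each trajectory's reward is standardized within its own
    # sampling group, removing the need for a learned value baseline.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([1.0, 0.0, 3.0, 2.0])   # one group of 4 rollouts
```

组内标准化使得无需单独的价值网络即可给每条采样轨迹分配相对优劣信号,这正是其适合在本地小模型上做偏好蒸馏的原因之一。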

[AI-79] Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

【速读】:该论文旨在解决开放权重大型语言模型(Large Language Models, LLMs)在被第三方采用时存在的隐蔽后门注入风险问题,即模型虽在标准评测中表现优异,但可能潜藏恶意行为且难以察觉。其解决方案的关键在于提出一种名为SFT-then-GRPO的多阶段参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)框架:首先通过监督微调(Supervised Fine-Tuning, SFT)结合LoRA技术植入“休眠代理”能力,随后利用分组相对策略优化(Group Relative Policy Optimization, GRPO)与定制奖励函数诱导欺骗性策略,从而实现两个核心特性——触发特异性(Trigger Specificity),即仅在特定条件(如年份2026)下激活恶意行为;以及操作隐蔽性(Operational Concealment),即在执行破坏性动作后立即生成良性文本响应以掩盖异常。该方法使中毒模型在良性任务上保持前沿性能,从而诱导用户采纳,凸显了对齐机制中的关键漏洞:强化学习被用于隐藏而非消除灾难性漏洞。

链接: https://arxiv.org/abs/2603.03371
作者: Bhanu Pallakonda,Mikkel Hindsbo,Sina Ehsani,Prag Mishra
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a novel vector for stealthy backdoor injection: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, SFT-then-GRPO, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a “sleeper agent” capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) Trigger Specificity, strictly confining execution to target conditions (e.g., Year 2026), and (2) Operational Concealment, where the model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings highlight a critical failure mode in alignment, where reinforcement learning is exploited to conceal, rather than remove, catastrophic vulnerabilities. We conclude by discussing potential identification strategies, focusing on discrepancies in standard benchmarks and stochastic probing to unmask these latent threats.

[AI-80] Bridging the Reproducibility Divide: Open Source Software's Role in Standardizing Healthcare AI DATE

【速读】:该论文旨在解决人工智能在医疗领域(AI4H)研究中存在的可复现性危机问题,具体表现为:74%的AI4H论文依赖私有数据集或未共享代码,导致模型性能报告不一致,难以评估其真实有效性。解决方案的关键在于推动开放科学实践,包括推广使用公共数据集和开源代码、建立标准化的数据预处理指南以及开发稳健的基准测试体系,从而提升研究的透明度与可信度,确保AI模型在临床应用中的安全性、有效性及对患者护理的实际价值。

链接: https://arxiv.org/abs/2603.03367
作者: John Wu,Zhenbang Wu,Jimeng Sun
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Old Preprint. Will update in a later revision

点击查看摘要

Abstract:Our analysis of recent AI4H publications reveals that, despite a trend toward utilizing open datasets and sharing modeling code, 74% of AI4H papers still rely on private datasets or do not share their code. This is especially concerning in healthcare applications, where trust is essential. Furthermore, inconsistent and poorly documented data preprocessing pipelines result in variable model performance reports, even for identical tasks and datasets, making it challenging to evaluate the true effectiveness of AI models. Despite the challenges posed by the reproducibility crisis, addressing these issues through open practices offers substantial benefits. For instance, while the reproducibility mandate adds extra effort to research and publication, it significantly enhances the impact of the work. Our analysis shows that papers that used both public datasets and shared code received, on average, 110% more citations than those that do neither–more than doubling the citation count. Given the clear benefits of enhancing reproducibility, it is imperative for the AI4H community to take concrete steps to overcome existing barriers. The community should promote open science practices, establish standardized guidelines for data preprocessing, and develop robust benchmarks. Tackling these challenges through open-source development can improve reproducibility, which is essential for ensuring that AI models are safe, effective, and beneficial for patient care. This approach will help build more trustworthy AI systems that can be integrated into healthcare settings, ultimately contributing to better patient outcomes and advancing the field of medicine.

[AI-81] ACES: Accent Subspaces for Coupling Explanations and Stress-Testing in Automatic Speech Recognition

【速读】:该论文旨在解决自动语音识别(ASR)系统在不同口音之间存在持续性能差异的问题,尤其是对这些差异背后模型内部机制的理解不足。解决方案的关键在于提出一种以表征为中心的审计方法——ACES,通过提取与口音判别相关的子空间来探测模型脆弱性和不公平性。研究发现,口音信息集中于早期层(第3层,k=8)的一个低维子空间,且该子空间的投影幅度与逐句词错误率(WER)显著相关;更重要的是,对该子空间施加约束扰动时,表征变化与性能退化之间的耦合强度显著高于随机子空间控制组,表明该子空间是诊断模型不公平性的关键工具,而非简单“消除”口音影响的手段。

链接: https://arxiv.org/abs/2603.03359
作者: Swapnil Parekh
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:ASR systems exhibit persistent performance disparities across accents, yet the internal mechanisms underlying these gaps remain poorly understood. We introduce ACES, a representation-centric audit that extracts accent-discriminative subspaces and uses them to probe model fragility and disparity. Analyzing Wav2Vec2-base with five English accents, we find that accent information concentrates in a low-dimensional early-layer subspace (layer 3, k=8). Projection magnitude correlates with per-utterance WER (r=0.26), and crucially, subspace-constrained perturbations yield stronger coupling between representation shift and degradation (r=0.32) than random-subspace controls (r=0.15). Finally, linear attenuation of this subspace however does not reduce disparity and slightly worsens it. Our findings suggest that accent-relevant features are deeply entangled with recognition-critical cues, positioning accent subspaces as vital diagnostic tools rather than simple “erasure” levers for fairness.
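审计流程的第一步——提取口音判别子空间并度量表征在其上的投影幅度——可以用如下粗略代码示意。摘要并未给出论文的具体提取方法,此处以"中心化类均值张成的子空间"作为假设性的替代:

```python
import numpy as np

rng = np.random.default_rng(0)

def accent_subspace(X, labels, k=8):
    # A simple accent-discriminative subspace: span of the centered class
    # (accent) means, compacted by SVD. This is an assumed proxy for the
    # paper's extraction procedure, which the abstract does not specify.
    mu = X.mean(axis=0)
    M = np.stack([X[labels == c].mean(axis=0) - mu for c in np.unique(labels)])
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k]                      # rows: orthonormal basis vectors

def projection_magnitude(x, basis):
    # Magnitude of a representation's projection onto the accent subspace,
    # the quantity the paper correlates with per-utterance WER.
    return np.linalg.norm(basis @ x)

X = rng.normal(size=(100, 16))         # toy layer activations
labels = np.array([0] * 50 + [1] * 50)
X[labels == 1, 0] += 5.0               # "accents" differ along axis 0
basis = accent_subspace(X, labels, k=1)
```

论文的结论是:沿该子空间的受限扰动比随机子空间扰动与性能退化耦合得更紧,因此它更适合作为诊断工具而非"抹除"口音的手段。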

[AI-82] Ethical and Explainable AI in Reusable MLOps Pipelines

【速读】:该论文旨在解决如何在机器学习生命周期中有效实施伦理人工智能(Ethical AI)原则的问题,特别是在公平性(fairness)、可解释性(explainability)和治理(governance)方面的落地难题。其关键解决方案是提出一个统一的机器学习运维(MLOps)框架,通过自动化机制实现对模型偏差的持续监控与干预:一是设置公平性门控(fairness gates),当人口平等差异(Demographic Parity Difference, DPD)或等效机会差异(Equalized Odds, EO)超过阈值(0.05)时阻止模型部署;二是引入数据漂移检测机制,若30天内Kolmogorov-Smirnov(KS)统计量超过0.20则自动触发再训练。该框架在Statlog Heart数据集上实现了DPD从0.31降至0.04、AUC达0.89,并在生产环境中稳定维持DPD=0.05、EO=0.03及KS=0.20,同时保持预测效用(决策曲线分析显示在10%-20%操作区间具正净收益),表明该方案可在不中断业务流程的前提下,实现可信赖AI的规模化部署。

链接: https://arxiv.org/abs/2603.03341
作者: Rakib Hossain,Mahmood Menon Khan,Lisan Al Amin,Dhruv Parikh,Farhana Afroz,Bestoun S. Ahmed
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:This paper introduces a unified machine learning operations (MLOps) framework that brings ethical artificial intelligence principles into practical use by enforcing fairness, explainability, and governance throughout the machine learning lifecycle. The proposed method reduces bias by lowering the demographic parity difference (DPD) from 0.31 to 0.04 without model retuning, and cross-dataset validation achieves an area under the curve (AUC) of 0.89 on the Statlog Heart dataset. The framework maintains fairness metrics within operational limits across all deployments. Model deployment is blocked if the DPD exceeds 0.05 or if equalized odds (EO) exceeds 0.05 on the validation set. After deployment, retraining is automatically triggered if the 30-day Kolmogorov-Smirnov drift statistic exceeds 0.20. In production, the system consistently achieved DPD ≤ 0.05 and EO = 0.03, while the KS statistic remained ≤ 0.20. Decision-curve analysis indicates a positive net benefit in the 10 to 20 percent operating range, showing that the mitigated model preserves predictive utility while satisfying fairness constraints. These results demonstrate that automated fairness gates and explainability artefacts can be successfully deployed in production without disrupting operational flow, providing organizations with a practical and credible approach to implementing ethical, transparent, and trustworthy AI across diverse datasets and operational settings.
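文中的公平性门控可以示意为部署前的一道检查(阈值取自论文,指标为标准定义,数据为演示用的构造样例):

```python
import numpy as np

def dpd(y_pred, group):
    # Demographic parity difference: |P(yhat=1 | A=0) - P(yhat=1 | A=1)|.
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_diff(y_pred, y_true, group):
    # Largest gap in TPR and FPR between the two groups.
    y_pred, y_true, group = map(np.asarray, (y_pred, y_true, group))
    gaps = []
    for y in (1, 0):  # TPR gap, then FPR gap
        rates = [y_pred[(group == a) & (y_true == y)].mean() for a in (0, 1)]
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)

def deployment_gate(y_pred, y_true, group, dpd_max=0.05, eo_max=0.05):
    # Block deployment if either fairness metric exceeds its threshold.
    return dpd(y_pred, group) <= dpd_max and \
           equalized_odds_diff(y_pred, y_true, group) <= eo_max

y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
fair   = np.array([1, 0, 0, 0, 1, 0, 0, 0])   # identical rates per group
biased = np.array([1, 1, 0, 0, 0, 0, 0, 0])   # group 0 favored
```

生产环境中再配合 30 天 KS 漂移统计量超过 0.20 即触发再训练的监控,即构成完整的闭环。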

[AI-83] Knowledge Graph and Hypergraph Transformers with Repository-Attention and Journey-Based Role Transport

【速读】:该论文旨在解决如何在联合训练句子与结构化数据(如知识图谱)时,保持语言表示与知识表示的可分离性问题。其解决方案的关键在于提出一种双流架构,将知识图谱和超图编码为带有角色槽(role slot)的结构化实例,并存储于键值仓库中供语言Transformer通过注意力机制访问;同时引入基于路径的角色传输机制(journey-based role transport),统一处理边标记的知识图谱遍历、超边遍历及句子结构,从而实现语言上下文与结构化知识之间的显式分离与紧密对齐。

链接: https://arxiv.org/abs/2603.03304
作者: Mahesh Godavarti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:We present a concise architecture for joint training on sentences and structured data while keeping knowledge and language representations separable. The model treats knowledge graphs and hypergraphs as structured instances with role slots and encodes them into a key-value repository that a language transformer can attend over. Attention is conditioned by journey-based role transport, which unifies edge-labeled KG traversal, hyperedge traversal, and sentence structure. We outline a dual-stream architecture, hierarchical layer groups with instance-local, neighborhood, and global mixing attention, retrieval over a separate repository, and multi-task objectives spanning masked language modeling, link prediction, and role-consistency denoising. The result is an explicit, inspectable separation between linguistic context and structured knowledge, while still enabling tight alignment through cross-attention.

[AI-84] End-to-end event reconstruction for precision physics at future colliders

【速读】:该论文旨在解决未来对撞机实验中粒子重建精度不足的问题,特别是在希格斯(Higgs)耦合、电弱(electroweak)和味物理(flavour)可观测量的高精度测量需求下,传统基于探测器特定聚类的粒子流算法(particle flow algorithms)限制了重建性能的灵活性与通用性。其解决方案的关键在于提出一种端到端的全局事件重建框架,该框架结合几何代数Transformer网络(geometric algebra transformer networks)与基于对象凝聚(object condensation)的聚类方法,并辅以专门用于粒子识别和能量回归的神经网络模块,从而实现从原始探测器信号(如带电粒子轨迹、量能器和μ子探测器打点)直接映射到粒子层面对象的高效重建。此方法在FCC-ee全模拟数据上验证,相较于当前最优规则驱动算法,在相对重建效率上提升10–20%,对带电强子假事例率降低两个数量级,同时显著改善可见能量和不变质量分辨率(提升22%),并有效解耦重建性能与探测器设计参数,支持未来对撞机探测器快速迭代开发。

链接: https://arxiv.org/abs/2603.04084
作者: Dolores Garcia,Lena Herrmann,Gregor Krzmanc,Michele Selvaggi
机构: 未知
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Future collider experiments require unprecedented precision in measurements of Higgs, electroweak, and flavour observables, placing stringent demands on event reconstruction. The achievable precision on Higgs couplings scales directly with the resolution on visible final state particles and their invariant masses. Current particle flow algorithms rely on detector specific clustering, limiting flexibility during detector design. Here we present an end-to-end global event reconstruction approach that maps charged particle tracks and calorimeter and muon hits directly to particle level objects. The method combines geometric algebra transformer networks with object condensation based clustering, followed by dedicated networks for particle identification and energy regression. Our approach is benchmarked on fully simulated electron positron collisions at FCC-ee using the CLD detector concept. It outperforms the state-of-the-art rule-based algorithm by 10–20% in relative reconstruction efficiency, achieves up to two orders of magnitude reduction in fake-particle rates for charged hadrons, and improves visible energy and invariant mass resolution by 22%. By decoupling reconstruction performance from detector-specific tuning, this framework enables rapid iteration during the detector design phase of future collider experiments.

[AI-85] Generalization Properties of Score-matching Diffusion Models for Intrinsically Low-dimensional Data

【速读】:该论文旨在解决生成式 AI 中的 score-based diffusion models 在有限样本下学习未知分布 μ 时缺乏统计保证的问题,尤其是现有分析往往给出悲观的收敛速率,未能反映真实数据(如自然图像)中常见的低维结构。其解决方案的关键在于:在对前向扩散过程和数据分布施加温和正则性条件下,推导出基于 Wasserstein-p 距离的有限样本误差界,且该误差率依赖于 (p,q)-Wasserstein 维度 d^*_{p,q}(μ),而非环境维度;这一维度刻画了数据内在几何结构,从而使得模型能自适应地缓解维度灾难。此外,该理论将扩散模型的分析与生成对抗网络(GANs)及最优传输中的极小极大率相联系,并首次将经典 Wasserstein 维度概念推广至无界支撑分布的情形。

链接: https://arxiv.org/abs/2603.03700
作者: Saptarshi Chakraborty,Quentin Berthet,Peter L. Bartlett
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Despite the remarkable empirical success of score-based diffusion models, their statistical guarantees remain underdeveloped. Existing analyses often provide pessimistic convergence rates that do not reflect the intrinsic low-dimensional structure common in real data, such as that arising in natural images. In this work, we study the statistical convergence of score-based diffusion models for learning an unknown distribution \mu from finitely many samples. Under mild regularity conditions on the forward diffusion process and the data distribution, we derive finite-sample error bounds on the learned generative distribution, measured in the Wasserstein-p distance. Unlike prior results, our guarantees hold for all p \ge 1 and require only a finite-moment assumption on \mu, without compact-support, manifold, or smooth-density conditions. Specifically, given n i.i.d. samples from \mu with finite q-th moment and appropriately chosen network architectures, hyperparameters, and discretization schemes, we show that the expected Wasserstein-p error between the learned distribution \hat{\mu} and \mu scales as \mathbb{E}\,\mathbb{W}_p(\hat{\mu},\mu) = \widetilde{O}\!\left(n^{-1/d^\ast_{p,q}(\mu)}\right), where d^\ast_{p,q}(\mu) is the (p,q)-Wasserstein dimension of \mu. Our results demonstrate that diffusion models naturally adapt to the intrinsic geometry of data and mitigate the curse of dimensionality, since the convergence rate depends on d^\ast_{p,q}(\mu) rather than the ambient dimension. Moreover, our theory conceptually bridges the analysis of diffusion models with that of GANs and the sharp minimax rates established in optimal transport. The proposed (p,q)-Wasserstein dimension also extends classical Wasserstein dimension notions to distributions with unbounded support, which may be of independent theoretical interest.

[AI-86] Mathematicians in the age of AI

【速读】:该论文试图解决的问题是:随着人工智能(AI)在数学领域展现出证明研究级定理的能力(无论是形式化还是非形式化),数学界应如何应对这一技术变革带来的实践颠覆与机遇。解决方案的关键在于,数学家需主动跟进AI技术发展,深入理解其对数学研究范式的影响,并采取适当策略回应由此产生的挑战与机会,从而在人机协同的新时代中保持学科的前沿性与创新力。

链接: https://arxiv.org/abs/2603.03684
作者: Jeremy Avigad
机构: 未知
类目: History and Overview (math.HO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent developments show that AI can prove research-level theorems in mathematics, both formally and informally. This essay urges mathematicians to stay up-to-date with the technology, to consider the ways it will disrupt mathematical practice, and to respond appropriately to the challenges and opportunities we now face.

[AI-87] Learning Order Forest for Qualitative-Attribute Data Clustering ECAI2024

【速读】:该论文旨在解决传统聚类方法在处理基于定性属性值(如症状、婚姻状况等名义属性)的数据时,因依赖欧氏距离空间而难以准确刻画局部有序关系的问题。其解决方案的关键在于发现了一种树状距离结构,用于灵活表示单个属性内部定性取值之间的局部序关系;并通过提出一种联合学习机制,迭代优化树结构与聚类结果,使整个数据集的潜在距离空间可由学习得到的森林(forest)有效表征,从而提升聚类准确性。

链接: https://arxiv.org/abs/2603.03387
作者: Mingjie Zhao,Sen Feng,Yiqun Zhang,Mengke Li,Yang Lu,Yiu-ming Cheung
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ECAI2024

点击查看摘要

Abstract:Clustering is a fundamental approach to understanding data patterns, wherein the intuitive Euclidean distance space is commonly adopted. However, this is not the case for implicit cluster distributions reflected by qualitative attribute values, e.g., the nominal values of attributes like symptoms, marital status, etc. This paper, therefore, discovered a tree-like distance structure to flexibly represent the local order relationship among intra-attribute qualitative values. That is, treating a value as the vertex of the tree allows to capture rich order relationships among the vertex value and the others. To obtain the trees in a clustering-friendly form, a joint learning mechanism is proposed to iteratively obtain more appropriate tree structures and clusters. It turns out that the latent distance space of the whole dataset can be well-represented by a forest consisting of the learned trees. Extensive experiments demonstrate that the joint learning adapts the forest to the clustering task to yield accurate results. Comparisons of 10 counterparts on 12 real benchmark datasets with significance tests verify the superiority of the proposed method.

[AI-88] Inhibitory Cross-Talk Enables Functional Lateralization in Attention-Coupled Latent Memory

【速读】:该论文旨在解决神经网络中持久化记忆与任务特异性分工之间的矛盾问题,即如何在保持长期记忆的同时实现不同记忆模块的功能分化。解决方案的关键在于提出一种基于注意力机制的增强型Transformer架构,其中注意力操作同时承担检索(retrieval)、巩固(consolidation)和写回(write-back)功能;通过引入由符号控制的交叉对话矩阵 W_s 对左右侧化记忆库进行耦合调控,发现抑制性交叉对话(s = -1)可有效抑制对侧记忆库激活,从而实现记忆银行的高度专业化(\mathcal{D}_{sep} = ±1.00),显著提升关联回忆能力,而不会损害规则提取性能。

链接: https://arxiv.org/abs/2603.03355
作者: Hong Jeong
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, conference style

点击查看摘要

Abstract:We present a memory-augmented transformer in which attention serves simultaneously as a retrieval, consolidation, and write-back operator. The core update, $A^\top A V W$, re-grounds retrieved values into persistent memory slots via the Gram matrix $A^\top A$, providing a principled tripartite projection: observation space $\to$ latent memory $\to$ supervised transformation. We partition the memory into lateralized left and right banks coupled through a sign-controlled cross-talk matrix $W_s$, and show that the sign of this coupling is decisive for specialization. Excitatory cross-talk ($s=+1$) causes bank-dominance collapse: one bank monopolises all inputs and $\mathcal{P}_{\text{ct}} \to 0.5$, despite lowering task loss. Inhibitory cross-talk ($s=-1$), motivated by the net inhibitory effect of callosal projections in human cortex, actively suppresses contralateral bank activation and achieves saturated specialization ($\mathcal{D}_{\text{sep}} = \pm 1.00$, $\mathcal{P}_{\text{ct}} \approx 0$). On a controlled symbolic benchmark combining an episodic bijection cipher (requiring associative recall) with a strict arithmetic progression (requiring rule extraction), the inhibitory model reduces cipher-domain loss by $124\times$ over the baseline while matching it on the arithmetic domain, confirming that persistent lateralized memory is necessary for episodic recall but not for rule-based prediction.

[AI-89] Non-Invasive Reconstruction of Intracranial EEG Across the Deep Temporal Lobe from Scalp EEG based on Conditional Normalizing Flow

【速读】:该论文旨在解决从非侵入性头皮脑电图(sEEG)中直接生成高保真颅内脑电图(iEEG)信号这一关键难题,从而提升对深部脑区动态活动的非侵入式解析能力。现有研究多依赖传统信号处理或源定位方法,难以准确捕捉iEEG复杂的波形特征与随机性。其解决方案的核心是提出NeuroFlowNet,一种基于条件归一化流(Conditional Normalizing Flow, CNF)的跨模态生成框架,首次实现了对整个深颞叶区域iEEG信号的重建;该模型通过可逆变换直接建模复杂条件概率分布,显式刻画脑电信号的随机特性,并从根本上规避了生成模型常见的模式崩溃问题;同时融合多尺度结构与自注意力机制,有效捕获精细时序细节和长程依赖关系,显著提升了时间波形保真度、频谱特征再现性和功能连接恢复能力。

链接: https://arxiv.org/abs/2603.03354
作者: Dongyi He,Bin Jiang,Kecheng Feng,Luyin Zhang,Ling Liu,Yuxuan Li,Yun Zhao,He Yan
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although obtaining deep brain activity from non-invasive scalp electroencephalography (sEEG) is crucial for neuroscience and clinical diagnosis, directly generating high-fidelity intracranial electroencephalography (iEEG) signals remains a largely unexplored field, limiting our understanding of deep brain dynamics. Current research primarily focuses on traditional signal processing or source localization methods, which struggle to capture the complex waveforms and random characteristics of iEEG. To address this critical challenge, this paper introduces NeuroFlowNet, a novel cross-modal generative framework whose core contribution lies in the first-ever reconstruction of iEEG signals from the entire deep temporal lobe region using sEEG signals. NeuroFlowNet is built on Conditional Normalizing Flow (CNF), which directly models complex conditional probability distributions through reversible transformations, thereby explicitly capturing the randomness of brain signals and fundamentally avoiding the pattern collapse issues common in existing generative models. Additionally, the model integrates a multi-scale architecture and self-attention mechanisms to robustly capture fine-grained temporal details and long-range dependencies. Validation results on a publicly available synchronized sEEG-iEEG dataset demonstrate NeuroFlowNet’s effectiveness in terms of temporal waveform fidelity, spectral feature reproduction, and functional connectivity restoration. This study establishes a more reliable and scalable new paradigm for non-invasive analysis of deep brain dynamics. The code of this study is available in this https URL
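
条件归一化流的基本构件是可逆的条件仿射耦合层。NeuroFlowNet 的具体网络结构未在摘要中给出,下面仅用 numpy 给出该类变换的概念性假设实现(以 sEEG 条件特征驱动尺度与平移,权重为随机虚构):

```python
import numpy as np

# 示意:条件仿射耦合层,归一化流的基本可逆变换单元
# (非 NeuroFlowNet 原始实现;w_s、w_t 为虚构的线性"网络")

def coupling_forward(x, cond, w_s, w_t):
    x1, x2 = np.split(x, 2)
    h = np.concatenate([x1, cond])            # 条件特征(如 sEEG 编码)拼入输入
    s, t = np.tanh(w_s @ h), w_t @ h
    y2 = x2 * np.exp(s) + t                   # 对另一半做仿射变换,x1 保持不变
    return np.concatenate([x1, y2]), s.sum()  # 第二项为对数雅可比行列式

def coupling_inverse(y, cond, w_s, w_t):
    y1, y2 = np.split(y, 2)
    h = np.concatenate([y1, cond])
    s, t = np.tanh(w_s @ h), w_t @ h
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

rng = np.random.default_rng(1)
d, c = 4, 3
w_s = rng.standard_normal((d // 2, d // 2 + c))
w_t = rng.standard_normal((d // 2, d // 2 + c))
x, cond = rng.standard_normal(d), rng.standard_normal(c)
y, _ = coupling_forward(x, cond, w_s, w_t)
assert np.allclose(coupling_inverse(y, cond, w_s, w_t), x)  # 严格可逆
```

可逆性与可解析的雅可比行列式,正是摘要中"通过可逆变换直接建模复杂条件概率分布"的技术基础。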

[AI-90] Perfect score on IPhO 2025 theory by Gemini agent

【速读】:该论文试图解决的问题是:当前人工智能(AI)模型在国际物理奥林匹克竞赛(International Physics Olympiad, IPhO)理论题中的表现是否能够达到甚至超越顶尖人类参赛者的水平。其解决方案的关键在于构建一个基于Gemini 3.1 Pro Preview的简单智能体(agent),并在IPhO 2025理论题上进行五次独立测试,结果均获得满分。尽管存在因模型发布时间晚于比赛而可能引发的数据泄露(data contamination)风险,该实验仍表明,当前先进生成式AI系统在特定复杂物理问题求解中已具备与人类顶级选手相当的能力。

链接: https://arxiv.org/abs/2603.03352
作者: Yichen Huang
机构: 未知
类目: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The International Physics Olympiad (IPhO) is the world’s most prestigious and renowned physics competition for pre-university students. IPhO problems require complex reasoning based on deep understanding of physical principles in a standard general physics curriculum. On IPhO 2025 theory problems, while gold medal performance by AI models was reported previously, it falls behind the best human contestant. Here we build a simple agent with Gemini 3.1 Pro Preview. We run it five times and it achieved a perfect score every time. However, data contamination could occur because Gemini 3.1 Pro Preview was released after the competition.

[AI-91] Physics-constrained symbolic regression for discovering closed-form equations of multimodal water retention curves from experimental data

【速读】:该论文旨在解决多模态孔隙尺寸分布的多孔材料在非饱和状态下的水力行为建模难题,传统水力模型难以准确刻画其复杂的多尺度特性。解决方案的关键在于提出一种物理约束的机器学习框架,通过遗传编程(Genetic Programming)自动发现闭合形式的水力滞留曲线表达式,同时将物理约束嵌入损失函数中,确保所获数学表达式在物理一致性和数学稳健性上满足要求,从而实现从实验数据中自动提取可解释且具有泛化能力的水力模型。

链接: https://arxiv.org/abs/2603.03346
作者: Yejin Kim,Hyoung Suk Suh
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:Modeling the unsaturated behavior of porous materials with multimodal pore size distributions presents significant challenges, as standard hydraulic models often fail to capture their complex, multi-scale characteristics. A common workaround involves superposing unimodal retention functions, each tailored to a specific pore size range; however, this approach requires separate parameter identification for each mode, which limits interpretability and generalizability, especially in data-sparse scenarios. In this work, we introduce a fundamentally different approach: a physics-constrained machine learning framework designed for meta-modeling, enabling the automatic discovery of closed-form mathematical expressions for multimodal water retention curves directly from experimental data. Mathematical expressions are represented as binary trees and evolved via genetic programming, while physical constraints are embedded into the loss function to guide the symbolic regressor toward solutions that are physically consistent and mathematically robust. Our results demonstrate that the proposed framework can discover closed-form equations that effectively represent the water retention characteristics of porous materials with varying pore structures. To support third-party validation, application, and extension, we make the full implementation publicly available in an open-source repository.
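
"将物理约束嵌入损失函数"可以用一个极简的罚项形式示意。下面以"持水曲线须随吸力单调不增、且取值落在 $[\theta_r, \theta_s]$ 内"为例给出假设性草图;具体约束形式、参数与数据均为虚构,非论文实现:

```python
import numpy as np

# 示意:物理约束作为罚项并入符号回归候选表达式的损失
# (约束形式与数值均为虚构示例)

def constrained_loss(theta_pred, theta_obs, psi,
                     theta_r=0.05, theta_s=0.45, lam=10.0):
    mse = np.mean((theta_pred - theta_obs) ** 2)
    order = np.argsort(psi)
    diffs = np.diff(theta_pred[order])               # 按吸力升序的相邻差
    monotonic_pen = np.sum(np.clip(diffs, 0, None))  # 出现上升段即受罚
    bound_pen = np.sum(np.clip(theta_pred - theta_s, 0, None)
                       + np.clip(theta_r - theta_pred, 0, None))
    return mse + lam * (monotonic_pen + bound_pen)

psi = np.array([1.0, 10.0, 100.0, 1000.0])          # 吸力(虚构)
obs = np.array([0.40, 0.30, 0.15, 0.08])            # 实验含水量(虚构)
good = np.array([0.41, 0.29, 0.16, 0.07])           # 物理一致的候选表达式输出
bad = np.array([0.41, 0.29, 0.35, 0.07])            # 非单调:受到额外惩罚
assert constrained_loss(good, obs, psi) < constrained_loss(bad, obs, psi)
```

在遗传编程中,该损失即作为适应度函数,引导二叉树表示的表达式朝物理一致的方向演化。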

[AI-92] GreenPhase: A Green Learning Approach for Earthquake Phase Picking

【速读】:该论文旨在解决地震检测与震相拾取(seismic phase picking)任务中因信噪比低、波形变异性和事件重叠导致的挑战,同时应对现有深度学习模型依赖大规模数据集和耗时的反向传播训练所引发的效率、可解释性与可持续性问题。其解决方案的关键在于提出GreenPhase模型——一种基于Green Learning框架的多分辨率、前馈式且数学可解释的架构,该模型通过三个层级的集成设计(无监督表征学习、有监督特征学习与决策学习),实现从粗到细的预测优化,并将计算限制在候选区域,从而无需反向传播即可独立优化各模块,显著降低推理阶段的计算成本(FLOPs减少约83%),并在斯坦福地震数据集(STEAD)上实现高精度性能(检测F1=1.0,P波拾取F1=0.98,S波拾取F1=0.96)。

链接: https://arxiv.org/abs/2603.03344
作者: Yixing Wu,Shiou-Ya Wang,Dingyi Nie,Sanket Kumbhar,Yun-Tung Hsieh,Yun-Cheng Wang,Po-Chyi Su,C.-C. Jay Kuo
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Earthquake detection and seismic phase picking are fundamental yet challenging tasks in seismology due to low signal-to-noise ratios, waveform variability, and overlapping events. Recent deep-learning models achieve strong results but rely on large datasets and heavy backpropagation training, raising concerns over efficiency, interpretability, and sustainability. We propose GreenPhase, a multi-resolution, feed-forward, and mathematically interpretable model based on the Green Learning framework. GreenPhase comprises three resolution levels, each integrating unsupervised representation learning, supervised feature learning, and decision learning. Its feed-forward design eliminates backpropagation, enabling independent module optimization with stable training and clear interpretability. Predictions are refined from coarse to fine resolutions while computation is restricted to candidate regions. On the Stanford Earthquake Dataset (STEAD), GreenPhase achieves excellent performance with F1 scores of 1.0 for detection, 0.98 for P-wave picking, and 0.96 for S-wave picking. This is accomplished while reducing the computational cost (FLOPs) for inference by approximately 83% compared to state-of-the-art models. These results demonstrate that the proposed model provides an efficient, interpretable, and sustainable alternative for large-scale seismic monitoring.
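
"由粗到细、仅对候选区域做精细计算"的流程可用如下假设性草图示意;窗口大小、阈值与打分函数均为虚构,与 GreenPhase 的具体模块无关:

```python
import numpy as np

# 示意:粗分辨率筛选候选窗口,仅对候选窗口执行细粒度震相拾取
# (窗口、阈值与打分函数均为虚构示例)

def coarse_to_fine(signal, win, coarse_thresh, fine_score):
    n = len(signal) // win
    coarse = np.array([np.abs(signal[i*win:(i+1)*win]).mean()
                       for i in range(n)])            # 粗层:每窗能量均值
    picks = []
    for i in np.flatnonzero(coarse > coarse_thresh):  # 仅候选窗口进入细层
        seg = signal[i*win:(i+1)*win]
        picks.append(i * win + fine_score(seg))       # 细层:窗内精确定位
    return picks

rng = np.random.default_rng(2)
sig = rng.standard_normal(400) * 0.1
sig[250:260] += 5.0                                   # 模拟一次震相到达
picks = coarse_to_fine(sig, win=100, coarse_thresh=0.3,
                       fine_score=lambda seg: int(np.argmax(np.abs(seg))))
assert len(picks) == 1 and 250 <= picks[0] < 260
```

这种设计使绝大部分样本只经过廉价的粗层计算,与摘要中报告的推理 FLOPs 大幅下降相一致。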

[AI-93] Neuro-Symbolic Decoding of Neural Activity ICLR2026

【速读】:该论文旨在解决功能性磁共振成像(fMRI)解码中概念识别与语义 grounding 的难题,即如何从大脑神经活动模式中准确推断出个体所感知或思考的视觉概念及其相互关系。其解决方案的关键在于提出 NEURONA——一个神经符号框架,通过整合符号推理(symbolic reasoning)与组合执行(compositional execution)机制,并利用 fMRI 信号在不同脑区的结构化表征,实现对视觉刺激中交互概念的精准解码。特别地,该方法引入结构先验(如概念间的谓词-论元依赖关系),显著提升了对精确查询的解码准确率以及对未见查询的泛化能力。

链接: https://arxiv.org/abs/2603.03343
作者: Yanchen Wang,Joy Hsu,Ehsan Adeli,Jiajun Wu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2026. First two authors contributed equally

点击查看摘要

Abstract:We propose NEURONA, a neuro-symbolic framework for fMRI decoding and concept grounding in neural activity. Leveraging image- and video-based fMRI question-answering datasets, NEURONA learns to decode interacting concepts from visual stimuli based on patterns of fMRI responses, integrating symbolic reasoning and compositional execution with fMRI grounding across brain regions. We demonstrate that incorporating structural priors (e.g., compositional predicate-argument dependencies between concepts) into the decoding process significantly improves both decoding accuracy over precise queries, and notably, generalization to unseen queries at test time. With NEURONA, we highlight neuro-symbolic frameworks as promising tools for understanding neural activity.

[AI-94] Cryo-SWAN: the Multi-Scale Wavelet-decomposition-inspired Autoencoder Network for molecular density representation of molecular volumes

【速读】:该论文旨在解决从体素化数据中学习鲁棒的3D形状表征问题,特别是在结构生物学和冷冻电子显微镜(cryo-EM)领域中,其原始数据为体密度图(volumetric density maps),但当前主流3D计算机视觉方法多基于点云、网格或八叉树,导致该格式的数据被相对忽视。解决方案的关键在于提出Cryo-SWAN——一种受多尺度小波分解启发的体素化变分自编码器,其通过条件式的粗到细潜在编码与跨感知尺度的递归残差量化机制,在保留全局几何结构的同时精准捕捉分子密度体积中的高频细节,从而在ModelNet40、BuildingNet及新构建的cryo-EM数据集ProteinNet3D上显著优于现有3D自编码器,并支持基于几何特征的潜在空间组织与扩散模型驱动的去噪及条件形状生成。

链接: https://arxiv.org/abs/2603.03342
作者: Rui Li,Artsemi Yushkevich,Mikhail Kudryashev,Artur Yakimovich
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:Learning robust representations of 3D shapes from voxelized data is essential for advancing AI methods in biomedical imaging. However, most contemporary 3D computer vision approaches operate on point clouds, meshes, or octrees, while volumetric density maps, the native format of structural biology and cryo-EM, remain comparatively underexplored. We present Cryo-SWAN, a voxel-based variational autoencoder inspired by multi-scale wavelet decomposition. The model performs conditional coarse-to-fine latent encoding and recursive residual quantization across perception scales, enabling accurate capture of both global geometry and high-frequency structural detail in molecular density volumes. Evaluated on ModelNet40, BuildingNet, and a newly curated dataset of cryo-EM volumes, ProteinNet3D, Cryo-SWAN consistently improves reconstruction quality over state-of-the-art 3D autoencoders. We demonstrate that the molecular densities organize in learned latent space according to shared geometric features, while integration with diffusion models enables denoising and conditional shape generation. Together, Cryo-SWAN is a practical framework for data-driven structural biology and volumetric imaging.
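
摘要所述"跨感知尺度的递归残差量化"可用如下草图示意:每一级用一个码本量化上一级留下的残差。码本规模、层数与取值均为虚构;各级码本额外含零码字,以保证重构误差逐级单调不增:

```python
import numpy as np

# 示意:递归残差量化,逐级量化残差并累加重构
# (码本与层数为虚构;含零码字保证误差单调不增)

def residual_quantize(x, codebooks):
    residual, codes = x.copy(), []
    for cb in codebooks:                       # 每级量化上一级留下的残差
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, x - residual                 # 码序列与逐级累加的重构

rng = np.random.default_rng(6)
codebooks = [np.vstack([rng.standard_normal((32, 8)) * s, np.zeros(8)])
             for s in (1.0, 0.5, 0.25)]        # 粗到细:码字尺度逐级减半
x = rng.standard_normal(8)
codes, recon = residual_quantize(x, codebooks)
assert len(codes) == 3
assert (np.linalg.norm(x - recon)
        <= np.linalg.norm(x - codebooks[0][codes[0]]))  # 细层不劣于粗层
```

粗层码字捕捉全局几何,细层码字补足高频细节,对应 Cryo-SWAN 从粗到细的潜编码思想。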

机器学习

[LG-0] Accurate and Efficient Hybrid-Ensemble Atmospheric Data Assimilation in Latent Space with Uncertainty Quantification

链接: https://arxiv.org/abs/2603.04395
作者: Hang Fan,Juan Nathaniel,Yi Xiao,Ce Bian,Fenghua Ling,Ben Fei,Lei Bai,Pierre Gentine
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 23 pages, 12 figures

点击查看摘要

Abstract:Data assimilation (DA) combines model forecasts and observations to estimate the optimal state of the atmosphere with its uncertainty, providing initial conditions for weather prediction and reanalyses for climate research. Yet, existing traditional and machine-learning DA methods struggle to achieve accuracy, efficiency and uncertainty quantification simultaneously. Here, we propose HLOBA (Hybrid-Ensemble Latent Observation-Background Assimilation), a three-dimensional hybrid-ensemble DA method that operates in an atmospheric latent space learned via an autoencoder (AE). HLOBA maps both model forecasts and observations into a shared latent space via the AE encoder and an end-to-end Observation-to-Latent-space mapping network (O2Lnet), respectively, and fuses them through a Bayesian update with weights inferred from time-lagged ensemble forecasts. Both idealized and real-observation experiments demonstrate that HLOBA matches dynamically constrained four-dimensional DA methods in both analysis and forecast skill, while achieving end-to-end inference-level efficiency and a theoretical flexibility that applies to any forecasting model. Moreover, by exploiting the error decorrelation property of latent variables, HLOBA enables element-wise uncertainty estimates for its latent analysis and propagates them to model space via the decoder. Idealized experiments show that this uncertainty highlights large-error regions and captures their seasonal variability.
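
摘要提到的"利用潜变量误差去相关性做逐元素贝叶斯更新"可以用标准的逐元素增益形式示意;注意下面的增益公式与维度设定是常见卡尔曼式写法的假设性移植,并非 HLOBA 的原始实现:

```python
import numpy as np

# 示意:潜空间逐元素贝叶斯更新,以时滞集合估计背景方差并与观测融合
# (增益形式为标准卡尔曼式假设;数据为虚构)

def latent_analysis(z_bg_ensemble, z_obs, obs_var):
    z_b = z_bg_ensemble.mean(axis=0)
    bg_var = z_bg_ensemble.var(axis=0, ddof=1)   # 时滞集合给出的背景不确定性
    gain = bg_var / (bg_var + obs_var)           # 逐元素增益(误差已去相关)
    z_a = z_b + gain * (z_obs - z_b)             # 分析 = 背景向观测加权靠拢
    ana_var = (1.0 - gain) * bg_var              # 分析不确定性,可经解码器传播
    return z_a, ana_var

rng = np.random.default_rng(3)
ens = rng.standard_normal((8, 16)) * 0.5 + 1.0   # 8 个时滞集合成员、16 维潜变量
z_a, var_a = latent_analysis(ens, z_obs=np.zeros(16), obs_var=0.1)
assert z_a.shape == (16,)
assert np.all(var_a <= ens.var(axis=0, ddof=1))  # 融合后不确定性不增
```

潜变量间误差近似独立,才使得这种逐元素(而非全协方差)更新成为合理近似。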

[LG-1] Robust Unscented Kalman Filtering via Recurrent Meta-Adaptation of Sigma-Point Weights

链接: https://arxiv.org/abs/2603.04360
作者: Kenan Majewski,Michał Modzelewski,Marcin Żugaj,Piotr Lichota
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 8 pages, 3 figures, Submitted to the 29th International Conference on Information Fusion (FUSION 2026)

点击查看摘要

Abstract:The Unscented Kalman Filter (UKF) is a ubiquitous tool for nonlinear state estimation; however, its performance is limited by the static parameterization of the Unscented Transform (UT). Conventional weighting schemes, governed by fixed scaling parameters, assume implicit Gaussianity and fail to adapt to time-varying dynamics or heavy-tailed measurement noise. This work introduces the Meta-Adaptive UKF (MA-UKF), a framework that reformulates sigma-point weight synthesis as a hyperparameter optimization problem addressed via memory-augmented meta-learning. Unlike standard adaptive filters that rely on instantaneous heuristic corrections, our approach employs a Recurrent Context Encoder to compress the history of measurement innovations into a compact latent embedding. This embedding informs a policy network that dynamically synthesizes the mean and covariance weights of the sigma points at each time step, effectively governing the filter’s trust in the prediction versus the measurement. By optimizing the system end-to-end through the filter’s recursive logic, the MA-UKF learns to maximize tracking accuracy while maintaining estimation consistency. Numerical benchmarks on maneuvering targets demonstrate that the MA-UKF significantly outperforms standard baselines, exhibiting superior robustness to non-Gaussian glint noise and effective generalization to out-of-distribution (OOD) dynamic regimes unseen during training.
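
MA-UKF 要动态替换的对象,是标准无迹变换(UT)中由固定缩放参数决定的静态 sigma 点权重。下面给出教科书式的权重公式作参照(这是通用 UT 公式,并非论文中策略网络的实现):

```python
import numpy as np

# 示意:标准 UT 的 sigma 点均值/协方差权重,由 alpha、beta、kappa 静态决定
# (MA-UKF 的核心即是用策略网络在每个时间步动态合成这组权重)

def ut_weights(n, alpha=1e-3, beta=2.0, kappa=0.0):
    lam = alpha**2 * (n + kappa) - n
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))   # 其余 2n 个点的权重
    wc = wm.copy()
    wm[0] = lam / (n + lam)                          # 中心点的均值权重
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)  # 中心点的协方差权重
    return wm, wc

wm, wc = ut_weights(n=4)
assert np.isclose(wm.sum(), 1.0)   # 均值权重归一化
assert len(wm) == 2 * 4 + 1
```

固定的 alpha、beta、kappa 隐含了高斯性假设;MA-UKF 以循环上下文编码驱动的策略网络逐步输出 wm、wc,从而摆脱这种静态参数化。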

[LG-2] Out-of-distribution transfer of PDE foundation models to material dynamics under extreme loading

链接: https://arxiv.org/abs/2603.04354
作者: Mahindra Rautela,Alexander Most,Siddharth Mansingh,Aleksandra Pachalieva,Bradley Love,Daniel O Malley,Alexander Scheinker,Kyle Hickmann,Diane Oyen,Nathan Debardeleben,Earl Lawrence,Ayan Biswas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most PDE foundation models are pretrained and fine-tuned on fluid-centric benchmarks. Their utility under extreme-loading material dynamics remains unclear. We benchmark out-of-distribution transfer on two discontinuity-dominated regimes in which shocks, evolving interfaces, and fracture produce highly non-smooth fields: shock-driven multi-material interface dynamics (perturbed layered interface or PLI) and dynamic fracture/failure evolution (FRAC). We formulate the downstream task as terminal-state prediction, i.e., learning a long-horizon map that predicts the final state directly from the first snapshot without intermediate supervision. Using a unified training and evaluation protocol, we evaluate two open-source pretrained PDE foundation models, POSEIDON and MORPH, and compare fine-tuning from pretrained weights against training from scratch across training-set sizes to quantify sample efficiency under distribution shift.

[LG-3] A Constrained RL Approach for Cost-Efficient Delivery of Latency-Sensitive Applications

链接: https://arxiv.org/abs/2603.04353
作者: Ozan Aygün,Vincenzo Norman Vitale,Antonia M. Tulino,Hao Feng,Elza Erkip,Jaime Llorca
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures, accepted for publication in 2025 59th Asilomar Conference on Signals, Systems, and Computers

点击查看摘要

Abstract:Next-generation networks aim to provide performance guarantees to real-time interactive services that require timely and cost-efficient packet delivery. In this context, the goal is to reliably deliver packets with strict deadlines imposed by the application while minimizing overall resource allocation cost. A large body of work has leveraged stochastic optimization techniques to design efficient dynamic routing and scheduling solutions under average delay constraints; however, these methods fall short when faced with strict per-packet delay requirements. We formulate the minimum-cost delay-constrained network control problem as a constrained Markov decision process and utilize constrained deep reinforcement learning (CDRL) techniques to effectively minimize total resource allocation cost while maintaining timely throughput above a target reliability level. Results indicate that the proposed CDRL-based solution can ensure timely packet delivery even when existing baselines fall short, and it achieves lower cost compared to other throughput-maximizing methods.
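
约束 MDP 常用拉格朗日原始-对偶方法处理:在最小化资源成本的同时,用对偶变量 λ 惩罚"及时吞吐低于可靠性目标"的违约量。下面是一个玩具化的假设性草图(吞吐对 λ 的响应函数纯属虚构):

```python
import math

# 示意:约束 MDP 的对偶上升,违约则增大 λ,直至可靠性目标被满足
# (throughput_fn 为虚构玩具模型,非论文的 CDRL 训练循环)

def dual_ascent(throughput_fn, target, steps=500, lr=0.5):
    lam = 0.0
    for _ in range(steps):
        tp = throughput_fn(lam)                    # λ 越大,策略越重视时延约束
        lam = max(0.0, lam + lr * (target - tp))   # 投影对偶上升
    return lam, throughput_fn(lam)

# 玩具模型:及时吞吐随 λ 单调上升并饱和于 1.0
toy_tp = lambda lam: 1.0 - 0.6 * math.exp(-lam)
lam, tp = dual_ascent(toy_tp, target=0.9)
assert tp >= 0.89          # 收敛到可靠性目标附近
```

在实际的 CDRL 实现中,内层以含 λ 惩罚的奖励训练策略,外层按同样的违约信号更新 λ。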

[LG-4] Algorithmic Compliance and Regulatory Loss in Digital Assets

链接: https://arxiv.org/abs/2603.04328
作者: Khem Raj Bhatt,Krishna Sharma
类目: Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:We study the deployment performance of machine learning based enforcement systems used in cryptocurrency anti money laundering (AML). Using forward looking and rolling evaluations on Bitcoin transaction data, we show that strong static classification metrics substantially overstate real world regulatory effectiveness. Temporal nonstationarity induces pronounced instability in cost sensitive enforcement thresholds, generating large and persistent excess regulatory losses relative to dynamically optimal benchmarks. The core failure arises from miscalibration of decision rules rather than from declining predictive accuracy per se. These findings underscore the fragility of fixed AML enforcement policies in evolving digital asset markets and motivate loss-based evaluation frameworks for regulatory oversight.
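
"失效源于判决阈值失准而非预测精度下降"可以用成本敏感阈值的最小例子说明:给定误报与漏报成本,对报警概率的贝叶斯最优阈值有闭式解。下面的成本取值与样本均为虚构示例:

```python
# 示意:成本敏感的贝叶斯最优报警阈值与对应的监管损失
# (成本与样本均为虚构;非论文的评估流程)

def bayes_threshold(c_fp, c_fn):
    """对 P(可疑交易) 的最优阈值:漏报成本越高,阈值越低。"""
    return c_fp / (c_fp + c_fn)

def regulatory_loss(probs, labels, thresh, c_fp, c_fn):
    loss = 0.0
    for p, y in zip(probs, labels):
        flag = p >= thresh
        loss += c_fp * (flag and y == 0)          # 误报成本
        loss += c_fn * ((not flag) and y == 1)    # 漏报成本
    return loss

probs = [0.05, 0.2, 0.4, 0.9]
labels = [0, 1, 0, 1]
t = bayes_threshold(c_fp=1.0, c_fn=9.0)           # -> 0.1
assert abs(t - 0.1) < 1e-12
assert regulatory_loss(probs, labels, t, 1.0, 9.0) == 1.0  # 仅一次误报
```

论文的要点在于:当交易分布随时间漂移时,按历史数据定下的固定阈值会偏离当期的最优阈值,即便分类器精度本身未明显下降,监管损失仍会持续超额。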

[LG-5] PTOPOFL: Privacy-Preserving Personalised Federated Learning via Persistent Homology

链接: https://arxiv.org/abs/2603.04323
作者: Kelly L Vomo-Donfack,Adryel Hoszu,Grégory Ginot,Ian Morilla
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Algebraic Topology (math.AT); Machine Learning (stat.ML)
*备注: 22 pages, 6 Figures

点击查看摘要

Abstract:Federated learning (FL) faces two structural tensions: gradient sharing enables data-reconstruction attacks, while non-IID client distributions degrade aggregation quality. We introduce PTOPOFL, a framework that addresses both challenges simultaneously by replacing gradient communication with topological descriptors derived from persistent homology (PH). Clients transmit only 48-dimensional PH feature vectors, compact shape summaries whose many-to-one structure makes inversion provably ill-posed, rather than model gradients. The server performs topology-guided personalised aggregation: clients are clustered by Wasserstein similarity between their PH diagrams, intra-cluster models are topology-weighted, and clusters are blended with a global consensus. We prove an information-contraction theorem showing that PH descriptors leak strictly less mutual information per sample than gradients under strongly convex loss functions, and we establish linear convergence of the Wasserstein-weighted aggregation scheme with an error floor strictly smaller than FedAvg. Evaluated against FedAvg, FedProx, SCAFFOLD, and pFedMe on a non-IID healthcare scenario (8 hospitals, 2 adversarial) and a pathological benchmark (10 clients), PTOPOFL achieves AUC 0.841 and 0.910 respectively, the highest in both settings, while reducing reconstruction risk by a factor of 4.5 relative to gradient sharing. Code is publicly available at this https URL and data at this https URL.

[LG-6] LUMINA: Foundation Models for Topology Transferable ACOPF

链接: https://arxiv.org/abs/2603.04300
作者: Yijiang Li,Zeeshan Memon,Hongwei Jin,Stefano Fenu,Keunju Song,Sunash B Sharma,Parfait Gasana,Hongseok Kim,Liang Zhao,Kibaek Kim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models in general promise to accelerate scientific computation by learning reusable representations across problem instances, yet constrained scientific systems, where predictions must satisfy physical laws and safety limits, pose unique challenges that stress conventional training paradigms. We derive design principles for constrained scientific foundation models through systematic investigation of AC optimal power flow (ACOPF), a representative optimization problem in power grid operations where power balance equations and operational constraints are non-negotiable. Through controlled experiments spanning architectures, training objectives, and system diversity, we extract three empirically grounded principles governing scientific foundation model design. These principles characterize three design trade-offs: learning physics-invariant representations while respecting system-specific constraints, optimizing accuracy while ensuring constraint satisfaction, and ensuring reliability in high-impact operating regimes. We present the LUMINA framework, including data processing and training pipelines to support reproducible research on physics-informed, feasibility-aware foundation models across scientific applications.

[LG-7] Beyond Edge Deletion: A Comprehensive Approach to Counterfactual Explanation in Graph Neural Networks

链接: https://arxiv.org/abs/2603.04209
作者: Matteo De Sanctis,Riccardo De Sanctis,Stefano Faralli,Paola Velardi,Bardh Prenkaj
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are increasingly adopted across domains such as molecular biology and social network analysis, yet their black-box nature hinders interpretability and trust. This is especially problematic in high-stakes applications, such as predicting molecule toxicity, drug discovery, or guiding financial fraud detection, where transparent explanations are essential. Counterfactual explanations, minimal changes that flip a model's prediction, offer a transparent lens into GNNs' behavior. In this work, we introduce XPlore, a novel technique that significantly broadens the counterfactual search space. It consists of gradient-guided perturbations to adjacency and node feature matrices. Unlike most prior methods, which focus solely on edge deletions, our approach belongs to the growing class of techniques that optimize edge insertions and node-feature perturbations, here jointly performed under a unified gradient-based framework, enabling a richer and more nuanced exploration of counterfactuals. To quantify both structural and semantic fidelity, we introduce a cosine similarity metric for learned graph embeddings that addresses a key limitation of traditional distance-based metrics, and demonstrate that XPlore produces more coherent and minimal counterfactuals. Empirical results on 13 real-world and 5 synthetic benchmarks show up to +56.3% improvement in validity and +52.8% in fidelity over state-of-the-art baselines, while retaining competitive runtime.
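
"以学得图嵌入的余弦相似度度量反事实与原图的语义接近程度"可以用下面的最小草图示意;其中用随机线性投影加均值池化代替真实 GNN 嵌入,纯属演示假设:

```python
import numpy as np

# 示意:图级嵌入的余弦相似度,反事实应在语义上接近原图
# (embed 以随机投影+均值池化代替真实 GNN,仅为假设性演示)

def embed(node_feats, proj):
    return (node_feats @ proj).mean(axis=0)        # 均值池化得到图级嵌入

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(4)
proj = rng.standard_normal((5, 8))
g = rng.standard_normal((6, 5))                    # 原图的节点特征(6 节点、5 维)
g_cf = g.copy(); g_cf[0] += 0.1                    # 微小的反事实扰动
g_far = rng.standard_normal((6, 5))                # 无关的图
assert (cosine_sim(embed(g, proj), embed(g_cf, proj))
        > cosine_sim(embed(g, proj), embed(g_far, proj)))
```

与欧氏距离不同,余弦相似度对嵌入尺度不敏感,这正是论文用它替代传统距离度量的动机之一。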

[LG-8] REDNET-ML: A Multi-Sensor Machine Learning Pipeline for Harmful Algal Bloom Risk Detection Along the Omani Coast

链接: https://arxiv.org/abs/2603.04181
作者: Ameer Alhashemi
类目: Machine Learning (cs.LG)
*备注: 11 pages

点击查看摘要

Abstract:Harmful algal blooms (HABs) can threaten coastal infrastructure, fisheries, and desalination-dependent water supplies. This project (REDNET-ML) develops a reproducible machine learning pipeline for HAB risk detection along the Omani coastline using multi-sensor satellite data and non-leaky evaluation. The system fuses (i) Sentinel-2 optical chips (high spatial resolution) processed into spectral indices and texture signals, (ii) MODIS Level-3 ocean color and thermal indicators, and (iii) learned image evidence from object detectors trained to highlight bloom-like patterns. A compact decision fusion model (CatBoost) integrates these signals into a calibrated probability of HAB risk, which is then consumed by an end-to-end inference workflow and a risk field viewer that supports operational exploration by site (plant) and time. The report documents the motivation, related work, methodological choices (including label mining and strict split strategies), implementation details, and a critical evaluation using AUROC/AUPRC, confusion matrices, calibration curves, and drift analyses that quantify distribution shift in recent years.

[LG-9] Learning Hip Exoskeleton Control Policy via Predictive Neuromusculoskeletal Simulation

链接: https://arxiv.org/abs/2603.04166
作者: Ilseung Park,Changseob Song,Inseung Kang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Developing exoskeleton controllers that generalize across diverse locomotor conditions typically requires extensive motion-capture data and biomechanical labeling, limiting scalability beyond instrumented laboratory settings. Here, we present a physics-based neuromusculoskeletal learning framework that trains a hip-exoskeleton control policy entirely in simulation, without motion-capture demonstrations, and deploys it on hardware via policy distillation. A reinforcement learning teacher policy is trained using a muscle-synergy action prior over a wide range of walking speeds and slopes through a two-stage curriculum, enabling direct comparison between assisted and no-exoskeleton conditions. In simulation, exoskeleton assistance reduces mean muscle activation by up to 3.4% and mean positive joint power by up to 7.0% on level ground and ramp ascent, with benefits increasing systematically with walking speed. On hardware, the assistance profiles learned in simulation are preserved across matched speed-slope conditions (r: 0.82, RMSE: 0.03 Nm/kg), providing quantitative evidence of sim-to-real transfer without additional hardware tuning. These results demonstrate that physics-based neuromusculoskeletal simulation can serve as a practical and scalable foundation for exoskeleton controller development, substantially reducing experimental burden during the design phase.

[LG-10] A Multi-Agent Framework for Interpreting Multivariate Physiological Time Series

链接: https://arxiv.org/abs/2603.04142
作者: Davide Gabrielli,Paola Velardi,Stefano Faralli,Bardh Prenkaj
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continuous physiological monitoring is central to emergency care, yet deploying trustworthy AI is challenging. While LLMs can translate complex physiological signals into clinical narratives, it is unclear how agentic systems perform relative to zero-shot inference. To address these questions, we present Vivaldi, a role-structured multi-agent system that explains multivariate physiological time series. Due to regulatory constraints that preclude live deployment, we instantiate Vivaldi in a controlled, clinical pilot with a small, highly qualified cohort of emergency medicine experts, whose evaluations reveal a context-dependent picture that contrasts with prevailing assumptions that agentic reasoning uniformly improves performance. Our experiments show that agentic pipelines substantially benefit non-thinking and medically fine-tuned models, improving expert-rated explanation justification and relevance by +6.9 and +9.7 points, respectively. By contrast, for thinking models, agentic orchestration often degrades explanation quality, including a 14-point drop in relevance, while improving diagnostic precision (ESI F1 +3.6). We also find that explicit tool-based computation is decisive for codifiable clinical metrics, whereas subjective targets, such as pain scores and length of stay, show limited or inconsistent changes. Expert evaluation further indicates that gains in clinical utility depend on visualization conventions, with medically specialized models achieving the most favorable trade-offs between utility and clarity. Together, these findings show that the value of agentic AI lies in the selective externalization of computation and structure rather than in maximal reasoning complexity, and highlight concrete design trade-offs and learned lessons, broadly applicable to explainable AI in safety-critical healthcare settings.

[LG-11] InstMeter: An Instruction-Level Method to Predict Energy and Latency of DL Model Inference on MCUs

链接: https://arxiv.org/abs/2603.04134
作者: Hao Liu,Qing Wang,Marco Zuniga
类目: Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Deep learning (DL) models can now run on microcontrollers (MCUs). Through neural architecture search (NAS), we can search DL models that meet the constraints of MCUs. Among various constraints, energy and latency costs of the model inference are critical metrics. To predict them, existing research relies on coarse proxies such as multiply-accumulations (MACs) and model's input parameters, often resulting in inaccurate predictions or requiring extensive data collection. In this paper, we propose InstMeter, a predictor leveraging MCUs' clock cycles to accurately estimate the energy and latency of DL models. Clock cycles are fundamental metrics reflecting MCU operations, directly determining energy and latency costs. Furthermore, a unique property of our predictor is its strong linearity, allowing it to be simple and accurate. We thoroughly evaluate InstMeter under different scenarios, MCUs, and software settings. Compared with state-of-the-art studies, InstMeter can reduce the energy and latency prediction errors by $3\times$ and $6.5\times$, respectively, while requiring $100\times$ and $10\times$ less training data. In the NAS scenario, InstMeter can fully exploit the energy budget, identifying optimal DL models with higher inference accuracy. We also evaluate InstMeter's generalization performance through various experiments on three ARM MCUs (Cortex-M4, M7, M33) and one RISC-V-based MCU (ESP32-C3), different compilation options (-Os, -O2), GCC versions (v7.3, v10.3), application scenarios (keyword spotting, image recognition), dynamic voltage and frequency scaling, temperatures (21°C, 43°C), and software settings (TFLMv2.4, TFLMvCI). We will open our source codes and the MCU-specific benchmark datasets.
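
摘要强调预测器的"强线性":能耗与时延近似是时钟周期数的线性函数。下面用虚构数据给出这一性质的最小示范(真实特征来自 MCU 指令级周期统计,此处数据纯属演示):

```python
import numpy as np

# 示意:以时钟周期线性回归预测能耗,验证"强线性"即可高精度拟合
# (数据为虚构示例,非论文测量值)

cycles = np.array([1e6, 2e6, 4e6, 8e6])              # 推理所需时钟周期
energy_mj = np.array([1.3, 2.5, 4.9, 9.7])           # 对应能耗(毫焦,虚构)
A = np.vstack([cycles, np.ones_like(cycles)]).T      # 设计矩阵 [cycles, 1]
coef, _, _, _ = np.linalg.lstsq(A, energy_mj, rcond=None)
pred = A @ coef
mape = np.mean(np.abs(pred - energy_mj) / energy_mj)
assert mape < 0.05                                   # 线性模型即可近乎完美拟合
```

正是这种线性性使 InstMeter 只需很少的标定数据即可泛化;相比之下,以 MACs 为代理的预测器对访存、流水线等效应不敏感,误差更大。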

[LG-12] Two-Stage Photovoltaic Forecasting: Separating Weather Prediction from Plant-Characteristics

链接: https://arxiv.org/abs/2603.04132
作者: Philipp Danner,Hermann de Meer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Several energy management applications rely on accurate photovoltaic generation forecasts. Common metrics like mean absolute error or root-mean-square error omit error-distribution details needed for stochastic optimization. In addition, several approaches use weather forecasts as inputs without analyzing the source of the prediction error. To overcome this gap, we decompose forecasting into a weather forecast model for environmental parameters such as solar irradiance and temperature and a plant characteristic model that captures site-specific parameters like panel orientation, temperature influence, or regular shading. Satellite-based weather observation serves as an intermediate layer. We analyze the error distribution of the high-resolution rapid-refresh numerical weather prediction model that covers the United States as a black-box model for weather forecasting and train an ensemble of neural networks on historical power output data for the plant characteristic model. Results show mean absolute error increases by 11% and 68% for two selected photovoltaic systems when using weather forecasts instead of satellite-based ground-truth weather observations as a perfect forecast. The generalized hyperbolic and Student’s t distributions adequately fit the forecast errors across lead times.
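
摘要把总预测误差拆解为“天气预报误差”与“电站特性模型误差”两个来源。下面的示意(电站模型、噪声水平与数据均为虚构)复现这一思路:同一个电站特性模型,输入从卫星观测换成带误差的天气预报后,MAE 被天气误差显著放大:

```python
import numpy as np

rng = np.random.default_rng(1)
irradiance = rng.uniform(0.0, 1000.0, size=500)           # satellite-observed W/m^2
forecast = irradiance + rng.normal(0.0, 120.0, size=500)  # NWP forecast with error

def plant_model(irr, efficiency=0.18, area=10.0):
    # Hypothetical plant-characteristics model: power [W] from irradiance.
    return efficiency * area * np.clip(irr, 0.0, None)

# True power = plant model on observed weather + plant-model residual noise.
true_power = plant_model(irradiance) + rng.normal(0.0, 20.0, size=500)

mae_obs = np.mean(np.abs(plant_model(irradiance) - true_power))  # perfect weather
mae_fc = np.mean(np.abs(plant_model(forecast) - true_power))     # forecast weather
```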

[LG-13] FastWave: Optimized Diffusion Model for Audio Super-Resolution

链接: https://arxiv.org/abs/2603.04122
作者: Nikita Kuznetsov,Maksim Kaledin
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Audio Super-Resolution is a set of techniques aimed at high-quality estimation of the given signal as if it would be sampled with higher sample rate. Among suggested methods there are diffusion and flow models (which are considered slower), generative adversarial networks (which are considered faster), however both approaches are currently presented by high-parametric networks, requiring high computational costs both for training and inference. We propose a solution to both these problems by re-considering the recent advances in the training of diffusion models and applying them to super-resolution from any to 48 kHz sample rate. Our approach shows better results than NU-Wave 2 and is comparable to state-of-the-art models. Our model called FastWave has around 50 GFLOPs of computational complexity and 1.3 M parameters and can be trained with less resources and significantly faster than the majority of recently proposed diffusion- and flow-based solutions. The code has been made publicly available.

[LG-14] When to restart? Exploring escalating restarts on convergence ICLR2026

链接: https://arxiv.org/abs/2603.04117
作者: Ayush K. Varshney,Šarūnas Girdzijauskas,Konstantinos Vandikas,Aneta Vulgarakis Feljan
类目: Machine Learning (cs.LG)
*备注: Paper accepted in Sci4DL workshop in ICLR 2026. this https URL

点击查看摘要

Abstract:Learning rate scheduling plays a critical role in the optimization of deep neural networks, directly influencing convergence speed, stability, and generalization. While existing schedulers such as cosine annealing, cyclical learning rates, and warm restarts have shown promise, they often rely on fixed or periodic triggers that are agnostic to the training dynamics, such as stagnation or convergence behavior. In this work, we propose a simple yet effective strategy, which we call Stochastic Gradient Descent with Escalating Restarts (SGD-ER). It adaptively increases the learning rate upon convergence. Our method monitors training progress and triggers restarts when stagnation is detected, linearly escalating the learning rate to escape sharp local minima and explore flatter regions of the loss landscape. We evaluate SGD-ER across CIFAR-10, CIFAR-100, and TinyImageNet on a range of architectures including ResNet-18/34/50, VGG-16, and DenseNet-101. Compared to standard schedulers, SGD-ER improves test accuracy by 0.5-4.5%, demonstrating the benefit of convergence-aware escalating restarts for better local optima.
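
基于摘要描述的“检测停滞后线性抬升学习率”的思想,下面给出一个假设性的调度器草图(阈值与参数均为示意,非论文官方实现):

```python
class EscalatingRestartScheduler:
    """Sketch of a convergence-aware restart schedule (illustrative only):
    if the best loss has not improved for `patience` steps, restart by
    escalating the learning rate linearly with the restart count."""

    def __init__(self, base_lr=0.1, decay=0.95, patience=5, min_delta=1e-4):
        self.base_lr, self.decay = base_lr, decay
        self.patience, self.min_delta = patience, min_delta
        self.lr = base_lr
        self.best = float("inf")
        self.stall = 0
        self.restarts = 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best, self.stall = loss, 0
        else:
            self.stall += 1
        if self.stall >= self.patience:                   # stagnation detected
            self.restarts += 1
            self.lr = self.base_lr * (1 + self.restarts)  # escalating restart
            self.stall = 0
        else:
            self.lr *= self.decay                         # otherwise anneal
        return self.lr
```

用法示意:每个 epoch 结束时调用 `sched.step(val_loss)`,返回下一步的学习率;损失停滞时学习率会被抬到比初始值更高的水平,以帮助跳出尖锐的局部极小。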

[LG-15] Reducing hyperparameter sensitivity in measurement-feedback based Ising machines

链接: https://arxiv.org/abs/2603.04093
作者: Toon Sevenants,Guy Van der Sande,Guy Verschaffelt
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:Analog Ising machines have been proposed as heuristic hardware solvers for combinatorial optimization problems, with the potential to outperform conventional approaches, provided that their hyperparameters are carefully tuned. Their temporal evolution is often described using time-continuous dynamics. However, most experimental implementations rely on measurement-feedback architectures that operate in a time-discrete manner. We observe that in such setups, the range of effective hyperparameters is substantially smaller than in the envisioned time-continuous analog Ising machine. In this paper, we analyze this discrepancy and discuss its impact on the practical operation of Ising machines. Next, we propose and experimentally verify a method to reduce the sensitivity to hyperparameter selection of these measurement-feedback architectures.

[LG-16] FedCova: Robust Federated Covariance Learning Against Noisy Labels

链接: https://arxiv.org/abs/2603.04062
作者: Xiangyu Zhong,Xiaojun Yuan,Ying-Jun Angela Zhang
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Noisy labels in distributed datasets induce severe local overfitting and consequently compromise the global model in federated learning (FL). Most existing solutions rely on selecting clean devices or aligning with public clean datasets, rather than endowing the model itself with robustness. In this paper, we propose FedCova, a dependency-free federated covariance learning framework that eliminates such external reliances by enhancing the model’s intrinsic robustness via a new perspective on feature covariances. Specifically, FedCova encodes data into a discriminative but resilient feature space to tolerate label noise. Built on mutual information maximization, we design a novel objective for federated lossy feature encoding that relies solely on class feature covariances with an error tolerance term. Leveraging feature subspaces characterized by covariances, we construct a subspace-augmented federated classifier. FedCova unifies three key processes through the covariance: (1) training the network for feature encoding, (2) constructing a classifier directly from the learned features, and (3) correcting noisy labels based on feature subspaces. We implement FedCova across both symmetric and asymmetric noisy settings under heterogeneous data distribution. Experimental results on CIFAR-10/100 and real-world noisy dataset Clothing1M demonstrate the superior robustness of FedCova compared with the state-of-the-art methods.

[LG-17] mlx-vis: GPU-Accelerated Dimensionality Reduction and Visualization on Apple Silicon

链接: https://arxiv.org/abs/2603.04035
作者: Han Xiao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:mlx-vis is a Python library that implements six dimensionality reduction methods and a k-nearest neighbor graph algorithm entirely in MLX, Apple’s array framework for Apple Silicon. The library provides UMAP, t-SNE, PaCMAP, TriMap, DREAMS, CNE, and NNDescent, all executing on Metal GPU through a unified fit_transform interface. Beyond embedding computation, mlx-vis includes a GPU-accelerated circle-splatting renderer that produces scatter plots and smooth animations without matplotlib, composing frames via scatter-add alpha blending on GPU and piping them to hardware H.264 encoding. On Fashion-MNIST with 70,000 points, all methods complete embedding in 2.1-3.8 seconds and render 800-frame animations in 1.4 seconds on an M3 Ultra, with the full pipeline from raw data to rendered video finishing in 3.6-5.2 seconds. The library depends only on MLX and NumPy, is released under the Apache 2.0 license, and is available at this https URL.

[LG-18] Multi-Stage Music Source Restoration with BandSplit-RoFormer Separation and HiFi GAN ICASSP2026

链接: https://arxiv.org/abs/2603.04032
作者: Tobias Morocutti,Emmanouil Karystinaios,Jonathan Greif,Gerhard Widmer
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: ICASSP 2026 Music Source Restoration (MSR) Challenge

点击查看摘要

Abstract:Music Source Restoration (MSR) targets recovery of original, unprocessed instrument stems from fully mixed and mastered audio, where production effects and distribution artifacts violate common linear-mixture assumptions. This technical report presents the CP-JKU team’s system for the MSR ICASSP Challenge 2025. Our approach decomposes MSR into separation and restoration. First, a single BandSplit-RoFormer separator predicts eight stems plus an auxiliary other stem, and is trained with a three-stage curriculum that progresses from 4-stem warm-start fine-tuning (with LoRA) to 8-stem extension via head expansion. Second, we apply a HiFi++ GAN waveform restorer trained as a generalist and then specialized into eight instrument-specific experts.

[LG-19] Continuous Modal Logical Neural Networks: Modal Reasoning via Stochastic Accessibility

链接: https://arxiv.org/abs/2603.04019
作者: Antonin Sulc
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, 20th INTERNATIONAL CONFERENCE ON NEUROSYMBOLIC LEARNING AND REASONING

点击查看摘要

Abstract:We propose Fluid Logic, a paradigm in which modal logical reasoning (temporal, epistemic, doxastic, deontic) is lifted from discrete Kripke structures to continuous manifolds via Neural Stochastic Differential Equations (Neural SDEs). Each type of modal operator is backed by a dedicated Neural SDE, and nested formulas compose these SDEs in a single differentiable graph. A key instantiation is Logic-Informed Neural Networks (LINNs): analogous to Physics-Informed Neural Networks (PINNs), LINNs embed modal logical formulas such as (\Box bounded) and (\Diamond visits_lobe) directly into the training loss, guiding neural networks to produce solutions that are structurally consistent with prescribed logical properties, without requiring knowledge of the governing equations. The resulting framework, Continuous Modal Logical Neural Networks (CMLNNs), yields several key properties: (i) stochastic diffusion prevents quantifier collapse (\Box and \Diamond differ), unlike deterministic ODEs; (ii) modal operators are entropic risk measures, sound with respect to risk-based semantics with explicit Monte Carlo concentration guarantees; (iii) SDE-induced accessibility provides structural correspondence with classical modal axioms; (iv) parameterizing accessibility through dynamics reduces memory from quadratic in world count to linear in parameters. Three case studies demonstrate that Fluid Logic and LINNs can guide neural networks to produce consistent solutions across diverse domains: epistemic/doxastic logic (multi-robot hallucination detection), temporal logic (recovering the Lorenz attractor geometry from logical constraints alone), and deontic logic (learning safe confinement dynamics from a logical specification).

[LG-20] Fixed-Budget Constrained Best Arm Identification in Grouped Bandits

链接: https://arxiv.org/abs/2603.04007
作者: Raunak Mukherjee(1),Sharayu Moharir(1) ((1) Indian Institute of Technology, Bombay)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 2 Figures

点击查看摘要

Abstract:We study fixed budget constrained best-arm identification in grouped bandits, where each arm consists of multiple independent attributes with stochastic rewards. An arm is considered feasible only if all its attributes’ means are above a given threshold. The aim is to find the feasible arm with the largest overall mean. We first derive a lower bound on the error probability for any algorithm on this setting. We then propose Feasibility Constrained Successive Rejects (FCSR), a novel algorithm that identifies the best arm while ensuring feasibility. We show it attains optimal dependence on problem parameters up to constant factors in the exponent. Empirically, FCSR outperforms natural baselines while preserving feasibility guarantees.
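
下面是一个带可行性过滤的逐次淘汰(successive rejects)简化示意,模拟摘要中的分组老虎机设定:每个臂由多个 Bernoulli 属性组成,估计均值低于阈值的臂视为不可行,每一轮淘汰得分最差(或看起来不可行)的臂。这只是按摘要思路写的草图,并非论文中 FCSR 的精确算法:

```python
import numpy as np

def fcsr_sketch(attr_means, threshold, budget, rng):
    """Simplified successive rejects with a feasibility filter (illustrative).
    attr_means: (n_arms, n_attrs) true Bernoulli means.  An arm is feasible
    iff every attribute mean >= threshold; among feasible arms we seek the
    largest overall (average) mean."""
    n_arms, n_attrs = attr_means.shape
    alive = list(range(n_arms))
    sums = np.zeros((n_arms, n_attrs))
    pulls = np.zeros(n_arms)
    per_phase = budget // (n_arms - 1)
    for _ in range(n_arms - 1):
        per_arm = max(1, per_phase // len(alive))
        for a in alive:
            draws = rng.random((per_arm, n_attrs)) < attr_means[a]
            sums[a] += draws.sum(axis=0)
            pulls[a] += per_arm
        est = sums[alive] / pulls[alive][:, None]
        feasible = (est >= threshold).all(axis=1)
        scores = np.where(feasible, est.mean(axis=1), -np.inf)
        alive.pop(int(np.argmin(scores)))   # reject worst / infeasible-looking
    return alive[0]

# Fictitious instance: arm 1 has high overall mean but an infeasible attribute.
means = np.array([[0.9, 0.9], [0.95, 0.2], [0.6, 0.6], [0.5, 0.5]])
best_arm = fcsr_sketch(means, threshold=0.45, budget=6000,
                       rng=np.random.default_rng(0))
```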

[LG-21] Training-Free Rate-Distortion-Perception Traversal With Diffusion

链接: https://arxiv.org/abs/2603.04005
作者: Yuhan Wang,Suzhi Bi,Ying-Jun Angela Zhang
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 40 pages, 17 figures

点击查看摘要

Abstract:The rate-distortion-perception (RDP) tradeoff characterizes the fundamental limits of lossy compression by jointly considering bitrate, reconstruction fidelity, and perceptual quality. While recent neural compression methods have improved perceptual performance, they typically operate at fixed points on the RDP surface, requiring retraining to target different tradeoffs. In this work, we propose a training-free framework that leverages pre-trained diffusion models to traverse the entire RDP surface. Our approach integrates a reverse channel coding (RCC) module with a novel score-scaled probability flow ODE decoder. We theoretically prove that the proposed diffusion decoder is optimal for the distortion-perception tradeoff under AWGN observations and that the overall framework with the RCC module achieves the optimal RDP function in the Gaussian case. Empirical results across multiple datasets demonstrate the framework’s flexibility and effectiveness in navigating the ternary RDP tradeoff using pre-trained diffusion models. Our results establish a practical and theoretically grounded approach to adaptive, perception-aware compression.

[LG-22] On the Learnability of Offline Model-Based Optimization: A Ranking Perspective

链接: https://arxiv.org/abs/2603.04000
作者: Shen-Huan Lyu,Rong-Xi Tan,Ke Xue,Yi-Xiao He,Yu Huang,Qingfu Zhang,Chao Qian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline model-based optimization (MBO) seeks to discover high-performing designs using only a fixed dataset of past evaluations. Most existing methods rely on learning a surrogate model via regression and implicitly assume that good predictive accuracy leads to good optimization performance. In this work, we challenge this assumption and study offline MBO from a learnability perspective. We argue that offline optimization is fundamentally a problem of ranking high-quality designs rather than accurate value prediction. Specifically, we introduce an optimization-oriented risk based on ranking between near-optimal and suboptimal designs, and develop a unified theoretical framework that connects surrogate learning to final optimization. We prove the theoretical advantages of ranking over regression, and identify distributional mismatch between the training data and near-optimal designs as the dominant error. Inspired by this, we design a distribution-aware ranking method to reduce this mismatch. Empirical results across various tasks show that our approach outperforms twenty existing methods, validating our theoretical findings. Additionally, both theoretical and empirical results reveal intrinsic limitations in offline MBO, showing a regime in which no offline method can avoid over-optimistic extrapolation.
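
摘要主张离线 MBO 本质上是“近优设计 vs 次优设计”之间的排序问题。下面给出一个假设性的成对 hinge 排序损失示意(非论文方法本身),用于直观说明排序目标与回归目标的差别:只要近优设计的打分高出次优设计一个 margin,损失即为零,而不要求打分逼近真实函数值:

```python
import numpy as np

def pairwise_ranking_loss(scores_good, scores_bad, margin=1.0):
    """Hinge-style ranking loss: every near-optimal design should outscore
    every suboptimal one by at least `margin` (illustrative sketch)."""
    diffs = scores_good[:, None] - scores_bad[None, :]   # all pairs
    return np.mean(np.maximum(0.0, margin - diffs))

# Well-separated surrogate scores incur zero loss ...
loss_sep = pairwise_ranking_loss(np.array([3.0, 2.5]), np.array([1.0, 0.0]))
# ... while an inverted ranking is penalized.
loss_mix = pairwise_ranking_loss(np.array([1.0]), np.array([2.0]))
```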

[LG-23] Specialization of softmax attention heads: insights from the high-dimensional single-location model

链接: https://arxiv.org/abs/2603.03993
作者: M. Sagitova,O. Duranthon,L. Zdeborová
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注:

点击查看摘要

Abstract:Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn similar representations. We propose a theoretical model capturing this phenomenon, based on the multi-index and single-location regression frameworks. In the first part, we analyze the training dynamics of multi-head softmax attention under SGD, revealing an initial unspecialized phase followed by a multi-stage specialization phase in which different heads sequentially align with latent signal directions. In the second part, we study the impact of attention activation functions on performance. We show that softmax-1 significantly reduces noise from irrelevant heads. Finally, we introduce the Bayes-softmax attention, which achieves optimal prediction performance in this setting.
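
摘要提到 softmax-1 能显著抑制无关注意力头的噪声。softmax-1 通常指在 softmax 分母上额外加 1,相当于引入一个 logit 为 0 的“空槽”,使权重之和可以小于 1。下面是一个数值示意(非论文代码):当某个头只看到噪声(logits 都很负)时,普通 softmax 仍被迫把全部权重分给噪声,而 softmax-1 让权重几乎全部流向空槽:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax1(x):
    """softmax-1: an extra +1 in the denominator acts as an implicit null
    slot with logit 0, so attention weights may sum to less than 1."""
    m = max(float(x.max()), 0.0)     # stabilize while keeping the 0 logit
    e = np.exp(x - m)
    return e / (e.sum() + np.exp(-m))

logits = np.array([-10.0, -11.0, -12.0])   # a head that sees only noise
w_soft = softmax(logits)    # forced to sum to 1: noise gets full weight
w_soft1 = softmax1(logits)  # nearly all mass flows to the implicit null slot
```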

[LG-24] LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

链接: https://arxiv.org/abs/2603.03959
作者: Md Akib Haider,Ahsan Bulbul,Nafis Fuad Shahid,Aimaan Ahmed,Mohammad Ishrak Abedin
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Code comment classification is a critical task for automated software documentation and analysis. In the context of the NLBSE’26 Tool Competition, we present LoRA-MME, a Multi-Model Ensemble architecture utilizing Parameter-Efficient Fine-Tuning (PEFT). Our approach addresses the multi-label classification challenge across Java, Python, and Pharo by combining the strengths of four distinct transformer encoders: UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa. By independently fine-tuning these models using Low-Rank Adaptation (LoRA) and aggregating their predictions via a learned weighted ensemble strategy, we maximize classification performance without the memory overhead of full model fine-tuning. Our tool achieved an F1 Weighted score of 0.7906 and a Macro F1 of 0.6867 on the test set. However, the computational cost of the ensemble resulted in a final submission score of 41.20%, highlighting the trade-off between semantic accuracy and inference efficiency.
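
摘要中“无需全量微调的显存开销”来自 LoRA 的低秩增量参数化:W_eff = W + (α/r)·BA,只训练低秩矩阵 A、B。下面的示意(维度与超参为假设值,并非参赛工具的实现)演示了这一重参数化与参数量的节省:

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha):
    """LoRA reparameterization: W_eff = W + (alpha / r) * B @ A, with only
    the low-rank factors A (r x d_in) and B (d_out x r) trained."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

d_in, d_out, r = 768, 768, 8                  # hypothetical encoder layer
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01         # trainable down-projection
B = np.zeros((d_out, r))                      # zero-init up-projection

W_eff = lora_effective_weight(W, A, B, alpha=16)  # equals W before training
full_params = d_out * d_in                    # 589,824
lora_params = r * (d_in + d_out)              # 12,288, about 2% of the matrix
```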

[LG-25] Lang2Str: Two-Stage Crystal Structure Generation with LLM s and Continuous Flow Models

链接: https://arxiv.org/abs/2603.03946
作者: Cong Liu,Chengyue Gong,Zhenyu Liu,Jiale Zhao,Yuxuan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative models hold great promise for accelerating material discovery but are often limited by their inflexible single-stage generative process in designing valid and diverse materials. To address this, we propose a two-stage generative framework, Lang2Str, that combines the strengths of large language models (LLMs) and flow-based models for flexible and precise material generation. Our method frames the generative process as a conditional generative task, where an LLM provides high-level conditions by generating descriptions of material unit cells’ geometric layouts and properties. These descriptions, informed by the LLM’s extensive background knowledge, ensure reasonable structure designs. A conditioned flow model then decodes these textual conditions into precise continuous coordinates and unit cell parameters. This staged approach combines the structured reasoning of LLMs and the distribution modeling capabilities of flow models. Experimental results show that our method achieves competitive performance on ab initio material generation and crystal structure prediction tasks, with generated structures exhibiting closer alignment to ground truth in both geometry and energy levels, surpassing state-of-the-art models. The flexibility and modularity of our framework further enable fine-grained control over the generation process, potentially leading to more efficient and customizable material design.

[LG-26] How Predicted Links Influence Network Evolution: Disentangling Choice and Algorithmic Feedback in Dynamic Graphs

链接: https://arxiv.org/abs/2603.03945
作者: Mathilde Perez,Raphaël Romero,Jefrey Lijffijt,Charlotte Laclau
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Link prediction models are increasingly used to recommend interactions in evolving networks, yet their impact on network structure is typically assessed from static snapshots. In particular, observed homophily conflates intrinsic interaction tendencies with amplification effects induced by network dynamics and algorithmic feedback. We propose a temporal framework based on multivariate Hawkes processes that disentangles these two sources and introduce an instantaneous bias measure derived from interaction intensities, capturing current reinforcement dynamics beyond cumulative metrics. We provide a theoretical characterization of the stability and convergence of the induced dynamics, and experiments show that the proposed measure reliably reflects algorithmic feedback effects across different link prediction strategies.
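
多元 Hawkes 过程的核心是“历史事件抬升当前交互强度”。下面给出单变量指数核 Hawkes 强度函数的最小示意(参数为虚构),仅用于说明公式 λ(t) = μ + Σ α·exp(−β(t−t_i));论文使用的是其多元版本来建模交互动力学与算法反馈:

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """Univariate exponential-kernel Hawkes intensity:
    lambda(t) = mu + sum over past events t_i of alpha * exp(-beta*(t - t_i)).
    mu is the baseline rate; each past event adds a decaying excitation."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)
```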

[LG-27] Hierarchical Inference and Closure Learning via Adaptive Surrogates for ODEs and PDEs

链接: https://arxiv.org/abs/2603.03922
作者: Pengyu Zhang,Arnaud Vadeboncoeur,Alex Glyn-Davies,Mark Girolami
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Inverse problems are the task of calibrating models to match data. They play a pivotal role in diverse engineering applications by allowing practitioners to align models with reality. In many applications, engineers and scientists do not have a complete picture of i) the detailed properties of a system (such as material properties, geometry, initial conditions, etc.); ii) the complete laws describing all dynamics at play (such as friction laws, complicated damping phenomena, and general nonlinear interactions). In this paper, we develop a principled methodology for leveraging data from collections of distinct yet related physical systems to jointly estimate the individual model parameters of each system, and learn the shared unknown dynamics in the form of an ML-based closure model. To robustly infer the unknown parameters for each system, we employ a hierarchical Bayesian framework, which allows for the joint inference of multiple systems and their population-level statistics. To learn the closures, we use a maximum marginal likelihood estimate of a neural network embeded within the ODE/PDE formulation of the problem. To realize this framework we utilize the ensemble Metropolis-Adjusted Langevin Algorithm (MALA) for stable and efficient sampling. To mitigate the computational bottleneck of repetitive forward evaluations in solving inverse problems, we introduce a bilevel optimization strategy to simultaneously train a surrogate forward model alongside the inference. Within this framework, we evaluate and compare distinct surrogate architectures, specifically Fourier Neural Operators (FNO) and parametric Physics-Informed Neural Network (PINNs).
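
论文的采样器基于 ensemble MALA。下面是单条链、单步 MALA 的教学示意(以二维标准正态为假想的目标后验,非论文实现):提议分布是一步 Langevin 漂移加噪声,再用 Metropolis-Hastings 校正保证平稳分布正确:

```python
import numpy as np

def mala_step(x, log_p, grad_log_p, step, rng):
    """One Metropolis-Adjusted Langevin step (illustrative sketch)."""
    noise = rng.normal(size=x.shape)
    prop = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise

    def log_q(a, b):  # log density (up to a constant) of proposing a from b
        diff = a - b - step * grad_log_p(b)
        return -np.sum(diff ** 2) / (4.0 * step)

    log_alpha = log_p(prop) - log_p(x) + log_q(x, prop) - log_q(prop, x)
    if np.log(rng.random()) < log_alpha:
        return prop, True
    return x, False

# Toy target: 2-D standard normal, a stand-in for a posterior over parameters.
log_p = lambda z: -0.5 * np.sum(z ** 2)
grad_log_p = lambda z: -z
rng = np.random.default_rng(0)
x = np.ones(2)
samples = []
for _ in range(2000):
    x, _ = mala_step(x, log_p, grad_log_p, step=0.2, rng=rng)
    samples.append(x.copy())
samples = np.array(samples)
```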

[LG-28] Believe Your Model: Distribution-Guided Confidence Calibration

链接: https://arxiv.org/abs/2603.03872
作者: Xizhong Yang,Haotian Zhang,Huiming Wang,Mofei Song
类目: Machine Learning (cs.LG)
*备注: 38 pages

点击查看摘要

Abstract:Large Reasoning Models have demonstrated remarkable performance with the advancement of test-time scaling techniques, which enhances prediction accuracy by generating multiple candidate responses and selecting the most reliable answer. While prior work has analyzed that internal model signals like confidence scores can partly indicate response correctness and exhibit a distributional correlation with accuracy, such distributional information has not been fully utilized to guide answer selection. Motivated by this, we propose DistriVoting, which incorporates distributional priors as another signal alongside confidence during voting. Specifically, our method (1) first decomposes the mixed confidence distribution into positive and negative components using Gaussian Mixture Models, (2) then applies a reject filter based on positive/negative samples from them to mitigate overlap between the two distributions. Besides, to further alleviate the overlap from the perspective of distribution itself, we propose SelfStepConf, which uses step-level confidence to dynamically adjust inference process, increasing the separation between the two distributions to improve the reliability of confidences in voting. Experiments across 16 models and 5 benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches.
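
DistriVoting 的第一步是用高斯混合模型把混合的置信度分布拆成正/负两个分量。下面用一个手写的一维两分量 EM(而非现成库)在虚构的置信度数据上演示这一拆解思路,并非论文的实现:

```python
import numpy as np

def fit_two_gaussians(x, iters=200):
    """Minimal 1-D EM for a 2-component Gaussian mixture (illustrative)."""
    mu = np.array([x.min(), x.max()], dtype=float)
    sig = np.full(2, x.std() + 1e-3)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each confidence value
        logw = np.log(pi) - np.log(sig) - (x[:, None] - mu) ** 2 / (2 * sig ** 2)
        logw -= logw.max(axis=1, keepdims=True)
        r = np.exp(logw)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations
        n_k = r.sum(axis=0)
        pi = n_k / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k) + 1e-6
    return mu, sig, pi

# Fictitious confidences: incorrect answers cluster low, correct answers high.
rng = np.random.default_rng(0)
conf = np.concatenate([rng.normal(0.40, 0.08, 300), rng.normal(0.85, 0.05, 700)])
mu, sig, pi = fit_two_gaussians(conf)
neg, pos = np.argsort(mu)   # lower-mean = negative, higher-mean = positive
```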

[LG-29] k-hop Fairness: Addressing Disparities in Graph Link Prediction Beyond First-Order Neighborhoods

链接: https://arxiv.org/abs/2603.03867
作者: Lilian Marey,Tiphaine Viard,Charlotte Laclau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Link prediction (LP) plays a central role in graph-based applications, particularly in social recommendation. However, real-world graphs often reflect structural biases, most notably homophily, the tendency of nodes with similar attributes to connect. While this property can improve predictive performance, it also risks reinforcing existing social disparities. In response, fairness-aware LP methods have emerged, often seeking to mitigate these effects by promoting inter-group connections, that is, links between nodes with differing sensitive attributes (e.g., gender), following the principle of dyadic fairness. However, dyadic fairness overlooks potential disparities within the sensitive groups themselves. To overcome this issue, we propose k-hop fairness, a structural notion of fairness for LP, that assesses disparities conditioned on the distance between nodes in the graph. We formalize this notion through predictive fairness and structural bias metrics, and propose pre- and post-processing mitigation strategies. Experiments across standard LP benchmarks reveal: (1) a strong tendency of models to reproduce structural biases at different k-hops; (2) interdependence between structural biases at different hops when rewiring graphs; and (3) that our post-processing method achieves favorable k-hop performance-fairness trade-offs compared to existing fair LP baselines.

[LG-30] Large-Margin Hyperdimensional Computing: A Learning-Theoretical Perspective

链接: https://arxiv.org/abs/2603.03830
作者: Nikita Zeulin,Olga Galinina,Ravikumar Balakrishnan,Nageen Himayat,Sergey Andreev
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Overparameterized machine learning (ML) methods such as neural networks may be prohibitively resource intensive for devices with limited computational capabilities. Hyperdimensional computing (HDC) is an emerging resource efficient and low-complexity ML method that allows hardware efficient implementations of (re-)training and inference procedures. In this paper, we propose a maximum-margin HDC classifier, which significantly outperforms baseline HDC methods on several benchmark datasets. Our method leverages a formal relation between HDC and support vector machines (SVMs) that we established for the first time. Our findings may inspire novel HDC methods with potentially more hardware-oriented implementations compared to SVMs, thus enabling more efficient learning solutions for various intelligent resource-constrained applications.
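
摘要建立了 HDC 与 SVM 之间的形式联系。下面的示意展示一个常见的 HDC 分类流程(随机投影编码、捆绑出类原型、点积打分):其判别规则 h·(p₁−p₀) > 0 正是超维空间中的线性分类器,这也是与 SVM 最大间隔相联系的切入点。数据与维度均为虚构,仅作原理演示:

```python
import numpy as np

def encode(x, P):
    """Random-projection HDC encoding: bipolar hypervector sign(P @ x)."""
    return np.sign(P @ x)

def bundle(hvs):
    """Class prototype via bundling: element-wise majority (sum then sign)."""
    return np.sign(np.sum(hvs, axis=0))

rng = np.random.default_rng(0)
D, d = 4096, 10                       # hyperdimension and input dimension
P = rng.normal(size=(D, d))

X0 = rng.normal(-1.0, 0.3, size=(30, d))   # class 0 cluster
X1 = rng.normal(+1.0, 0.3, size=(30, d))   # class 1 cluster
proto0 = bundle(np.array([encode(x, P) for x in X0]))
proto1 = bundle(np.array([encode(x, P) for x in X1]))

def classify(x):
    h = encode(x, P)
    # h @ (proto1 - proto0) > 0 is a linear decision rule in hypervector
    # space, which is where the connection to SVM margins enters.
    return int(h @ proto1 > h @ proto0)

acc = np.mean([classify(x) == 0 for x in X0] + [classify(x) == 1 for x in X1])
```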

[LG-31] A Bi-Stage Framework for Automatic Development of Pixel-Based Planar Antenna Structures

链接: https://arxiv.org/abs/2603.03810
作者: Khadijeh Askaripour,Adrian Bekasiewicz,Slawomir Koziel
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Development of modern antennas is a cognitive process that intertwines experience-driven determination of topology and tuning of its parameters to fulfill the performance specifications. Alternatively, the task can be formulated as an optimization problem so as to reduce reliance of geometry selection on engineering insight. In this work, a bi-stage framework for automatic generation of antennas is considered. The method determines free-form topology through optimization of interconnections between components (so-called pixels) that constitute the radiator. Here, the process involves global optimization of connections between pixels followed by fine-tuning of the resulting topology using a surrogate-assisted local-search algorithm to fulfill the design requirements. The approach has been demonstrated based on two case studies concerning development of broadband and dual-band monopole antennas.

[LG-32] Unsupervised Surrogate-Assisted Synthesis of Free-Form Planar Antenna Topologies for IoT Applications

链接: https://arxiv.org/abs/2603.03802
作者: Khadijeh Askaripour,Adrian Bekasiewicz,Slawomir Koziel
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Design of antenna structures for Internet of Things (IoT) applications is a challenging problem. Contemporary radiators are often subject to a number of electric and/or radiation-related requirements, but also constraints imposed by specifics of IoT systems and/or intended operational environments. Conventional approaches to antenna design typically involve manual development of topology intertwined with its tuning. Although proved useful, the approach is prone to errors and engineering bias. Alternatively, geometries can be generated and optimized without supervision of the designer. The process can be controlled by suitable algorithms to determine and then adjust the antenna geometry according to the specifications. Unfortunately, automatic design of IoT radiators is associated with challenges such as determination of desirable geometries or high optimization cost. In this work, a variable-fidelity framework for performance-oriented development of free-form antennas represented using the generic simulation models is proposed. The method employs a surrogate-assisted classifier capable of identifying a suitable radiator topology from a set of automatically generated (and stored for potential re-use) candidate designs. The obtained geometry is then subject to a bi-stage tuning performed using a gradient-based optimization engine. The presented framework is demonstrated based on six numerical experiments concerning unsupervised development of bandwidth-enhanced patch antennas dedicated to work within 5 GHz to 6 GHz and 6 GHz to 7 GHz bands, respectively. Extensive benchmarks of the method, as well as the generated topologies are also performed.

[LG-33] Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

链接: https://arxiv.org/abs/2603.03778
作者: Yuqi Kong,Xiao Zhang,Weiran Shen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner’s rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner’s behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free observer can achieve a convergence rate of \tilde{O}(1/\sqrt{N}), matching the asymptotic efficiency of a fully reward-aware learner. This result demonstrates that a passive observer can effectively uncover the optimal policy from actions alone, attaining performance comparable to that of the learner itself.
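
按摘要描述的“丢弃 burn-in、只用后缀做模仿”的思想,下面给出一个极简示意:观察者只看到 (context, action) 序列,对后缀按上下文取多数动作即可恢复学习者收敛到的最优策略。数据生成过程(探索 300 轮后转为利用)为虚构,代码也非论文实现:

```python
import random

def suffix_imitation(observed, burn_in):
    """Drop the first `burn_in` (context, action) pairs, then imitate the
    learner by majority action per context (illustrative sketch)."""
    counts = {}
    for ctx, act in observed[burn_in:]:
        counts.setdefault(ctx, {})
        counts[ctx][act] = counts[ctx].get(act, 0) + 1
    return {c: max(acts, key=acts.get) for c, acts in counts.items()}

# Fictitious learner: explores uniformly for 300 rounds, then exploits the
# optimal action.  The observer sees only (context, action), never rewards.
rng = random.Random(0)
optimal = {0: 2, 1: 0, 2: 1}
log = []
for t in range(900):
    ctx = rng.randrange(3)
    act = rng.randrange(3) if t < 300 else optimal[ctx]
    log.append((ctx, act))

policy = suffix_imitation(log, burn_in=300)   # recovers `optimal` exactly
```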

[LG-34] LEA: Label Enumeration Attack in Vertical Federated Learning

链接: https://arxiv.org/abs/2603.03777
作者: Wenhao Jiang,Shaojing Fu,Yuchuan Luo,Lin Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A typical Vertical Federated Learning (VFL) scenario involves several participants collaboratively training a machine learning model, where each party has different features for the same samples, with labels held exclusively by one party. Since labels contain sensitive information, VFL must ensure the privacy of labels. However, existing VFL-targeted label inference attacks are either limited to specific scenarios or require auxiliary data, rendering them impractical in real-world applications. We introduce a novel Label Enumeration Attack (LEA) that, for the first time, achieves applicability across multiple VFL scenarios and eschews the need for auxiliary data. Our intuition is that an adversary, employing clustering to enumerate mappings between samples and labels, ascertains the accurate label mappings by evaluating the similarity between the benign model and the simulated models trained under each mapping. To achieve that, the first challenge is how to measure model similarity, as models trained on the same data can have different weights. Drawing from our findings, we propose an efficient approach for assessing congruence based on the cosine similarity of the first-round loss gradients, which offers superior efficiency and precision compared to the comparison of parameter similarities. However, the computational cost may be prohibitive due to the necessity of training and comparing the vast number of simulated models generated through enumeration. To overcome this challenge, we propose Binary-LEA from the perspective of reducing the number of models and eliminating futile training, which lowers the number of enumerations from n! to n^3. Moreover, LEA is resilient against common defense mechanisms such as gradient noise and gradient compression. 
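The core similarity test in LEA, comparing first-round loss gradients via cosine similarity, can be sketched in a few lines. The linear model, data shapes, and function names below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    return float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2)))

def first_round_gradient(X, y, w):
    """Gradient of the squared loss 0.5*||Xw - y||^2 at initial weights w."""
    return X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                    # features visible to the adversary
y_true = rng.integers(0, 2, 32).astype(float)   # labels held by the other party
w0 = np.zeros(4)                                # shared initialization

g_benign = first_round_gradient(X, y_true, w0)

# Enumerate candidate label mappings; score each by how well the gradient
# of a simulated model agrees with the benign model's first-round gradient.
candidates = [y_true, 1.0 - y_true, rng.permutation(y_true)]
scores = [cosine_similarity(g_benign, first_round_gradient(X, y, w0))
          for y in candidates]
best = int(np.argmax(scores))  # the true mapping maximizes the similarity
```

Comparing first-round gradients rather than trained parameters avoids the ambiguity that models trained on the same data can converge to different weights.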

[LG-35] Harmonic Dataset Distillation for Time Series Forecasting AAAI2026

链接: https://arxiv.org/abs/2603.03760
作者: Seungha Hong,Sanghwan Jang,Wonbin Kweon,Suyeon Kim,Gyuseok Lee,Hwanjo Yu
类目: Machine Learning (cs.LG)
*备注: AAAI 2026

点击查看摘要

Abstract:Time series forecasting (TSF) in the modern era faces significant computational and storage cost challenges due to the massive scale of real-world data. Dataset Distillation (DD), a paradigm that synthesizes a small, compact dataset to achieve training performance comparable to that of the original dataset, has emerged as a promising solution. However, conventional DD methods are not tailored for time series and suffer from architectural overfitting and limited scalability. To address these issues, we propose Harmonic Dataset Distillation for Time Series Forecasting (HDT). HDT decomposes the time series into its sinusoidal basis through the FFT and aligns the core periodic structure by Harmonic Matching. Since this process operates in the frequency domain, all updates during distillation are applied globally without disrupting temporal dependencies of time series. Extensive experiments demonstrate that HDT achieves strong cross-architecture generalization and scalability, validating its practicality for large-scale, real-world applications.
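The frequency-domain idea can be illustrated with a minimal sketch assuming a simple top-k harmonic selection; the function names and the mismatch loss are illustrative, not the paper's exact Harmonic Matching objective:

```python
import numpy as np

def top_harmonics(x, k):
    """Indices and complex amplitudes of the k strongest components
    in the real-input FFT of a series."""
    spec = np.fft.rfft(x)
    idx = np.argsort(np.abs(spec))[::-1][:k]
    return idx, spec[idx]

def harmonic_mismatch(x_real, x_syn, k):
    """Toy harmonic-matching loss: compare the real series' dominant
    harmonics against the same bins of the synthetic spectrum."""
    idx, amp_real = top_harmonics(x_real, k)
    amp_syn = np.fft.rfft(x_syn)[idx]
    return float(np.mean(np.abs(amp_real - amp_syn) ** 2))

t = np.arange(256)
real = np.sin(2 * np.pi * 5 * t / 256) + 0.5 * np.sin(2 * np.pi * 13 * t / 256)
good = real.copy()                       # synthetic series with matched harmonics
bad = np.sin(2 * np.pi * 40 * t / 256)   # wrong periodic structure

loss_good = harmonic_mismatch(real, good, k=2)
loss_bad = harmonic_mismatch(real, bad, k=2)
```

Because the comparison happens on whole-spectrum coefficients, an update to any harmonic changes the synthetic series globally, which is the property the abstract highlights.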

[LG-36] A Stein Identity for q-Gaussians with Bounded Support

链接: https://arxiv.org/abs/2603.03673
作者: Sophia Sklaviadis,Thomas Moellenhoff,Andre F. T. Martins,Mario A. T. Figueiredo,Mohammad Emtiyaz Khan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Stein’s identity is a fundamental tool in machine learning with applications in generative models, stochastic optimization, and other problems involving gradients of expectations under Gaussian distributions. Less attention has been paid to problems with non-Gaussian expectations. Here, we consider the class of bounded-support q-Gaussians and derive a new Stein identity leading to gradient estimators which have nearly identical forms to the Gaussian ones, and which are similarly easy to implement. We do this by extending the previous results of Landsman, Vanduffel, and Yao (2013) to prove new Bonnet- and Price-type theorems for q-Gaussians. We also simplify their forms by using escort distributions. Our experiments show that bounded-support distributions can reduce the variance of gradient estimators, which can potentially be useful for Bayesian deep learning and sharpness-aware minimization. Overall, our work simplifies the application of Stein’s identity for an important class of non-Gaussian distributions.
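For reference, the classical Stein identity for a standard Gaussian, which the paper extends to bounded-support q-Gaussians, reads:

```latex
% Stein's identity for a standard Gaussian, valid for any absolutely
% continuous f with E|f'(X)| finite:
\mathbb{E}\left[ X\, f(X) \right] = \mathbb{E}\left[ f'(X) \right],
\qquad X \sim \mathcal{N}(0, 1).
```

Gradient estimators for expectations under Gaussians follow by applying this identity to the function being averaged; the paper's contribution is estimators of the same form when the Gaussian is replaced by a bounded-support q-Gaussian.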

[LG-37] Freezing of Gait Prediction using Proactive Agent that Learns from Selected Experience and DDQN Algorithm

链接: https://arxiv.org/abs/2603.03651
作者: Septian Enggar Sukmana(1),Sang Won Bae(2),Tomohiro Shibata(1) ((1) Kyushu Institute of Technology, (2) Stevens Institute of Technology)
类目: Machine Learning (cs.LG)
*备注: Accepted at the Activity and Behavior Computing (ABC) 2026 Conference ( this https URL ); to be published in the International Journal of Activity and Behavior Computing (IJABC)

点击查看摘要

Abstract:Freezing of Gait (FOG) is a debilitating motor symptom commonly experienced by individuals with Parkinson’s Disease (PD) which often leads to falls and reduced mobility. Timely and accurate prediction of FOG episodes is essential for enabling proactive interventions through assistive technologies. This study presents a reinforcement learning-based framework designed to identify optimal pre-FOG onset points, thereby extending the prediction horizon for anticipatory cueing systems. The model implements a Double Deep Q-Network (DDQN) architecture enhanced with Prioritized Experience Replay (PER) allowing the agent to focus learning on high-impact experiences and refine its policy. Trained over 9000 episodes with a reward shaping strategy that promotes cautious decision-making, the agent demonstrated robust performance in both subject-dependent and subject-independent evaluations. The model achieved a prediction horizon of up to 8.72 seconds prior to FOG onset in subject-independent scenarios and 7.89 seconds in subject-dependent settings. These results highlight the model’s potential for integration into wearable assistive devices, offering timely and personalized interventions to mitigate FOG in PD patients.
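The Double DQN target at the core of this architecture decouples action selection (online network) from action evaluation (target network); a minimal tabular sketch, where the toy states, actions, and values are illustrative stand-ins for the learned networks:

```python
import numpy as np

def ddqn_target(reward, next_state, q_online, q_target, gamma=0.99, done=False):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it, reducing overestimation bias."""
    if done:
        return reward
    a_star = int(np.argmax(q_online[next_state]))          # selection: online net
    return reward + gamma * q_target[next_state, a_star]   # evaluation: target net

# Tiny example: 2 states (e.g. gait-risk levels), 2 actions (no cue / cue).
q_online = np.array([[1.0, 2.0],
                     [0.5, 3.0]])
q_target = np.array([[1.1, 1.8],
                     [0.4, 2.5]])

y = ddqn_target(reward=1.0, next_state=1, q_online=q_online, q_target=q_target)
# online argmax in state 1 is action 1; the target net evaluates it
```

In the full method these targets are regressed with prioritized sampling (PER), so transitions with large temporal-difference error are replayed more often.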

[LG-38] Adaptive Sensing of Continuous Physical Systems for Machine Learning

链接: https://arxiv.org/abs/2603.03650
作者: Felix Köster,Atsushi Uchida
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Physical dynamical systems can be viewed as natural information processors: their systems preserve, transform, and disperse input information. This perspective motivates learning not only from data generated by such systems, but also how to measure them in a way that extracts the most useful information for a given task. We propose a general computing framework for adaptive information extraction from dynamical systems, in which a trainable attention module learns both where to probe the system state and how to combine these measurements to optimize prediction performance. As a concrete instantiation, we implement this idea using a spatiotemporal field governed by a partial differential equation as the underlying dynamics, though the framework applies equally to any system whose state can be sampled. Our results show that adaptive spatial sensing significantly improves prediction accuracy on canonical chaotic benchmarks. This work provides a perspective on attention-enhanced reservoir computing as a special case of a broader paradigm: neural networks as trainable measurement devices for extracting information from physical dynamical systems.
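The trainable-measurement idea, attention weights over candidate probe locations combined into a single readout, can be sketched as follows; the softmax readout and all names are illustrative assumptions:

```python
import numpy as np

def attention_readout(field, probe_scores):
    """Combine measurements of a sampled field with attention weights.

    field:        (n_points,) values of the physical field at candidate probes
    probe_scores: (n_points,) learnable logits over probe locations
    """
    w = np.exp(probe_scores - probe_scores.max())
    w /= w.sum()                      # softmax attention over probe locations
    return float(w @ field), w        # weighted measurement and the weights

field = np.array([0.1, 2.0, -0.3, 0.05])   # snapshot of a spatiotemporal field
scores = np.array([0.0, 4.0, 0.0, 0.0])    # attention concentrated on probe 1

reading, weights = attention_readout(field, scores)
```

Training the scores end-to-end against prediction loss is what lets the system learn *where* to probe, not only how to combine the probes.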

[LG-39] Riemannian Optimization in Modular Systems

链接: https://arxiv.org/abs/2603.03610
作者: Christian Pehle,Jean-Jacques Slotine
类目: Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Understanding how systems built out of modular components can be jointly optimized is an important problem in biology, engineering, and machine learning. The backpropagation algorithm is one such solution and has been instrumental in the success of neural networks. Despite its empirical success, a strong theoretical understanding of it is lacking. Here, we combine tools from Riemannian geometry, optimal control theory, and theoretical physics to advance this understanding. We make three key contributions: First, we revisit the derivation of backpropagation as a constrained optimization problem and combine it with the insight that Riemannian gradient descent trajectories can be understood as the minimum of an action. Second, we introduce a recursively defined layerwise Riemannian metric that exploits the modular structure of neural networks and can be efficiently computed using the Woodbury matrix identity, avoiding the O(n^3) cost of full metric inversion. Third, we develop a framework of composable “Riemannian modules” whose convergence properties can be quantified using nonlinear contraction theory, providing algorithmic stability guarantees of order O(\kappa^2 L/(\xi \mu \sqrt{n})) where \kappa and L are Lipschitz constants, \mu is the mass matrix scale, and \xi bounds the condition number. Our layerwise metric approach provides a practical alternative to natural gradient descent. While we focus here on studying neural networks, our approach more generally applies to the study of systems made of modules that are optimized over time, as it occurs in biology during both evolution and development.
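The Woodbury matrix identity that keeps the layerwise metric cheap to invert can be checked numerically. This is a generic sketch of the identity itself, not the paper's metric:

```python
import numpy as np

def woodbury_inverse(A_inv, U, C, V):
    """(A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}.

    Cheap when A is easy to invert (e.g. diagonal) and C is small
    (rank k << n), since only a k x k matrix is inverted."""
    inner = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
    return A_inv - A_inv @ U @ inner @ V @ A_inv

rng = np.random.default_rng(1)
n, k = 50, 3
A = np.diag(rng.uniform(1.0, 2.0, n))   # easy-to-invert base matrix
U = rng.normal(size=(n, k))
C = np.eye(k)
V = U.T                                  # symmetric low-rank correction

fast = woodbury_inverse(np.diag(1.0 / np.diag(A)), U, C, V)
direct = np.linalg.inv(A + U @ C @ V)
err = float(np.abs(fast - direct).max())
```

The low-rank route replaces one O(n^3) inversion with an O(k^3) one plus matrix products, which is the saving the abstract refers to.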

[LG-40] NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training

链接: https://arxiv.org/abs/2603.03597
作者: Hadi Mohaghegh Dolatabadi,Thalaiyasingam Ajanthan,Sameera Ramasinghe,Chamin P Hewa Koneputugodage,Shamane Siriwardhana,Violetta Shevchenko,Karol Pajak,James Snewin,Gil Avraham,Alexander Long
类目: Machine Learning (cs.LG)
*备注: 47 pages, 22 figures, 18 tables

点击查看摘要

Abstract:The rapid progress of large language models (LLMs) is increasingly constrained by memory and deployment costs, motivating compression methods for practical deployment. Many state-of-the-art compression pipelines leverage the low-rank structure of trained weight matrices, a phenomenon often associated with the properties of popular optimizers such as Adam. In this context, Muon is a recently proposed optimizer that improves LLM pretraining via full-rank update steps, but its induced weight-space structure has not been characterized yet. In this work, we report a surprising empirical finding: despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. Motivated by this insight, we propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure. Across billion-parameter-scale models, we show that NuMuon increases weight compressibility and improves post-compression model quality under state-of-the-art LLM compression pipelines while retaining Muon’s favorable convergence behavior.
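The nuclear norm of an update matrix is the sum of its singular values; a toy sketch of constraining it by radial rescaling (a simple stand-in for illustration, not necessarily NuMuon's exact constraint mechanism):

```python
import numpy as np

def nuclear_norm(W):
    """Sum of singular values of a matrix."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

def project_nuclear_norm(W, tau):
    """Rescale W so its nuclear norm is at most tau.

    Since ||c*W||_* = c * ||W||_*, scaling by tau / ||W||_* lands
    exactly on the constraint boundary when the norm exceeds tau."""
    nn = nuclear_norm(W)
    return W if nn <= tau else W * (tau / nn)

rng = np.random.default_rng(0)
update = rng.normal(size=(8, 6))         # stand-in for an optimizer update step

constrained = project_nuclear_norm(update, tau=3.0)
```

Penalizing or bounding the nuclear norm biases updates toward a few dominant singular directions, which is the low-rank structure the paper exploits for compression.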

[LG-41] MEM: Multi-Scale Embodied Memory for Vision Language Action Models

链接: https://arxiv.org/abs/2603.03596
作者: Marcel Torne,Karl Pertsch,Homer Walke,Kyle Vedder,Suraj Nair,Brian Ichter,Allen Z. Ren,Haohuan Wang,Jiaming Tang,Kyle Stachowicz,Karan Dhabalia,Michael Equi,Quan Vuong,Jost Tobias Springenberg,Sergey Levine,Chelsea Finn,Danny Driess
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this https URL

点击查看摘要

Abstract:Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot’s memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.

[LG-42] Hybrid Belief Reinforcement Learning for Efficient Coordinated Spatial Exploration

链接: https://arxiv.org/abs/2603.03595
作者: Danish Rizvi,David Boyle
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Coordinating multiple autonomous agents to explore and serve spatially heterogeneous demand requires jointly learning unknown spatial patterns and planning trajectories that maximize task performance. Pure model-based approaches provide structured uncertainty estimates but lack adaptive policy learning, while deep reinforcement learning often suffers from poor sample efficiency when spatial priors are absent. This paper presents a hybrid belief-reinforcement learning (HBRL) framework to address this gap. In the first phase, agents construct spatial beliefs using a Log-Gaussian Cox Process (LGCP) and execute information-driven trajectories guided by a Pathwise Mutual Information (PathMI) planner with multi-step lookahead. In the second phase, trajectory control is transferred to a Soft Actor-Critic (SAC) agent, warm-started through dual-channel knowledge transfer: belief state initialization supplies spatial uncertainty, and replay buffer seeding provides demonstration trajectories generated during LGCP exploration. A variance-normalized overlap penalty enables coordinated coverage through shared belief state, permitting cooperative sensing in high-uncertainty regions while discouraging redundant coverage in well-explored areas. The framework is evaluated on a multi-UAV wireless service provisioning task. Results show 10.8% higher cumulative reward and 38% faster convergence over baselines, with ablation studies confirming that dual-channel transfer outperforms either channel alone.

[LG-43] SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training

链接: https://arxiv.org/abs/2603.03592
作者: Hadi Mohaghegh Dolatabadi,Thalaiyasingam Ajanthan,Sameera Ramasinghe,Chamin P Hewa Koneputugodage,Gil Avraham,Yan Zuo,Violetta Shevchenko,Alexander Long
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 70 pages, 22 figures, 20 tables

点击查看摘要

Abstract:Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers where the activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recovers classical convergence rates when relaxed to standard training. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
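The momentum-based monitoring idea can be sketched as an exponential moving average over inter-stage tensor norms with a deviation threshold; the threshold rule and parameters here are illustrative assumptions, not SENTINEL's exact statistic:

```python
class EmaMonitor:
    """Flag inter-stage tensors whose norm deviates sharply from a running EMA."""

    def __init__(self, beta=0.9, threshold=3.0):
        self.beta = beta
        self.threshold = threshold
        self.ema = None

    def check(self, norm):
        """Update the EMA with `norm`; return True if the value looks corrupted."""
        if self.ema is None:
            self.ema = norm
            return False
        suspicious = norm > self.threshold * self.ema
        if not suspicious:  # only fold trusted values into the statistic
            self.ema = self.beta * self.ema + (1 - self.beta) * norm
        return suspicious

monitor = EmaMonitor()
# Normal activation-norm traffic, then one corrupted transmission.
flags = [monitor.check(n) for n in [1.0, 1.1, 0.9, 1.05, 50.0, 1.0]]
```

Because only a scalar statistic per stage is tracked, the check adds negligible overhead compared to duplicating the stage's computation.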

[LG-44] stratum: A System Infrastructure for Massive Agent-Centric ML Workloads

链接: https://arxiv.org/abs/2603.03589
作者: Arnab Phani,Elias Strauss,Sebastian Schelter
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) transform how machine learning (ML) pipelines are developed and evaluated. LLMs enable a new type of workload, agentic pipeline search, in which autonomous or semi-autonomous agents generate, validate, and optimize complete ML pipelines. These agents predominantly operate over popular Python ML libraries and exhibit highly exploratory behavior. This results in thousands of executions for data profiling, pipeline generation, and iterative refinement of pipeline stages. However, the existing Python-based ML ecosystem is built around libraries such as Pandas and scikit-learn, which are designed for human-centric, interactive, sequential workflows and remain constrained by Python’s interpretive execution model, library-level isolation, and limited runtime support for executing large numbers of pipelines. Meanwhile, many high-performance ML systems proposed by the systems community either target narrow workload classes or require specialized programming models, which limits their integration with the Python ML ecosystem and makes them largely ill-suited for LLM-based agents. This growing mismatch exposes a fundamental systems challenge in supporting agentic pipeline search at scale. We therefore propose stratum, a unified system infrastructure that decouples pipeline execution from planning and reasoning during agentic pipeline search. Stratum integrates seamlessly with existing Python libraries, compiles batches of pipelines into optimized execution graphs, and efficiently executes them across heterogeneous backends, including a novel Rust-based runtime. We present stratum’s architectural vision along with an early prototype, discuss key design decisions, and outline open challenges and research directions. Finally, preliminary experiments show that stratum can significantly speed up large-scale agentic pipeline search up to 16.6x.

[LG-45] Transport Clustering: Solving Low-Rank Optimal Transport via Clustering

链接: https://arxiv.org/abs/2603.03578
作者: Henri Schmidt,Peter Halmos,Ben Raphael
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimal transport (OT) finds a least cost transport plan between two probability distributions using a cost matrix defined on pairs of points. Unlike standard OT, which infers unstructured pointwise mappings, low-rank optimal transport explicitly constrains the rank of the transport plan to infer latent structure. This improves statistical stability and robustness, yields sharper parametric rates for estimating Wasserstein distances adaptive to the intrinsic rank, and generalizes K-means to co-clustering. These advantages, however, come at the cost of a non-convex and NP-hard optimization problem. We introduce transport clustering, an algorithm to compute a low-rank OT plan that reduces low-rank OT to a clustering problem on correspondences obtained from a full-rank transport registration step. We prove that this reduction yields polynomial-time, constant-factor approximation algorithms for low-rank OT: specifically, a (1+\gamma) approximation for negative-type metrics and a (1+\gamma+\sqrt{2\gamma}) approximation for kernel costs, where \gamma \in [0,1] denotes the approximation ratio of the optimal full-rank solution relative to the low-rank optimal. Empirically, transport clustering outperforms existing low-rank OT solvers on synthetic benchmarks and large-scale, high-dimensional datasets.

[LG-46] Real-time tightly coupled GNSS and IMU integration via Factor Graph Optimization

链接: https://arxiv.org/abs/2603.03556
作者: Radu-Andrei Cioaca,Paul Irofti,Cristian Rusu,Gianluca Caparra,Andrei-Alexandru Marinache,Florin Stoican
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Reliable positioning in dense urban environments remains challenging due to frequent GNSS signal blockage, multipath, and rapidly varying satellite geometry. While factor graph optimization (FGO)-based GNSS-IMU fusion has demonstrated strong robustness and accuracy, most formulations remain offline. In this work, we present a real-time tightly coupled GNSS-IMU FGO method that enables causal state estimation via incremental optimization with fixed-lag marginalization, and we evaluate its performance in a highly urbanized GNSS-degraded environment using the UrbanNav dataset.

[LG-47] Real-time loosely coupled GNSS and IMU integration via Factor Graph Optimization

链接: https://arxiv.org/abs/2603.03546
作者: Radu-Andrei Cioaca,Cristian Rusu,Paul Irofti,Gianluca Caparra,Andrei-Alexandru Marinache,Florin Stoican
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Accurate positioning, navigation, and timing (PNT) is fundamental to the operation of modern technologies and a key enabler of autonomous systems. A very important component of PNT is the Global Navigation Satellite System (GNSS) which ensures outdoor positioning. Modern research directions have pushed the performance of GNSS localization to new heights by fusing GNSS measurements with other sensory information, mainly measurements from Inertial Measurement Units (IMU). In this paper, we propose a loosely coupled architecture to integrate GNSS and IMU measurements using a Factor Graph Optimization (FGO) framework. Because the FGO method can be computationally challenging and often used as a post-processing method, our focus is on assessing its localization accuracy and service availability while operating in real-time in challenging environments (urban canyons). Experimental results on the UrbanNav-HK-MediumUrban-1 dataset show that the proposed approach achieves real-time operation and increased service availability compared to batch FGO methods. While this improvement comes at the cost of reduced positioning accuracy, the paper provides a detailed analysis of the trade-offs between accuracy, availability, and computational efficiency that characterize real-time FGO-based GNSS/IMU fusion.

[LG-48] Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

链接: https://arxiv.org/abs/2603.03538
作者: Maria-Florina Balcan,Avrim Blum,Kiriaki Fragkia,Zhiyuan Li,Dravyansh Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models with chain-of-thought generation have demonstrated great potential for producing complex mathematical proofs. However, their reasoning can often go astray, leading to increasing interest in formal and learned verifiers. A major challenge in learning verifiers, especially when their output will be used by the prover, is that this feedback loop may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric role of soundness (failure in catching errors in a proof) and completeness (flagging correct proofs as wrong) mistakes of the verifier, we introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto-frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak provers, and enable generation of proofs beyond what they were trained on. With the mild assumption that one of the provers can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong prover with small error and abstention rates.

[LG-49] Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts

链接: https://arxiv.org/abs/2603.03535
作者: Sanae Lotfi,Lucas Caccia,Alessandro Sordoni,Jordan T. Ash,Miroslav Dudik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) fine-tuned with lightweight adapters achieve strong performance across diverse tasks, their performance on individual tasks depends on the fine-tuning strategy. Fusing independently trained models with different strengths has shown promise for multi-task learning through three main strategies: ensembling, which combines outputs from independent models; merging, which fuses model weights via parameter averaging; and routing, which integrates models in an input-dependent fashion. However, many design decisions in these approaches remain understudied, and the relative benefits of more sophisticated ensembling, merging and routing techniques are not fully understood. We empirically evaluate their trade-offs, addressing two key questions: What are the advantages of going beyond uniform ensembling or merging? And does the flexibility of routing justify its complexity? Our findings indicate that non-uniform ensembling and merging improve performance, but routing offers even greater gains. To mitigate the computational cost of routing, we analyze expert selection techniques, showing that clustering and greedy subset selection can maintain reasonable performance with minimal overhead. These insights advance our understanding of model fusion for multi-task learning.
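The difference between uniform and non-uniform merging reduces to the choice of averaging coefficients; a minimal sketch over expert parameter dictionaries, with names and shapes chosen for illustration:

```python
import numpy as np

def merge_experts(experts, coeffs=None):
    """Merge expert parameter dicts by (weighted) parameter averaging.

    experts: list of {param_name: ndarray}
    coeffs:  per-expert weights; uniform merging if None.
             Coefficients are normalized to sum to 1."""
    if coeffs is None:
        coeffs = [1.0] * len(experts)          # uniform merging
    c = np.asarray(coeffs, dtype=float)
    c = c / c.sum()
    return {name: sum(ci * e[name] for ci, e in zip(c, experts))
            for name in experts[0]}

expert_a = {"lora.A": np.ones((2, 2)), "lora.B": np.zeros((2, 2))}
expert_b = {"lora.A": 3 * np.ones((2, 2)), "lora.B": 2 * np.ones((2, 2))}

uniform = merge_experts([expert_a, expert_b])              # plain average
weighted = merge_experts([expert_a, expert_b], [0.25, 0.75])
```

Ensembling would instead average the experts' *outputs* per input, and routing would choose the coefficients per input, which is where the extra flexibility and cost come from.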

[LG-50] Logit-Level Uncertainty Quantification in Vision-Language Models for Histopathology Image Analysis

链接: https://arxiv.org/abs/2603.03527
作者: Betul Yurdem,Ferhat Ozgur Catak,Murat Kuzlu,Mehmet Kemal Gullu
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Vision-Language Models (VLMs) with their multimodal capabilities have demonstrated remarkable success in almost all domains, including education, transportation, healthcare, energy, finance, law, and retail. Nevertheless, the utilization of VLMs in healthcare applications raises crucial concerns due to the sensitivity of large-scale medical data and the trustworthiness of these models (reliability, transparency, and security). This study proposes a logit-level uncertainty quantification (UQ) framework for histopathology image analysis using VLMs to address these concerns. UQ is evaluated for three VLMs using metrics derived from temperature-controlled output logits. The proposed framework demonstrates a critical separation in uncertainty behavior. The general-purpose VLMs show high stochastic sensitivity (cosine similarity (CS) 0.71 and 0.84, Jensen-Shannon divergence (JS) 0.57 and 0.38, and Kullback-Leibler divergence (KL) 0.55 and 0.35, respectively, for mean values of VILA-M3-8B and LLaVA-Med v1.5), near-maximal temperature impacts (\Delta_T \approx 1.00), and abrupt uncertainty transitions, particularly for complex diagnostic prompts. In contrast, the pathology-specific PRISM model maintains near-deterministic behavior (mean CS 0.90, JS 0.10, KL 0.09) and minimal temperature effects across all prompt complexities. These findings emphasize the importance of logit-level uncertainty quantification for evaluating trustworthiness in histopathology applications utilizing VLMs.
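The logit-level quantities used in the study, temperature-scaled softmax plus cosine similarity, JS, and KL divergences, can be computed with a short sketch; the toy logits are illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, bounded KL."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(p, q):
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

logits = np.array([2.0, 1.0, 0.1])    # toy output logits for one token/class set
p_cold = softmax(logits, T=0.5)       # sharper distribution
p_hot = softmax(logits, T=2.0)        # flatter distribution

divergence = js(p_cold, p_hot)
similarity = cosine(p_cold, p_hot)
```

A temperature-sensitive model shows large JS/KL and low CS between temperature settings, which is the signature of high stochastic sensitivity described above.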

[LG-51] Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence

链接: https://arxiv.org/abs/2603.03523
作者: Shengbo Wang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with O(n) memory and O(n) computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal Q^* as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-item inventory control setting.
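Reconstructing an action-value estimate by integrating a kernel against a signed empirical measure amounts to a weighted kernel sum over visited pairs; a schematic sketch in which the RBF kernel and all values are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=0.5):
    """Gaussian (RBF) kernel between two state-action vectors."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * bandwidth ** 2))

def q_from_measure(query, support, weights, bandwidth=0.5):
    """Q(s, a) reconstructed by integrating the kernel against a signed
    empirical measure: sum_i w_i * k((s, a), (s_i, a_i))."""
    return float(sum(w * rbf_kernel(query, x, bandwidth)
                     for w, x in zip(weights, support)))

# Visited (state, action) pairs with signed measure weights.
support = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
weights = [1.5, -0.4, 0.8]

q_near = q_from_measure(np.array([0.0, 0.0]), support, weights)   # near support
q_far = q_from_measure(np.array([10.0, 10.0]), support, weights)  # far from data
```

Since the estimate is carried entirely by the n visited pairs and their weights, memory and per-query cost are O(n), matching the efficiency claim in the abstract.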

[LG-52] Orbital Transformers for Predicting Wavefunctions in Time-Dependent Density Functional Theory

链接: https://arxiv.org/abs/2603.03511
作者: Xuan Zhang,Haiyang Yu,Chengdong Wang,Jacob Helwig,Shuiwang Ji,Xiaofeng Qian
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:We aim to learn wavefunctions simulated by time-dependent density functional theory (TDDFT), which can be efficiently represented as linear combination coefficients of atomic orbitals. In real-time TDDFT, the electronic wavefunctions of a molecule evolve over time in response to an external excitation, enabling first-principles predictions of physical properties such as optical absorption, electron dynamics, and high-order response. However, conventional real-time TDDFT relies on time-consuming propagation of all occupied states with fine time steps. In this work, we propose OrbEvo, which is based on an equivariant graph transformer architecture and learns to evolve the full electronic wavefunction coefficients across time steps. First, to account for external field, we design an equivariant conditioning to encode both strength and direction of external electric field and break the symmetry from SO(3) to SO(2). Furthermore, we design two OrbEvo models, OrbEvo-WF and OrbEvo-DM, using wavefunction pooling and density matrix as interaction method, respectively. Motivated by the central role of the density functional in TDDFT, OrbEvo-DM encodes the density matrix aggregated from all occupied electronic states into feature vectors via tensor contraction, providing a more intuitive approach to learn the time evolution operator. We adopt a training strategy specifically tailored to limit the error accumulation of time-dependent wavefunctions over autoregressive rollout. To evaluate our approach, we generate TDDFT datasets consisting of 5,000 different molecules in the QM9 dataset and 1,500 molecular configurations of the malonaldehyde molecule in the MD17 dataset. Results show that our OrbEvo model accurately captures quantum dynamics of excited states under external field, including time-dependent wavefunctions, time-dependent dipole moment, and optical absorption spectra.

[LG-53] Solving adversarial examples requires solving exponential misalignment

链接: https://arxiv.org/abs/2603.03507
作者: Alessandro Salvatore,Stanislav Fort,Surya Ganguli
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Adversarial attacks - input perturbations imperceptible to humans that fool neural networks - remain both a persistent failure mode in machine learning, and a phenomenon with mysterious origins. To shed light, we define and analyze a network’s perceptual manifold (PM) for a class concept as the space of all inputs confidently assigned to that class by the network. We find, strikingly, that the dimensionalities of neural network PMs are orders of magnitude higher than those of natural human concepts. Since volume typically grows exponentially with dimension, this suggests exponential misalignment between machines and humans, with exponentially many inputs confidently assigned to concepts by machines but not humans. Furthermore, this provides a natural geometric hypothesis for the origin of adversarial examples: because a network’s PM fills such a large region of input space, any input will be very close to any class concept’s PM. Our hypothesis thus suggests that adversarial robustness cannot be attained without dimensional alignment of machine and human PMs, and therefore makes strong predictions: both robust accuracy and distance to any PM should be negatively correlated with the PM dimension. We confirmed these predictions across 18 different networks of varying robust accuracy. Crucially, we find even the most robust networks are still exponentially misaligned, and only the few PMs whose dimensionality approaches that of human concepts exhibit alignment to human perception. Our results connect the fields of alignment and adversarial examples, and suggest the curse of high dimensionality of machine PMs is a major impediment to adversarial robustness.
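The abstract does not specify how perceptual-manifold dimensionality is estimated; one generic approach is PCA over a cloud of inputs confidently assigned to a class, counting the components needed to explain most of the variance. The sketch below is an illustrative stand-in under that assumption, not the paper's estimator.

```python
import numpy as np

def local_pca_dimension(points, var_threshold=0.95):
    """Estimate the intrinsic dimension of a point cloud as the number
    of principal components needed to capture `var_threshold` of the
    variance. A generic stand-in for a perceptual-manifold dimension
    estimate."""
    centered = points - points.mean(axis=0)
    # Singular values of the centered cloud give per-component variance.
    svals = np.linalg.svd(centered, compute_uv=False)
    var = svals**2 / np.sum(svals**2)
    return int(np.searchsorted(np.cumsum(var), var_threshold) + 1)

rng = np.random.default_rng(0)
# A 3-dimensional subspace embedded in a 20-dimensional input space,
# mimicking inputs a network confidently assigns to one class.
basis = rng.normal(size=(3, 20))
cloud = rng.normal(size=(500, 3)) @ basis
dim = local_pca_dimension(cloud)
```

On this synthetic cloud the estimator recovers the planted dimension 3; the paper's point is that for real networks the analogous estimate is orders of magnitude larger than for human concepts.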

[LG-54] When Small Variations Become Big Failures: Reliability Challenges in Compute-in-Memory Neural Accelerators

链接: https://arxiv.org/abs/2603.03491
作者: Yifan Qin,Jiahao Zheng,Zheyu Yan,Wujie Wen,Xiaobo Sharon Hu,Yiyu Shi
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 2026 International VLSI Symposium on Technology, Systems and Applications (VLSI TSA)

点击查看摘要

Abstract:Compute-in-memory (CiM) architectures promise significant improvements in energy efficiency and throughput for deep neural network acceleration by alleviating the von Neumann bottleneck. However, their reliance on emerging non-volatile memory devices introduces device-level non-idealities, such as write variability, conductance drift, and stochastic noise, that fundamentally challenge reliability, predictability, and safety, especially in safety-critical applications. This talk examines the reliability limits of CiM-based neural accelerators and presents a series of techniques that bridge device physics, architecture, and learning algorithms to address these challenges. We first demonstrate that even small device variations can lead to disproportionately large accuracy degradation and catastrophic failures in safety-critical inference workloads, revealing a critical gap between average-case evaluations and worst-case behavior. Building on this insight, we introduce SWIM, a selective write-verify mechanism that strategically applies verification only where it is most impactful, significantly improving reliability while maintaining CiM’s efficiency advantages. Finally, we explore a learning-centric solution that improves realistic worst-case performance by training neural networks with right-censored Gaussian noise, aligning training assumptions with hardware-induced variability and enabling robust deployment without excessive hardware overhead. Together, these works highlight the necessity of cross-layer co-design for CiM accelerators and provide a principled path toward dependable, efficient neural inference on emerging memory technologies, paving the way for their adoption in safety- and reliability-critical systems.
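The right-censored Gaussian noise mentioned at the end can be sketched directly: sample Gaussian perturbations and censor the right tail at a cutoff, so that training never sees deviations larger than the hardware's worst case in that direction. The parameters and the multiplicative-injection choice below are illustrative assumptions, not the talk's exact recipe.

```python
import numpy as np

def right_censored_gaussian(shape, sigma=0.1, cutoff=0.2, rng=None):
    """Sample Gaussian perturbations whose right tail is censored at
    `cutoff`. Sigma and cutoff values are illustrative."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=shape)
    return np.minimum(noise, cutoff)  # censor the right tail only

rng = np.random.default_rng(1)
weights = rng.normal(size=(64, 32))
# One noisy view of the weights, as could be used during a training step
# to mimic multiplicative conductance variation.
noisy = weights * (1.0 + right_censored_gaussian(weights.shape, rng=rng))
```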

[LG-55] Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

链接: https://arxiv.org/abs/2603.03480
作者: Harin Lee,Kevin Jamieson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of \tilde{\mathcal{O}}(H \sqrt{D_{\max} S A K}), where S and A are the cardinalities of the state and action spaces, H is the time horizon, K is the number of episodes, and D_{\max} is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, whose transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.
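The augmentation method referred to above is the standard trick of enlarging the state with the actions taken since the last observation. The wrapper below is an illustrative sketch of that construction only; the class name, API, and padding convention are ours, not the paper's.

```python
import numpy as np

class DelayedObservationWrapper:
    """Sketch of state augmentation for observation delays: the agent
    acts on (last observed state, actions taken since then)."""
    def __init__(self, n_states, delay):
        self.n_states, self.delay = n_states, delay
        self.pending = []    # actions taken since the last observation
        self.last_obs = 0    # most recently observed state
    def augmented_state(self):
        # Pad the action buffer to a fixed length so the augmented
        # state space is finite: at most S * A^delay states.
        buf = self.pending + [-1] * (self.delay - len(self.pending))
        return (self.last_obs, tuple(buf))
    def step(self, action, new_obs=None):
        self.pending.append(action)
        if new_obs is not None:  # a delayed observation arrives
            self.last_obs = new_obs
            self.pending = []
        return self.augmented_state()

w = DelayedObservationWrapper(n_states=5, delay=2)
s1 = w.step(0)              # no observation has arrived yet
s2 = w.step(1, new_obs=3)   # the observation of state 3 arrives
```

The D_{\max} factor in the regret bound reflects exactly this blow-up of the augmented state space with the maximum delay length.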

[LG-56] Biased Generalization in Diffusion Models

链接: https://arxiv.org/abs/2603.03469
作者: Jerome Garnier-Brun,Luca Biggio,Davide Beltrame,Marc Mézard,Luca Saglietti
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Generalization in generative modeling is defined as the ability to learn an underlying distribution from a finite dataset and produce novel samples, with evaluation largely driven by held-out performance and perceived sample quality. In practice, training is often stopped at the minimum of the test loss, taken as an operational indicator of generalization. We challenge this viewpoint by identifying a phase of biased generalization during training, in which the model continues to decrease the test loss while favoring samples with anomalously high proximity to training data. By training the same network on two disjoint datasets and comparing the mutual distances of generated samples and their similarity to training data, we introduce a quantitative measure of bias and demonstrate its presence on real images. We then study the mechanism of bias, using a controlled hierarchical data model where access to exact scores and ground-truth statistics allows us to precisely characterize its onset. We attribute this phenomenon to the sequential nature of feature learning in deep networks, where coarse structure is learned early in a data-independent manner, while finer features are resolved later in a way that increasingly depends on individual training samples. Our results show that early stopping at the test loss minimum, while optimal under standard generalization criteria, may be insufficient for privacy-critical applications.
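The two-disjoint-datasets comparison described above can be distilled into a simple proximity statistic: how much closer generated samples sit to their own training set than to a disjoint set drawn from the same distribution. The score below is an illustrative version of that idea, not necessarily the exact statistic the paper uses.

```python
import numpy as np

def nn_dist(samples, reference):
    """Mean nearest-neighbour distance from each sample to a reference set."""
    d = np.linalg.norm(samples[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def bias_score(generated, train_a, train_b):
    """Proximity-based bias measure: ratio of nearest-neighbour distance
    to the model's own training set (train_a) versus a disjoint set
    (train_b). Values well below 1 signal anomalous proximity to
    training data; an unbiased generator would score near 1."""
    return nn_dist(generated, train_a) / nn_dist(generated, train_b)

rng = np.random.default_rng(0)
train_a = rng.normal(size=(200, 8))
train_b = rng.normal(size=(200, 8))
# A "biased" generator: training samples plus tiny noise.
generated = train_a[:50] + 0.01 * rng.normal(size=(50, 8))
score = bias_score(generated, train_a, train_b)
```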

[LG-57] [Re] FairDICE: A Gap Between Theory And Practice

链接: https://arxiv.org/abs/2603.03454
作者: Peter Adema,Karim Galliamov,Aleksey Evstratovskiy,Ross Geurts
类目: Machine Learning (cs.LG)
*备注: 12 pages, 8 figures in main text. Code at this https URL

点击查看摘要

Abstract:Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to, e.g., incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

[LG-58] A Short Note on a Variant of the Squint Algorithm

链接: https://arxiv.org/abs/2603.03409
作者: Haipeng Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This short note describes a simple variant of the Squint algorithm of Koolen and Van Erven [2015] for the classic expert problem. Via an equally simple modification of their proof, we prove that this variant ensures a regret bound that resembles the one shown in a recent work by Freund et al. [2026] for a variant of the NormalHedge algorithm [Chaudhuri et al., 2009].
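The Squint family of algorithms weights each expert by evidence of the form exp(η R − η² V), combined over learning rates η, where R is the cumulative instantaneous regret to that expert and V its cumulative squared regret. The sketch below illustrates this potential with a finite grid of η values in place of the original integral over a prior; it is illustrative of the Squint idea, not the note's specific variant.

```python
import numpy as np

def squint_weights(R, V, etas=None):
    """Squint-style expert weights: expert i gets evidence
    exp(eta*R_i - eta^2*V_i), averaged over a grid of learning rates.
    The grid bounds are illustrative."""
    etas = np.linspace(0.01, 0.5, 50) if etas is None else etas
    # evidence[i] = mean over eta of exp(eta*R_i - eta^2*V_i)
    ev = np.exp(np.outer(R, etas) - np.outer(V, etas**2)).mean(axis=1)
    return ev / ev.sum()

# Expert 0 has accumulated positive regret-to-it (it has been better than
# the algorithm), so it receives the larger weight.
w = squint_weights(R=np.array([5.0, -1.0]), V=np.array([10.0, 10.0]))
```

The variance term V makes the weights second-order: an expert with the same cumulative regret but noisier per-round performance is trusted less.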

[LG-59] Towards Improved Sentence Representations using Token Graphs ICLR2026

链接: https://arxiv.org/abs/2603.03389
作者: Krishna Sri Ipsit Mantri,Carola-Bibiane Schönlieb,Zorah Lähner,Moshe Eliasof
类目: Machine Learning (cs.LG)
*备注: ICLR 2026, 29 Pages, 17 Tables, 5 Figures

点击查看摘要

Abstract:Obtaining a single-vector representation from a Large Language Model’s (LLM) token-level outputs is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model’s self-attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token-similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. Furthermore, it is competitive with state-of-the-art techniques on benchmarks like GLUE and MTEB with 20x fewer trainable parameters and speeds up the training time by over 100x compared with parameter-efficient fine-tuning methods. Supported by a theoretical analysis of its expressive power, our work shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs. Our code is published at this https URL.
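The relational-learning-then-aggregation recipe can be sketched with a parameter-free stand-in for GLOT's learned components: build a token-similarity kNN graph, smooth token vectors with one step of degree-normalized message passing, and mean-pool. The actual model uses a learned latent graph and a GNN; everything below is an illustrative simplification.

```python
import numpy as np

def graph_pool(tokens, k=3):
    """Structure-aware pooling sketch: kNN token-similarity graph,
    one normalized message-passing step, then mean readout."""
    sim = tokens @ tokens.T                      # token-similarity matrix
    # Keep each token's top-k most similar tokens as graph edges.
    adj = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]
    np.put_along_axis(adj, idx, 1.0, axis=1)
    adj = np.maximum(adj, adj.T)                 # symmetrize the graph
    deg = adj.sum(axis=1, keepdims=True)
    smoothed = (adj / deg) @ tokens              # one message-passing step
    return smoothed.mean(axis=0)                 # readout: mean pooling

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 16))               # 12 token embeddings
pooled = graph_pool(tokens)
```

Intuitively, smoothing over the similarity graph lets mutually coherent tokens reinforce each other before pooling, which is why such schemes resist dilution by distractor tokens better than plain mean pooling.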

[LG-60] SELDON: Supernova Explosions Learned by Deep ODE Networks AAAI2026 AAAI

链接: https://arxiv.org/abs/2603.04392
作者: Jiezhong Wu,Jack O’Brien,Jennifer Li,M. S. Krafczyk,Ved G. Shah,Amanda R. Wasserman,Daniel W. Apley,Gautham Narayan,Noelle I. Samia
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Accepted at AAAI 2026 (Proceedings of the AAAI Conference on Artificial Intelligence)

点击查看摘要

Abstract:The discovery rate of optical transients will explode to 10 million public alerts per night once the Vera C. Rubin Observatory’s Legacy Survey of Space and Time comes online, overwhelming the traditional physics-based inference pipelines. A continuous-time forecasting AI model is of interest because it can deliver millisecond-scale inference for thousands of objects per day, whereas legacy MCMC codes need hours per object. In this paper, we propose SELDON, a new continuous-time variational autoencoder for panels of sparse and irregularly time-sampled (gappy) astrophysical light curves that are nonstationary, heteroscedastic, and inherently dependent. SELDON combines a masked GRU-ODE encoder with a latent neural ODE propagator and an interpretable Gaussian-basis decoder. The encoder learns to summarize panels of imbalanced and correlated data even when only a handful of points are observed. The neural ODE then integrates this hidden state forward in continuous time, extrapolating to future unseen epochs. This extrapolated time series is further encoded by deep sets to a latent distribution that is decoded to a weighted sum of Gaussian basis functions, the parameters of which are physically meaningful. Such parameters (e.g., rise time, decay rate, peak flux) directly drive downstream prioritization of spectroscopic follow-up for astrophysical surveys. Beyond astronomy, the architecture of SELDON offers a generic recipe for interpretable and continuous-time sequence modeling in any time domain where data are multivariate, sparse, heteroscedastic, and irregularly spaced.
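The interpretable Gaussian-basis decoder admits a compact sketch: a light curve is represented as a weighted sum of Gaussian basis functions whose parameters (amplitude, peak time, width) are physically readable. The parameter names and values below are illustrative, not SELDON's learned outputs.

```python
import numpy as np

def gaussian_basis_decoder(t, amps, centers, widths):
    """Decode a light curve as a weighted sum of Gaussian basis
    functions: flux(t) = sum_k amps[k] * exp(-0.5*((t - centers[k]) /
    widths[k])^2)."""
    t = np.asarray(t)[:, None]                          # (T, 1)
    basis = np.exp(-0.5 * ((t - centers) / widths)**2)  # (T, K)
    return basis @ amps                                 # flux at each time

t = np.linspace(0.0, 100.0, 200)
# A fast rise-and-decay component plus a broad, faint tail.
flux = gaussian_basis_decoder(t, amps=np.array([10.0, 2.0]),
                              centers=np.array([20.0, 60.0]),
                              widths=np.array([5.0, 25.0]))
```

Because the decoder's parameters map directly onto quantities like rise time and peak flux, downstream follow-up prioritization can read them off without post-hoc curve fitting.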

[LG-61] Semi-Supervised Generative Learning via Latent Space Distribution Matching

链接: https://arxiv.org/abs/2603.04223
作者: Kwong Yu Chong,Long Feng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Latent Space Distribution Matching (LSDM), a novel framework for semi-supervised generative modeling of conditional distributions. LSDM operates in two stages: (i) learning a low-dimensional latent space from both paired and unpaired data, and (ii) performing joint distribution matching in this space via the 1-Wasserstein distance, using only paired data. This two-step approach minimizes an upper bound on the 1-Wasserstein distance between joint distributions, reducing reliance on scarce paired samples while enabling fast one-step generation. Theoretically, we establish non-asymptotic error bounds and demonstrate a key benefit of unpaired data: enhanced geometric fidelity in generated outputs. Furthermore, by extending the scope of its two core steps, LSDM provides a coherent statistical perspective that connects to a broad class of latent-space approaches. Notably, Latent Diffusion Models (LDMs) can be viewed as a variant of LSDM, in which joint distribution matching is achieved indirectly via score matching. Consequently, our results also provide theoretical insights into the consistency of LDMs. Empirical evaluations on real-world image tasks, including class-conditional generation and image super-resolution, demonstrate the effectiveness of LSDM in leveraging unpaired data to enhance generation quality.
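A building block of the second stage, matching distributions under the 1-Wasserstein distance, has a closed-form empirical estimator in one dimension: the mean absolute difference of sorted samples. LSDM works in a learned multi-dimensional latent space, so the sketch below is only a one-dimensional illustration of the matching objective.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two 1-D samples of equal
    size: mean absolute difference of order statistics."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=2000)
b = rng.normal(0.5, 1.0, size=2000)
# For two equal-variance Gaussians the true W1 equals the mean shift, 0.5.
w = wasserstein_1d(a, b)
```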

[LG-62] Bayesian Adversarial Privacy

链接: https://arxiv.org/abs/2603.04199
作者: Cameron Bell,Timothy Johnston,Antoine Luciano,Christian P Robert
类目: Statistics Theory (math.ST); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Theoretical and applied research into privacy encompasses an incredibly broad swathe of differing approaches, emphasis and aims. This work introduces a new quantitative notion of privacy that is both contextual and specific. We argue that it provides a more meaningful notion of privacy than the widely utilised framework of differential privacy and a more explicit and rigorous formulation than what is commonly used in statistical disclosure theory. Our definition relies on concepts inherent to standard Bayesian decision theory, while departing from it in several important respects. In particular, the party controlling the release of sensitive information should make disclosure decisions from the prior viewpoint, rather than conditional on the data, even when the data is itself observed. Illuminating toy examples and computational methods are discussed in high detail in order to highlight the specificities of the method.

[LG-63] Stable and Steerable Sparse Autoencoders with Weight Regularization

链接: https://arxiv.org/abs/2603.04198
作者: Piotr Jedryszek,Oliver M. Crook
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we studied weight regularization by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, it dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increased the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success becomes better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.
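The regularized TopK objective studied here is easy to sketch: reconstruction error through a TopK bottleneck plus a small L2 penalty on encoder and decoder weights. The shapes, hyperparameters, and numpy-only formulation below are illustrative; the paper trains with gradient descent on language-model activations.

```python
import numpy as np

def topk_sae_loss(x, W_enc, W_dec, k=8, weight_l2=1e-3):
    """TopK sparse-autoencoder objective with an L2 weight penalty:
    MSE(x, decode(topk(encode(x)))) + weight_l2 * ||W||^2."""
    acts = x @ W_enc                             # (batch, n_features)
    # TopK: keep only the k largest activations in each row.
    thresh = np.sort(acts, axis=1)[:, -k][:, None]
    sparse = np.where(acts >= thresh, acts, 0.0)
    recon = sparse @ W_dec
    mse = np.mean((x - recon)**2)
    reg = weight_l2 * (np.sum(W_enc**2) + np.sum(W_dec**2))
    return mse + reg, sparse

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32))
W_enc = 0.1 * rng.normal(size=(32, 128))
W_dec = 0.1 * rng.normal(size=(128, 32))
loss, codes = topk_sae_loss(x, W_enc, W_dec)
```

Note that the penalty acts on the weights, not the activations: sparsity is already enforced by TopK, and the paper's finding is that the extra weight shrinkage stabilizes which features are learned across seeds.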

[LG-64] Exploiting Subgradient Sparsity in Max-Plus Neural Networks

链接: https://arxiv.org/abs/2603.04133
作者: Ikhlas Enaieh(LTCI, S2A),Olivier Fercoq(S2A, LTCI)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Neural Networks are powerful tools for solving machine learning problems, but their training often involves dense and costly parameter updates. In this work, we use a novel Max-Plus neural architecture in which classical addition and multiplication are replaced with maximum and summation operations respectively. This is a promising architecture in terms of interpretability, but its training is challenging. A particular feature is that this algebraic structure naturally induces sparsity in the subgradients, as only neurons that contribute to the maximum affect the loss. However, standard backpropagation fails to exploit this sparsity, leading to unnecessary computations. In this work, we focus on the minimization of the worst sample loss which transfers this sparsity to the optimization loss. To address this, we propose a sparse subgradient algorithm that explicitly exploits the algebraic sparsity. By tailoring the optimization procedure to the non-smooth nature of Max-Plus models, our method achieves more efficient updates while retaining theoretical guarantees. This highlights a principled path toward bridging algebraic structure and scalable learning.
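The subgradient sparsity described above follows directly from the layer's algebra: with outputs y_j = max_i (x_i + W[i, j]), only the argmax input contributes to each output, so each column of the weight subgradient has a single nonzero entry. The sketch below shows this structure; it is not the paper's worst-sample-loss training algorithm.

```python
import numpy as np

def maxplus_forward(x, W):
    """Max-Plus layer: y_j = max_i (x_i + W[i, j]). Returns outputs and
    the winning input index per output."""
    scores = x[:, None] + W           # (n_in, n_out)
    winners = scores.argmax(axis=0)   # one contributing input per output
    return scores.max(axis=0), winners

def maxplus_subgrad(x, W, grad_out):
    """Sparse subgradient w.r.t. W: only the winning (i, j) entries
    receive gradient, one nonzero per output column."""
    _, winners = maxplus_forward(x, W)
    g = np.zeros_like(W)
    g[winners, np.arange(W.shape[1])] = grad_out
    return g

rng = np.random.default_rng(0)
x, W = rng.normal(size=5), rng.normal(size=(5, 3))
g = maxplus_subgrad(x, W, grad_out=np.ones(3))
```

A dense backpropagation implementation would touch all 15 entries of W here; exploiting the structure touches only 3, which is the inefficiency the paper's sparse subgradient algorithm removes.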

[LG-65] Fermi-Dirac thermal measurements: A framework for quantum hypothesis testing and semidefinite optimization

链接: https://arxiv.org/abs/2603.04061
作者: Nana Liu,Mark M. Wilde
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 35 pages, 3 figures

点击查看摘要

Abstract:Quantum measurements are the means by which we recover messages encoded into quantum states. They are at the forefront of quantum hypothesis testing, wherein the goal is to perform an optimal measurement for arriving at a correct conclusion. Mathematically, a measurement operator is Hermitian with eigenvalues in [0,1]. By noticing that this constraint on each eigenvalue is the same as that imposed on fermions by the Pauli exclusion principle, we interpret every eigenmode of a measurement operator as an independent effective fermionic mode. Under this perspective, various objective functions in quantum hypothesis testing can be viewed as the total expected energy associated with these fermionic occupation numbers. By instead fixing a temperature and minimizing the total expected fermionic free energy, we find that optimal measurements for these modified objective functions are Fermi-Dirac thermal measurements, wherein their eigenvalues are specified by Fermi-Dirac distributions. In the low-temperature limit, their performance closely approximates that of optimal measurements for quantum hypothesis testing, and we show that their parameters can be learned by classical or hybrid quantum-classical optimization algorithms. This leads to a new quantum machine-learning model, termed Fermi-Dirac machines, consisting of parameterized Fermi-Dirac thermal measurements-an alternative to quantum Boltzmann machines based on thermal states. Beyond hypothesis testing, we show how general semidefinite optimization problems can be solved using this approach, leading to a novel paradigm for semidefinite optimization on quantum computers, in which the goal is to implement thermal measurements rather than prepare thermal states. Finally, we propose quantum algorithms for implementing Fermi-Dirac thermal measurements, and we also propose second-order hybrid quantum-classical optimization algorithms.
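The defining construction is concrete enough to sketch: take a Hermitian "energy" operator and map each eigenvalue E through the Fermi-Dirac distribution 1/(1 + exp((E − μ)/T)). The result is Hermitian with eigenvalues in [0, 1], hence a valid measurement operator. The operator, chemical potential, and temperature below are illustrative.

```python
import numpy as np

def fermi_dirac_measurement(H, mu=0.0, temperature=0.5):
    """Build a Fermi-Dirac thermal measurement operator from a Hermitian
    energy operator H: eigenvalues of H are mapped through the
    Fermi-Dirac distribution, so the result has spectrum in (0, 1)."""
    evals, evecs = np.linalg.eigh(H)
    occ = 1.0 / (1.0 + np.exp((evals - mu) / temperature))
    # Reassemble V diag(occ) V^dagger.
    return (evecs * occ) @ evecs.conj().T

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = (A + A.T) / 2                  # a random Hermitian "energy" operator
M = fermi_dirac_measurement(H)
eigs = np.linalg.eigvalsh(M)
```

In the low-temperature limit the occupations approach 0/1 and M approaches a projector, recovering the sharp measurements familiar from quantum hypothesis testing.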

[LG-66] Invariance-Based Dynamic Regret Minimization

链接: https://arxiv.org/abs/2603.03843
作者: Margherita Lazzaretto,Jonas Peters,Niklas Pfister
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 32 pages, 7 figures

点击查看摘要

Abstract:We consider stochastic non-stationary linear bandits where the linear parameter connecting contexts to the reward changes over time. Existing algorithms in this setting localize the policy by gradually discarding or down-weighting past data, effectively shrinking the time horizon over which learning can occur. However, in many settings historical data may still carry partial information about the reward model. We propose to leverage such data while adapting to changes, by assuming the reward model decomposes into stationary and non-stationary components. Based on this assumption, we introduce ISD-linUCB, an algorithm that uses past data to learn invariances in the reward model and subsequently exploits them to improve online performance. We show both theoretically and empirically that leveraging invariance reduces the problem dimensionality, yielding significant regret improvements in fast-changing environments when sufficient historical data is available.

[LG-67] Non-Invasive Reconstruction of Cardiac Activation Dynamics Using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2603.03832
作者: Nathan Dermul,Hans Dierckx
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cardiac arrhythmogenesis is governed by complex electromechanical interactions that are not directly observable in vivo, motivating the development of non-invasive computational approaches for reconstructing three-dimensional activation dynamics. We present a physics-informed neural network framework for recovering cardiac activation patterns, active tension propagation, deformation fields, and hydrostatic pressure from measurable deformation data in simplified left ventricular geometries. Our approach integrates nonlinear anisotropic constitutive modeling, heterogeneous fiber orientation, weak formulations of the governing mechanics, and finite-element-based loss functions to embed physical constraints directly into training. We demonstrate that the proposed framework accurately reconstructs spatiotemporal activation dynamics under varying levels of measurement noise and reduced spatial resolution, while preserving global propagation patterns and activation timing. By coupling mechanistic modeling with data-driven inference, this method establishes a pathway toward patient-specific, non-invasive reconstruction of cardiac activation, with potential applications in digital phenotyping and computational support for arrhythmia assessment.

[LG-68] Observationally Informed Adaptive Causal Experimental Design

链接: https://arxiv.org/abs/2603.03785
作者: Erdun Gao,Liang Zhang,Jake Fawkes,Aoqi Zuo,Wenqin Liu,Haoxuan Li,Mingming Gong,Dino Sejdinovic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Randomized Controlled Trials (RCTs) represent the gold standard for causal inference yet remain a scarce resource. While large-scale observational data is often available, it is utilized only for retrospective fusion, and remains discarded in prospective trial design due to bias concerns. We argue this “tabula rasa” data acquisition strategy is fundamentally inefficient. In this work, we propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior. This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias. To operationalize this, we introduce the R-Design framework. Theoretically, we establish two key advantages: (1) a structural efficiency gap, proving that estimating smooth residual contrasts admits strictly faster convergence rates than reconstructing full outcomes; and (2) information efficiency, where we quantify the redundancy in standard parameter-based acquisition (e.g., BALD), demonstrating that such baselines waste budget on task-irrelevant nuisance uncertainty. We propose R-EPIG (Residual Expected Predictive Information Gain), a unified criterion that directly targets the causal estimand, minimizing residual uncertainty for estimation or clarifying decision boundaries for policy. Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines, confirming that repairing a biased model is far more efficient than learning one from scratch.

[LG-69] Riemannian Langevin Dynamics: Strong Convergence of Geometric Euler-Maruyama Scheme

链接: https://arxiv.org/abs/2603.03626
作者: Zhiyuan Zhan,Masashi Sugiyama
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Low-dimensional structure in real-world data plays an important role in the success of generative models, which motivates diffusion models defined on intrinsic data manifolds. Such models are driven by stochastic differential equations (SDEs) on manifolds, which raises the need for convergence theory of numerical schemes for manifold-valued SDEs. In Euclidean space, the Euler–Maruyama (EM) scheme achieves strong convergence with order 1/2, but an analogous result for manifold discretizations is less understood in general settings. In this work, we study a geometric version of the EM scheme for SDEs on Riemannian manifolds and prove strong convergence with order 1/2 under geometric and regularity conditions. As an application, we obtain a Wasserstein bound for sampling on manifolds via the geometric EM discretization of Riemannian Langevin dynamics.
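On the unit sphere, a geometric EM step can be sketched concretely: form the Euclidean EM increment, project it onto the tangent space at the current point, and map it back to the manifold with the exponential map. The drift, step size, and this particular projection convention are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere: follow the geodesic from unit
    vector p in tangent direction v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def geometric_em_step(p, drift, dt, rng):
    """One geometric Euler-Maruyama step: Euclidean increment, projected
    to the tangent space at p, then exponentiated back to the sphere."""
    noise = np.sqrt(dt) * rng.normal(size=p.shape)
    incr = drift(p) * dt + noise
    tangent = incr - np.dot(incr, p) * p   # project onto tangent space
    return sphere_exp(p, tangent)

rng = np.random.default_rng(0)
p = np.array([1.0, 0.0, 0.0])
drift = lambda x: -x                       # toy Langevin-style drift
for _ in range(100):
    p = geometric_em_step(p, drift, dt=0.01, rng=rng)
```

Unlike a naive Euclidean EM step followed by renormalization, the exponential map keeps each iterate exactly on the manifold, which is the property the strong-convergence analysis relies on.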

[LG-70] Controllable Generative Sandbox for Causal Inference ICML2026

链接: https://arxiv.org/abs/2603.03587
作者: Qi Zhang,Harsh Parikh,Ashley Naimi,Razieh Nabi,Christopher Kim,Timothy Lash
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 34 pages, 15 figures. Submitted to ICML 2026. Code available at this https URL

点击查看摘要

Abstract:Method validation and study design in causal inference rely on synthetic data with known counterfactuals. Existing simulators trade off distributional realism, the ability to capture mixed-type and multimodal tabular data, against causal controllability, including explicit control over overlap, unmeasured confounding, and treatment effect heterogeneity. We introduce CausalMix, a variational generative framework that closes this gap by coupling a mixture of Gaussian latent priors with data-type-specific decoders for continuous, binary, and categorical variables. The model incorporates explicit causal controls: an overlap regularizer shaping propensity-score distributions, alongside direct parameterizations of confounding strength and effect heterogeneity. This unified objective preserves fidelity to the observed data while enabling factorial manipulation of causal mechanisms, allowing overlap, confounding strength, and treatment effect heterogeneity to be varied independently at design time. Across benchmarks, CausalMix achieves state-of-the-art distributional metrics on mixed-type tables while providing stable, fine-grained causal control. We demonstrate practical utility in a comparative safety study of metastatic castration-resistant prostate cancer treatments, using CausalMix to compare estimators under calibrated data-generating processes, tune hyperparameters, and conduct simulation-based power analyses under targeted treatment effect heterogeneity scenarios.

[LG-71] Quantifying Ranking Instability Across Evaluation Protocol Axes in Gene Regulatory Network Benchmarking

链接: https://arxiv.org/abs/2603.03493
作者: Ihor Kendiukhov
类目: Molecular Networks (q-bio.MN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Benchmark rankings are routinely used to justify scientific claims about method quality in gene regulatory network (GRN) inference, yet the stability of these rankings under plausible evaluation protocol choices is rarely examined. We present a systematic diagnostic framework for measuring ranking instability under protocol shift, including decomposition tools that separate base rate effects from discrimination effects. Using existing single-cell GRN benchmark outputs across three human tissues and six inference methods, we quantify pairwise reversal rates across four protocol axes: candidate set restriction (16.3 percent, 95 percent CI 11.0 to 23.4 percent), tissue context (19.3 percent), reference network choice (32.1 percent), and symbol mapping policy (0.0 percent). A permutation null confirms that observed reversal rates are far below random-order expectations (0.163 versus null mean 0.500), indicating partially stable but non-invariant ranking structure. Our decomposition reveals that reversals are driven by changes in the relative discrimination ability of methods rather than by base rate inflation, a finding that challenges a common implicit assumption in GRN benchmarking. We propose concrete reporting practices for stability-aware evaluation and provide a diagnostic toolkit for identifying method pairs at risk of reversal.
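The central diagnostic, the pairwise reversal rate, is simple to compute: the fraction of method pairs whose relative order flips between two evaluation protocols. The sketch below shows a minimal version; the method names are illustrative and the paper's full framework adds confidence intervals and the permutation null.

```python
import numpy as np
from itertools import combinations

def reversal_rate(rank_a, rank_b):
    """Fraction of method pairs whose relative order flips between two
    protocols. rank_a/rank_b map each method name to its rank under the
    respective protocol."""
    methods = list(rank_a)
    pairs = list(combinations(methods, 2))
    flips = sum(
        (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n]) < 0
        for m, n in pairs
    )
    return flips / len(pairs)

# Illustrative rankings of four methods under two protocols: exactly one
# of the six pairs (the top two methods) swaps order.
rank_a = {"genie3": 1, "grnboost2": 2, "scenic": 3, "ppcor": 4}
rank_b = {"genie3": 2, "grnboost2": 1, "scenic": 3, "ppcor": 4}
rate = reversal_rate(rank_a, rank_b)
```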

[LG-72] Scalable Contrastive Causal Discovery under Unknown Soft Interventions

链接: https://arxiv.org/abs/2603.03411
作者: Mingxuan Zhang,Khushi Desai,Sopho Kevlishvili,Elham Azizi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Observational causal discovery is only identifiable up to the Markov equivalence class. While interventions can reduce this ambiguity, in practice interventions are often soft with multiple unknown targets. In many realistic scenarios, only a single intervention regime is observed. We propose a scalable causal discovery model for paired observational and interventional settings with shared underlying causal structure and unknown soft interventions. The model aggregates subset-level PDAGs and applies contrastive cross-regime orientation rules to construct a globally consistent maximal PDAG under Meek closure, enabling generalization to both in-distribution and out-of-distribution settings. Theoretically, we prove that our model is sound with respect to a restricted \Psi-equivalence class induced solely by the information available in the subset-restricted setting. We further show that the model asymptotically recovers the corresponding identifiable PDAG and can orient additional edges compared to non-contrastive subset-restricted methods. Experiments on synthetic data demonstrate improved causal structure recovery, generalization to unseen graphs with held-out causal mechanisms, and scalability to larger graphs, with ablations supporting the theoretical results.

[LG-73] Surprisal-Rényi Free Energy

链接: https://arxiv.org/abs/2603.03405
作者: Shion Matsumoto,Raul Castillo,Benjamin Prada,Ankur Arjun Mali
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The forward and reverse Kullback-Leibler (KL) divergences arise as limiting objectives in learning and inference yet induce markedly different inductive biases that cannot be explained at the level of expectations alone. In this work, we introduce the Surprisal-Rényi Free Energy (SRFE), a log-moment-based functional of the likelihood ratio that lies outside the class of f-divergences. We show that SRFE recovers forward and reverse KL divergences as singular endpoint limits and derive local expansions around both limits in which the variance of the log-likelihood ratio appears as a first-order correction. This reveals an explicit mean-variance tradeoff governing departures from KL-dominated regimes. We further establish a Gibbs-type variational characterization of SRFE as the unique minimizer of a weighted sum of KL divergences and prove that SRFE directly controls large deviations of excess code-length via Chernoff-type bounds, yielding a precise Minimum Description Length interpretation. Together, these results identify SRFE as a variance- and tail-sensitive free-energy functional that clarifies the geometric and large-deviation structure underlying forward and reverse KL limits, without unifying or subsuming distinct learning frameworks.

[LG-74] Beyond Cross-Validation: Adaptive Parameter Selection for Kernel-Based Gradient Descents

Link: https://arxiv.org/abs/2603.03401
Authors: Xiaotong Liu, Yunwen Lei, Xiangyu Chang, Shao-Bo Lin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:

Click to view abstract

Abstract:This paper proposes a novel parameter selection strategy for kernel-based gradient descent (KGD) algorithms, integrating bias-variance analysis with the splitting method. We introduce the concept of empirical effective dimension to quantify iteration increments in KGD, deriving an adaptive parameter selection strategy that is implementable. Theoretical verifications are provided within the framework of learning theory. Utilizing the recently developed integral operator approach, we rigorously demonstrate that KGD, equipped with the proposed adaptive parameter selection strategy, achieves the optimal generalization error bound and adapts effectively to different kernels, target functions, and error metrics. Consequently, this strategy showcases significant advantages over existing parameter selection methods for KGD.
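For readers unfamiliar with the setting, kernel-based gradient descent treats the number of iterations as the regularization parameter, so "parameter selection" largely means choosing when to stop. The sketch below is a toy KGD loop with a stand-in stopping rule (stop when the per-iteration drop in training loss falls below a threshold); the paper's actual criterion, based on the empirical effective dimension, is not specified in the abstract, so this rule and all names here are illustrative assumptions.

```python
# Toy kernel gradient descent on a Gaussian kernel with an illustrative
# early-stopping rule. NOT the paper's adaptive strategy: the stopping
# criterion here (small loss improvement) is a stand-in assumption.
import numpy as np

def kgd(X, y, bandwidth=0.5, step=0.1, tol=1e-4, max_iter=500):
    # Gram matrix of the Gaussian kernel on the 1-D inputs
    K = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * bandwidth ** 2))
    alpha = np.zeros_like(y)
    prev = np.inf
    for _ in range(max_iter):
        residual = K @ alpha - y            # fitted values minus targets
        loss = float(np.mean(residual ** 2))
        if prev - loss < tol:               # stand-in stopping rule
            break
        prev = loss
        alpha -= step * residual            # functional (RKHS) gradient step
    return alpha, prev

# Toy regression: y = sin(x) sampled at 20 points.
X = np.linspace(0.0, 3.0, 20)
y = np.sin(X)
alpha, final_loss = kgd(X, y)
```

Because each iteration fits progressively less smooth components, stopping earlier acts like stronger regularization; an adaptive rule such as the paper's aims to stop at the iteration balancing bias (underfitting) against variance (overfitting), without the repeated refits that cross-validation requires.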

[LG-75] The Theory behind UMAP?

Link: https://arxiv.org/abs/2603.03375
Authors: David Wegmann
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Category Theory (math.CT)
*Comments: This article is derived from my master's thesis

Click to view abstract

Abstract:In 2018, McInnes et al. introduced a dimensionality reduction algorithm called UMAP, which enjoys wide popularity among data scientists. Their work introduces a finite variant of a functor called the metric realization, based on an unpublished draft by Spivak. This draft contains many errors, most of which are reproduced by McInnes et al. and subsequent publications. This article aims to repair these errors and provide a self-contained document with the full derivation of Spivak’s functors and McInnes et al.'s finite variant. We contribute an explicit description of the metric realization and related functors. At the end, we discuss the UMAP algorithm, as well as claims about properties of the algorithm and the correspondence of McInnes et al.'s finite variant to the UMAP algorithm.

[LG-76] Automated Measurement of Geniohyoid Muscle Thickness During Speech Using Deep Learning and Ultrasound INTERSPEECH2026

Link: https://arxiv.org/abs/2603.03350
Authors: Alisher Myrgyyassov, Bruce Xiao Wang, Yu Sun, Shuming Huang, Zhen Song, Min Ney Wong, Yongping Zheng
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments: 6 pages, including references and acknowledgements. Submitted to Interspeech 2026

Click to view abstract

Abstract:Manual measurement of muscle morphology from ultrasound during speech is time-consuming and limits large-scale studies. We present SMMA, a fully automated framework that combines deep-learning segmentation with skeleton-based thickness quantification to analyze geniohyoid (GH) muscle dynamics. Validation demonstrates near-human-level accuracy (Dice = 0.9037, MAE = 0.53 mm, r = 0.901). Application to Cantonese vowel production (N = 11) reveals systematic patterns: /a:/ shows significantly greater GH thickness (7.29 mm) than /i:/ (5.95 mm, p < 0.001, Cohen’s d > 1.3), suggesting greater GH activation during production of /a:/ than /i:/, consistent with its role in mandibular depression. Sex differences (5-8% greater in males) reflect anatomical scaling. SMMA achieves expert-validated accuracy while eliminating the need for manual annotation, enabling scalable investigations of speech motor control and objective assessment of speech and swallowing disorders.
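The validation metrics the abstract reports (Dice overlap for segmentation, MAE for thickness in mm) are standard and easy to compute; the sketch below is a generic implementation, not the authors' evaluation code.

```python
# Generic sketch of the two validation metrics reported for SMMA:
# Dice coefficient on binary masks, and mean absolute error on paired
# thickness measurements. Not the authors' code.

def dice(pred, truth):
    """Dice coefficient between two binary masks (flat 0/1 sequences)."""
    inter = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2 * inter / total if total else 1.0

def mae(a, b):
    """Mean absolute error between paired measurements (e.g. mm)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy masks: 3 of 4 foreground pixels agree -> Dice = 2*3/(4+4) = 0.75
print(dice([1, 1, 1, 1, 0], [1, 1, 1, 0, 1]))  # → 0.75
```

A Dice of 0.9037, as reported, means the predicted and manually traced GH masks share roughly 90% of their combined foreground, which is why the authors describe the accuracy as near-human-level.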

Attachment download

Click to download today's full paper list