This blog post lists the latest papers retrieved from Arxiv.org on 2026-04-02, updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from Arxiv.org each day, with a scheduled automatic update at around 12:30 every morning.

Tip: if a given day's list is not updated on time, either Arxiv published no new papers that day or the script failed. Fixes are made the same day whenever possible.

Table of Contents

Overview (2026-04-02)

663 papers are updated today, including:

  • Natural Language Processing: 106 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 173 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 136 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 164 papers (Machine Learning (cs.LG))
  • Multi-Agent Systems: 18 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 16 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 24 papers (Human-Computer Interaction (cs.HC))

Multi-Agent Systems

[MA-0] Collaborative Task and Path Planning for Heterogeneous Robotic Teams using Multi-Agent PPO

【Quick Read】: This paper addresses efficient collaborative planning for multi-robot teams facing complex target-allocation and scheduling problems in extraterrestrial exploration. The core challenge is maintaining efficient decision-making as task size grows, avoiding the steep computational cost that classical planners incur from combinatorial explosion. The key to the solution is a collaborative learning strategy based on Multi-Agent Proximal Policy Optimization (MAPPO), which recasts a combinatorial optimization problem that is intractable in real time as a trainable reinforcement-learning model, enabling online replanning, sharply reducing inference-time computation, and improving the extraction of scientific value.

Link: https://arxiv.org/abs/2604.01213
Authors: Matthias Rubio, Julia Richter, Hendrik Kolvenbach, Marco Hutter
Affiliations: Robotic Systems Lab (RSL), ETH Zürich
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 8 pages, 3 figures, associated code on this https URL

Abstract:Efficient robotic extraterrestrial exploration requires robots with diverse capabilities, ranging from scientific measurement tools to advanced locomotion. A robotic team enables the distribution of tasks over multiple specialized subsystems, each providing specific expertise to complete the mission. The central challenge lies in efficiently coordinating the team to maximize utilization and the extraction of scientific value. Classical planning algorithms scale poorly with problem size, leading to long planning cycles and high inference costs due to the combinatorial growth of possible robot-target allocations and possible trajectories. Learning-based methods are a viable alternative that move the scaling concern from runtime to training time, setting a critical step towards achieving real-time planning. In this work, we present a collaborative planning strategy based on Multi-Agent Proximal Policy Optimization (MAPPO) to coordinate a team of heterogeneous robots to solve a complex target allocation and scheduling problem. We benchmark our approach against single-objective optimal solutions obtained through exhaustive search and evaluate its ability to perform online replanning in the context of a planetary exploration scenario.
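MAPPO builds on PPO's clipped surrogate objective. As a rough illustrative sketch (not the authors' code; the ratios, advantages, and clip parameter eps = 0.2 below are invented toy values), the clipping step looks like:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# Toy numbers: a step that moved far off-policy (ratio 1.8) gets clipped,
# so its positive advantage contributes at most (1 + eps) * A.
ratios = np.array([1.0, 1.8, 0.5])
advantages = np.array([1.0, 1.0, -1.0])
obj = ppo_clip_objective(ratios, advantages)
```

In MAPPO this objective is optimized per agent with a centralized value function; the sketch shows only the clipping mechanics.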

[MA-1] Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

【Quick Read】: This paper tackles the detection of covert coordination in multi-agent systems, especially identifying collusion between agents under environment distribution shift. Text-level monitoring struggles to capture such interactions, whereas a model's internal activations may carry finer-grained coordination signals. The key contributions are the NARCBench benchmark and five probing techniques that aggregate per-agent deception scores to classify collusion at the group level. These probes achieve 1.00 AUROC in-distribution and 0.60–0.86 AUROC on zero-shot transfer to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. Different forms of collusion manifest differently in activation space, suggesting that multiple probing strategies should be combined. Preliminary evidence also indicates the signal is localized at the token level: a colluding agent's activations spike when processing the encoded parts of its partner's message. This points multi-agent interpretability toward extending white-box analysis from single models to aggregating signals across agents.

Link: https://arxiv.org/abs/2604.01151
Authors: Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt
Affiliations: University of Oxford; New York University
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60–0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent’s activations spiking specifically when processing the encoded parts of their partner’s message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at this https URL.
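The pipeline above scores each agent with a probe, aggregates per-agent scores to the group level, and reports AUROC. A minimal sketch of that shape follows; the max rule here is just one plausible aggregation, not necessarily any of the paper's five techniques, and the scores and labels are invented:

```python
import numpy as np

def group_score(per_agent_scores, agg="max"):
    """Aggregate per-agent deception-probe scores into one group-level
    collusion score (max is one plausible rule; the paper studies five)."""
    s = np.asarray(per_agent_scores, dtype=float)
    return s.max() if agg == "max" else s.mean()

def auroc(scores, labels):
    """AUROC via pairwise comparison: P(score_pos > score_neg), ties 1/2."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy data: each scenario yields one probe score per agent;
# label 1 = colluding scenario, 0 = benign.
scenarios = [[0.9, 0.2], [0.1, 0.2], [0.8, 0.7], [0.3, 0.1]]
labels = [1, 0, 1, 0]
group = [group_score(s) for s in scenarios]
score_auroc = auroc(group, labels)
```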

[MA-2] OrgAgent: Organize Your Multi-Agent System like a Company

【Quick Read】: This paper addresses how to organize multiple agents effectively to improve complex reasoning in multi-agent systems driven by large language models (LLMs). Existing work has not established an optimal organizational structure, leading to inefficient collaboration or wasted resources. The key is the OrgAgent framework: a company-style hierarchical multi-agent architecture that separates collaboration into a Governance Layer (planning and resource allocation), an Execution Layer (task solving and review), and a Compliance Layer (final-answer control). Through layered coordination, this structure achieves stable skill assignment, controlled information flow, and layer-by-layer verification, clearly outperforming flat collaboration by improving performance on several reasoning tasks while reducing token consumption.

Link: https://arxiv.org/abs/2604.01020
Authors: Yiru Wang, Xinyue Shen, Yaohui Han, Michael Backes, Pin-Yu Chen, Tsung-Yi Ho
Affiliations: The Chinese University of Hong Kong; IBM Research; CISPA Helmholtz Center for Information Security
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:While large language model-based multi-agent systems have shown strong potential for complex reasoning, how to effectively organize multiple agents remains an open question. In this paper, we introduce OrgAgent, a company-style hierarchical multi-agent framework that separates collaboration into governance, execution, and compliance layers. OrgAgent decomposes multi-agent reasoning into three layers: a governance layer for planning and resource allocation, an execution layer for task solving and review, and a compliance layer for final answer control. By evaluating the framework across reasoning tasks, LLMs, execution modes, and execution policies, we find that multi-agent systems organized in a company-style hierarchy generally outperform other organizational structures. Besides, hierarchical coordination also reduces token consumption relative to flat collaboration in most settings. For example, for GPT-OSS-120B, the hierarchical setting improves performance over flat multi-agent system by 102.73% while reducing token usage by 74.52% on SQuAD 2.0. Further analysis shows that hierarchy helps most when tasks benefit from stable skill assignment, controlled information flow, and layered verification. Overall, our findings highlight organizational structure as an important factor in multi-agent reasoning, shaping not only effectiveness and cost, but also coordination behavior.
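The three-layer decomposition can be pictured as a pipeline. The sketch below is a hypothetical structure inspired by the description above, not the authors' code: each "agent" is a plain function standing in for an LLM call.

```python
# Minimal sketch of a company-style three-layer pipeline (hypothetical;
# role names follow OrgAgent's description, the logic is invented).

def governance(task):
    """Governance layer: plan the task and allocate it to a worker role."""
    return {"task": task, "assigned_to": "solver", "plan": ["solve", "review"]}

def execution(assignment):
    """Execution layer: a solver drafts an answer and a reviewer checks it."""
    draft = f"answer({assignment['task']})"
    review_ok = len(draft) > 0          # stand-in for an LLM review step
    return {"draft": draft, "review_ok": review_ok}

def compliance(result):
    """Compliance layer: release the answer only if review passed."""
    return result["draft"] if result["review_ok"] else "ESCALATE"

def org_agent(task):
    return compliance(execution(governance(task)))

final = org_agent("2+2")
```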

[MA-3] Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

【Quick Read】: This paper addresses a bottleneck in developing generative AI digital assistants: the lack of realistic user-simulation frameworks. Existing approaches model apps as stateless tool-calling APIs, failing to capture the stateful, sequential nature of user interaction in digital environments and making realistic user simulation infeasible. The key is the Proactive Agent Research Environment (Pare), which models applications as finite state machines (FSMs) with state transitions and state-dependent action spaces, enabling active user simulation. Built on this foundation, the Pare-Bench benchmark contains 143 diverse tasks across communication, productivity, scheduling, and lifestyle apps, evaluating agents on context observation, goal inference, intervention timing, and multi-app orchestration.

Link: https://arxiv.org/abs/2604.00842
Authors: Deepak Nathani, Cheng Zhang, Chang Huan, Jiaming Shan, Yinfei Yang, Alkesh Patel, Zhe Gan, William Yang Wang, Michael Saxon, Xin Eric Wang
Affiliations: University of California, Santa Barbara; Apple; University of Washington
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 34 pages, 8 figures, 5 tables

Abstract:Proactive agents that anticipate user needs and autonomously execute tasks hold great promise as digital assistants, yet the lack of realistic user simulation frameworks hinders their development. Existing approaches model apps as flat tool-calling APIs, failing to capture the stateful and sequential nature of user interaction in digital environments and making realistic user simulation infeasible. We introduce Proactive Agent Research Environment (Pare), a framework for building and evaluating proactive agents in digital environments. Pare models applications as finite state machines with stateful navigation and state-dependent action space for the user simulator, enabling active user simulation. Building on this foundation, we present Pare-Bench, a benchmark of 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps, designed to test context observation, goal inference, intervention timing, and multi-app orchestration.
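An app-as-FSM with a state-dependent action space can be sketched in a few lines. This is an invented toy example in the spirit of Pare's description (the state and action names are made up), not the framework's API:

```python
# Minimal finite-state-machine app model with a state-dependent action space.

class AppFSM:
    def __init__(self, transitions, start):
        self.transitions = transitions   # state -> {action: next_state}
        self.state = start

    def actions(self):
        """Only actions legal in the current state are exposed."""
        return sorted(self.transitions[self.state])

    def step(self, action):
        if action not in self.transitions[self.state]:
            raise ValueError(f"{action!r} not available in {self.state!r}")
        self.state = self.transitions[self.state][action]
        return self.state

# A toy messaging app: a thread must be opened before anything can be sent.
app = AppFSM({
    "inbox":  {"open_thread": "thread"},
    "thread": {"send": "thread", "back": "inbox"},
}, start="inbox")

app.step("open_thread")
available = app.actions()   # state-dependent: now ["back", "send"]
```

A flat tool-calling API would expose "send" everywhere; the FSM makes it legal only inside a thread, which is exactly the statefulness the paper argues matters.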

[MA-4] Role Differentiation in a Coupled Resource Ecology under Multi-Level Selection

【Quick Read】: This paper asks how, under continual individual-level selection, a group can self-organize to avoid the tragedy of the commons, i.e., systemic collapse caused by many non-cooperating individuals competing for the same resource channel. The key is a computational model of multi-level selection in which group-level selection shapes a common substrate and mutation operator shared by all individuals, embedded in an embodied ecology where distinct resource channels are not physically segregated but are coupled through the same behavioral primitives, classified into a positive-sum intake channel and a zero-sum redistribution channel. The study shows that under population turnover driven by birth and death, this architecture spontaneously yields role differentiation and keeps both channels occupied, avoiding collapse into a single acquisition mode; zero-sum channel usage increases over generations even though it is not directly optimized by group-level selection. This suggests that multi-level selection can stabilize groups under continual individual selection through differentiated use of coupled channels.

Link: https://arxiv.org/abs/2604.00810
Authors: Siddharth Chaturvedi, Ahmed El-Gazzar, Marcel van Gerven
Affiliations: Radboud University
Subjects: Multiagent Systems (cs.MA)
Comments: 9 pages, 6 figures, 1 table

Abstract: A group of non-cooperating agents can succumb to the tragedy-of-the-commons if all of them seek to maximize the same resource channel to improve their viability. In nature, however, groups often avoid such collapses by differentiating into distinct roles that exploit different resource channels. It remains unclear how such coordination can emerge under continual individual-level selection alone. To address this, we introduce a computational model of multi-level selection, in which group-level selection shapes a common substrate and mutation operator shared by all group members undergoing individual-level selection. We also place this process in an embodied ecology where distinct resource channels are not segregated, but coupled through the same behavioral primitives. These channels are classified as a positive-sum intake channel and a zero-sum redistribution channel. We investigate whether such a setting can give rise to role differentiation under turnover driven by birth and death. We find that in a learned ecology, both channels remain occupied at the colony level, and the collapse into a single acquisition mode is avoided. Zero-sum channel usage increases over generations despite not being directly optimized by group-level selection. Channel occupancy also fluctuates over the lifetime of a boid. Ablation studies suggest that most baseline performance is carried by the inherited behavioral basis, while the learned variation process provides a smaller but systematic improvement prior to saturation. Together, the results suggest that multi-level selection can enable groups in a common-pool setting to circumvent tragedy-of-the-commons through differentiated use of coupled channels under continual turnover.

[MA-5] GRASP: Gradient Realignment via Active Shared Perception for Multi-Agent Collaborative Optimization

【Quick Read】: This paper addresses non-stationarity in multi-agent reinforcement learning caused by concurrent policy updates, which produces persistent environmental fluctuations; existing approaches such as Centralized Training with Decentralized Execution (CTDE) and sequential update schemes still suffer from slow convergence and equilibrium oscillations. The key is the Gradient Realignment via Active Shared Perception (GRASP) framework, which derives a consensus gradient from the agents' independent gradients so that agents actively perceive each other's policy updates and optimize team collaboration, defining a generalized Bellman equilibrium as a stable objective. Theoretically, the Kakutani Fixed-Point Theorem is used to prove that the consensus direction $ u^* $ guarantees the existence and attainability of this equilibrium; experiments on the StarCraft II Multi-Agent Challenge (SMAC) and Google Research Football (GRF) demonstrate the framework's scalability and strong performance.

Link: https://arxiv.org/abs/2604.00717
Authors: Sihan Zhou, Tiantian He, Yifan Lu, Yaqing Hou, Yew-Soon Ong
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Non-stationarity arises from concurrent policy updates and leads to persistent environmental fluctuations. Existing approaches like Centralized Training with Decentralized Execution (CTDE) and sequential update schemes mitigate this issue. However, since the perception of the policies of other agents remains dependent on sampling environmental interaction data, the agent essentially operates in a passive perception state. This inevitably triggers equilibrium oscillations and significantly slows the convergence speed of the system. To address this issue, we propose Gradient Realignment via Active Shared Perception (GRASP), a novel framework that defines generalized Bellman equilibrium as a stable objective for policy evolution. The core mechanism of GRASP involves utilizing the independent gradients of agents to derive a defined consensus gradient, enabling agents to actively perceive policy updates and optimize team collaboration. Theoretically, we leverage the Kakutani Fixed-Point Theorem to prove that the consensus direction u^* guarantees the existence and attainability of this equilibrium. Extensive experiments on StarCraft II Multi-Agent Challenge (SMAC) and Google Research Football (GRF) demonstrate the scalability and promising performance of the framework.
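The abstract does not spell out how the consensus gradient is derived, so the sketch below is only a plausible stand-in: take the normalized mean of the agents' independent gradients as the consensus direction, then remove each agent's component that opposes it (a PCGrad-style projection, which may differ from GRASP's actual rule).

```python
import numpy as np

def consensus_direction(grads):
    """Stand-in consensus direction: normalized mean of agent gradients
    (the paper's exact derivation may differ)."""
    u = np.mean(grads, axis=0)
    norm = np.linalg.norm(u)
    return u / norm if norm > 0 else u

def realign(grads):
    """Project out each agent's component that conflicts with the
    consensus, so no agent's update opposes the shared direction."""
    u = consensus_direction(grads)
    out = []
    for g in grads:
        if g @ u < 0:                 # gradient conflicts with consensus
            g = g - (g @ u) * u       # remove the opposing component
        out.append(g)
    return np.array(out), u

grads = np.array([[1.0, 0.0], [0.6, 0.8], [-0.2, 0.1]])
aligned, u = realign(grads)
```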

[MA-6] Lipschitz Dueling Bandits over Continuous Action Spaces

【Quick Read】: This paper studies stochastic dueling bandits over continuous action spaces with Lipschitz structure, where only comparative (relative) feedback is available. Dueling bandits and Lipschitz bandits have each been studied extensively, but their combination was previously unexplored. The key is the first algorithm for Lipschitz dueling bandits, whose core mechanisms are round-based exploration guided by an adaptive reference arm and recursive region elimination, together with new analytical tools for relative feedback. The algorithm is proven to achieve a regret bound of $\tilde{O}(T^{(d_z+1)/(d_z+2)})$ over horizon $T$, where $d_z$ is the zooming dimension of the near-optimal region; moreover, it uses only logarithmic space, the best achievable by any bandit algorithm over a continuous action space.

Link: https://arxiv.org/abs/2604.00523
Authors: Mudit Sharma, Shweta Jain, Vaneet Aggarwal, Ganesh Ghalme
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:

Abstract: We study, for the first time, stochastic dueling bandits over continuous action spaces with Lipschitz structure, where feedback is purely comparative. While dueling bandits and Lipschitz bandits have been studied separately, their combination has remained unexplored. We propose the first algorithm for Lipschitz dueling bandits, using round-based exploration and recursive region elimination guided by an adaptive reference arm. We develop new analytical tools for relative feedback and prove a regret bound of \tilde{O}\left(T^{\frac{d_z+1}{d_z+2}}\right), where d_z is the zooming dimension of the near-optimal region. Further, our algorithm takes only logarithmic space in terms of the total time horizon, the best achievable by any bandit algorithm over a continuous action space.
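The flavor of elimination from purely comparative feedback can be shown in a heavily simplified, noise-free 1D sketch: duel two probe arms each round and discard the quarter of the interval that cannot contain the maximizer. This collapses the paper's stochastic, metric-space algorithm to deterministic interval halving for a unimodal utility, so it is illustrative only.

```python
def duel(f, x, y):
    """Comparative feedback: does arm x beat arm y? (noise-free here)"""
    return f(x) > f(y)

def dueling_interval_search(f, rounds=40):
    """Duel two probe arms per round, eliminate the losing quarter.
    Correct for strictly unimodal f on [0, 1]; the paper handles
    stochastic feedback and general Lipschitz spaces."""
    a, b = 0.0, 1.0
    for _ in range(rounds):
        m1, m2 = a + (b - a) / 4, b - (b - a) / 4
        if duel(f, m1, m2):
            b = m2          # maximizer cannot lie in (m2, b]
        else:
            a = m1          # maximizer cannot lie in [a, m1)
    return (a + b) / 2

best = dueling_interval_search(lambda x: -(x - 0.3) ** 2)
```

Only ordinal comparisons of f are ever used, mirroring the relative-feedback setting; the interval shrinks by a factor of 3/4 per round.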

[MA-7] Competition and Cooperation of LLM Agents in Games

【Quick Read】: This paper studies whether large language model (LLM) agents in competitive multi-agent environments converge to Nash equilibria and how their strategic behavior can be characterized. The finding is that, given multi-round prompts and non-zero-sum context, LLM agents tend to cooperate rather than converge to Nash equilibrium; chain-of-thought analysis reveals that fairness reasoning is the central mechanism driving this cooperation. The paper proposes an analytical framework that captures the dynamics of LLM agent reasoning across rounds and explains the observed cooperation.

Link: https://arxiv.org/abs/2604.00487
Authors: Jiayi Yao, Cong Chen, Baosen Zhang
Affiliations: University of Washington; Dartmouth College
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
Comments:

Abstract:Large language model (LLM) agents are increasingly deployed in competitive multi-agent settings, raising fundamental questions about whether they converge to equilibria and how their strategic behavior can be characterized. In this paper, we study LLM agent interactions in two standard games: a network resource allocation game and a Cournot competition game. Rather than converging to Nash equilibria, we find that LLM agents tend to cooperate when given multi-round prompts and non-zero-sum context. Chain-of-thought analysis reveals that fairness reasoning is central to this behavior. We propose an analytical framework that captures the dynamics of LLM agent reasoning across rounds and explains these experimental findings.
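For the Cournot game, the standard benchmark quantities make the competition-vs-cooperation distinction concrete: with linear inverse demand p = a - b*Q and constant marginal cost c, the symmetric Nash quantity exceeds the cooperative (joint-monopoly) quantity, so observed outputs reveal which regime agents are in. The parameter values below are invented for illustration.

```python
def cournot_nash_q(a, b, c, n):
    """Per-firm Nash equilibrium quantity with n symmetric firms:
    q = (a - c) / (b * (n + 1))."""
    return (a - c) / (b * (n + 1))

def cournot_collusive_q(a, b, c, n):
    """Per-firm quantity if firms split the monopoly output
    Q_m = (a - c) / (2b) evenly."""
    return (a - c) / (2 * b * n)

a, b, c, n = 100.0, 1.0, 10.0, 2
nash = cournot_nash_q(a, b, c, n)        # each firm produces more under rivalry
coop = cournot_collusive_q(a, b, c, n)   # cooperation -> lower output, higher price
```

LLM agents settling near `coop` rather than `nash`, as the paper reports, is the signature of cooperative rather than equilibrium play.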

[MA-8] Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation

【Quick Read】: This paper addresses the trustworthiness of LLM-based agent judges for evaluating conversational AI: can their scores be trusted, and how many judges are needed for reliable evaluation? The key is conditioning agent judges on structured personas (Big Five personality traits) and validating them empirically: their evaluations are indistinguishable from human raters in a Turing-style validation. The study also reveals a dissociation between scoring precision and issue-discovery breadth: quality scores improve logarithmically with panel size while new-issue discovery follows a sublinear power law, with scores saturating roughly twice as fast as discoveries. This suggests that critical flaws are found by small panels while corner cases require progressively larger ones; the mechanism traces to persona-driven diversity in what agents probe, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution.

Link: https://arxiv.org/abs/2604.00477
Authors: HyunJoon Jung, William Na
Affiliations: MPhora.ai
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Abstract: LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law; both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity: Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.
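The species-accumulation analogy is easy to reproduce in miniature: if judges sample issues from a heavy-tailed distribution, unique discoveries grow sublinearly with panel size. The Zipf-like weights, issue count, and draws-per-judge below are invented purely to illustrate the shape of the curve, not taken from the paper.

```python
import random

def unique_discoveries(panel_size, issues=200, draws_per_judge=10, seed=0):
    """Toy accumulation curve: count distinct issues found by a panel
    whose members sample from a Zipf-like (heavy-tailed) issue space."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(issues)]  # heavy tail
    found = set()
    for _ in range(panel_size * draws_per_judge):
        found.add(rng.choices(range(issues), weights=weights)[0])
    return len(found)

# Curve over growing panels; with a fixed seed the draw sequences are
# nested, so the curve is non-decreasing by construction.
curve = [unique_discoveries(n) for n in (1, 2, 4, 8, 16)]
```

The head of the distribution (critical issues) is exhausted quickly; each doubling of the panel digs further into the tail, matching the "one panel size does not fit all" finding.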

[MA-9] CASCADE: Cascaded Scoped Communication for Multi-Agent Re-planning in Disrupted Industrial Environments ICLR2026

【Quick Read】: This paper addresses multi-agent coordination for industrial disruption replanning under strict latency and communication budgets, where disruptions propagate through tightly coupled physical dependencies. Existing coordination schemes either treat communication as free (broadcast-style escalation) or fix its scope in advance (hand-tuned neighborhoods); both are brittle once a disruption extends beyond a local region. The key is CASCADE, a budgeted replanning mechanism that makes communication scope explicit and auditable rather than fixed or implicit: each agent maintains an explicit knowledge base, solves role-conditioned local decision problems to revise commitments, and coordinates through lightweight contract primitives, expanding its coordination footprint only when local validation shows the current scope is insufficient. This design separates a unified agent substrate (Knowledge Base / Decision Manager / Communication Manager) from a scoped interaction layer that controls who is contacted, how far coordination propagates, and when escalation is triggered, yielding clear gains in quality-latency-communication trade-offs and robustness under uncertainty.

Link: https://arxiv.org/abs/2604.00451
Authors: Mingjie Bi
Affiliations: Beijing Institute for General Artificial Intelligence (BIGAI)
Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Published at ICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making

Abstract: Industrial disruption replanning demands multi-agent coordination under strict latency and communication budgets, where disruptions propagate through tightly coupled physical dependencies and rapidly invalidate baseline schedules and commitments. Existing coordination schemes often treat communication as either effectively free (broadcast-style escalation) or fixed in advance (hand-tuned neighborhoods), both of which are brittle once the disruption footprint extends beyond a local region. We present CASCADE, a budgeted replanning mechanism that makes communication scope explicit and auditable rather than fixed or implicit. Each agent maintains an explicit knowledge base, solves role-conditioned local decision problems to revise commitments, and coordinates through lightweight contract primitives whose footprint expands only when local validation indicates that the current scope is insufficient. This design separates a unified agent substrate (Knowledge Base / Decision Manager / Communication Manager) from a scoped interaction layer that controls who is contacted, how far coordination propagates, and when escalation is triggered under explicit budgets. We evaluate CASCADE on disrupted manufacturing and supply-chain settings using unified diagnostics intended to test a mechanism-design claim, namely whether explicit scope control yields useful quality-latency-communication trade-offs and improved robustness under uncertainty, rather than to provide a complete algorithmic ranking.

[MA-10] Convergence of Byzantine-Resilient Gradient Tracking via Probabilistic Edge Dropout

【Quick Read】: This paper addresses distributed optimization with Byzantine agents that may send arbitrary adversarial messages. The key is Gradient Tracking with Probabilistic Edge Dropout (GT-PD), which combines two complementary defenses: a universal self-centered projection that clips each incoming message to a ball of radius τ around the receiving agent, and a decentralized probabilistic edge-dropout rule driven by a dual-metric trust score that adapts each link's activation probability. The design bounds adversarial perturbations while preserving the doubly stochastic mixing structure needed for convergence. To handle tracking-error accumulation under persistent perturbations, the paper further introduces GT-PD-L with a leaky integrator, whose low-pass-filter-like effect controls error accumulation and yields linear convergence to a bounded neighborhood determined by the stochastic gradient variance and the clipping-to-leak ratio. On MNIST, GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under Sign Flip, ALIE, and Inner Product Manipulation attacks.

Link: https://arxiv.org/abs/2604.00449
Authors: Amirhossein Dezhboro, Fateme Maleki, Arman Adibi, Erfan Amini, Jose E. Ramirez-Marquez
Affiliations: Stevens Institute of Technology; Rutgers University; Augusta University; Columbia University
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:

Abstract: We study distributed optimization over networks with Byzantine agents that may send arbitrary adversarial messages. We propose Gradient Tracking with Probabilistic Edge Dropout (GT-PD), a stochastic gradient tracking method that preserves the convergence properties of gradient tracking under adversarial communication. GT-PD combines two complementary defense layers: a universal self-centered projection that clips each incoming message to a ball of radius \tau around the receiving agent, and a fully decentralized probabilistic dropout rule driven by a dual-metric trust score in the decision and tracking channels. This design bounds adversarial perturbations while preserving the doubly stochastic mixing structure, a property often lost under robust aggregation in decentralized settings. Under complete Byzantine isolation (p_b = 0), GT-PD converges linearly to a neighborhood determined solely by stochastic gradient variance. For partial isolation (p_b > 0), we introduce Gradient Tracking with Probabilistic Edge Dropout and Leaky Integration (GT-PD-L), which uses a leaky integrator to control the accumulation of tracking errors caused by persistent perturbations and achieves linear convergence to a bounded neighborhood determined by the stochastic variance and the clipping-to-leak ratio. We further show that under two-tier dropout with p_h = 1, isolating Byzantine agents introduces no additional variance into the honest consensus dynamics. Experiments on MNIST under Sign Flip, ALIE, and Inner Product Manipulation attacks show that GT-PD-L outperforms coordinate-wise trimmed mean by up to 4.3 percentage points under stealth attacks.
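The universal defense layer, clipping each incoming message to the ball of radius τ around the receiver's own state, is simple enough to sketch directly (a minimal illustration of that one projection step, with invented vectors, not the full GT-PD algorithm):

```python
import numpy as np

def self_centered_projection(msg, own_state, tau):
    """Clip an incoming message to the ball of radius tau around the
    receiving agent's own state (GT-PD's universal defense layer)."""
    diff = msg - own_state
    dist = np.linalg.norm(diff)
    if dist <= tau:
        return msg                           # honest-looking message passes
    return own_state + (tau / dist) * diff   # outlier projected onto the ball

own = np.array([0.0, 0.0])
honest = np.array([0.3, 0.4])            # within radius 1.0: untouched
byzantine = np.array([30.0, 40.0])       # huge outlier: clipped to norm 1.0
h = self_centered_projection(honest, own, tau=1.0)
b = self_centered_projection(byzantine, own, tau=1.0)
```

Because the projection is applied uniformly to every neighbor, it bounds adversarial influence without altering which edges are mixed, which is how the doubly stochastic structure survives.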

[MA-11] Internal State-Based Policy Gradient Methods for Partially Observable Markov Potential Games

【Quick Read】: This paper addresses multi-agent reinforcement learning in partially observable Markov potential games (POMPGs), whose core challenges are partial observability, decentralized information, and the curse of dimensionality. The key is an internal-state mechanism built on the common information framework: shared information enables joint policy decisions, while compressing accumulated information prevents unbounded state growth and keeps the problem tractable. On this basis, the paper proposes an internal state-based natural policy gradient method and establishes the first non-asymptotic convergence bound for it, with error decomposing into a statistical term and an approximation term induced by finite-state controllers; experiments show consistent improvements over baselines that use only the current observation across multiple partially observable environments.

Link: https://arxiv.org/abs/2604.00433
Authors: Wonseok Yang, Thinh T. Doan
Affiliations: University of Texas at Austin
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures. Submitted to IEEE Control Systems Letters (L-CSS) with CDC option

Abstract:This letter studies multi-agent reinforcement learning in partially observable Markov potential games. Solving this problem is challenging due to partial observability, decentralized information, and the curse of dimensionality. First, to address the first two challenges, we leverage the common information framework, which allows agents to act based on both shared and local information. Second, to ensure tractability, we study an internal state that compresses accumulated information, preventing it from growing unboundedly over time. We then implement an internal state-based natural policy gradient method to find Nash equilibria of the Markov potential game. Our main contribution is to establish a non-asymptotic convergence bound for this method. Our theoretical bound decomposes into two interpretable components: a statistical error term that also arises in standard Markov potential games, and an approximation error capturing the use of finite-state controllers. Finally, simulations across multiple partially observable environments demonstrate that the proposed method using finite-state controllers achieves consistent improvements in performance compared to the setting where only the current observation is used.

[MA-12] Secure Forgetting: A Framework for Privacy-Driven Unlearning in Large Language Model (LLM)-Based Agents

【Quick Read】: This paper addresses the safety and accuracy problems that arise as large language model (LLM)-based agents accumulate sensitive or outdated knowledge in real-world applications, introducing a new research direction: LLM-based agent unlearning, i.e., controlled forgetting of targeted knowledge. The key is a framework that categorizes unlearning scenarios along three dimensions: state unlearning, trajectory unlearning, and environment unlearning, together with a natural language-based unlearning method in which a conversion model is trained to transform high-level unlearning requests into actionable unlearning prompts that guide the agent through a controlled forgetting process. An unlearning inference adversary is introduced to evaluate robustness against knowledge leakage; experiments show the approach forgets targeted knowledge while preserving performance on untargeted tasks and prevents the adversary from inferring the forgotten knowledge.

Link: https://arxiv.org/abs/2604.00430
Authors: Dayong Ye, Tainqing Zhu, Congcong Zhu, Feng He, Qi He, Shang Wang, Bo Liu, Wanlei Zhou
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR)
Comments:

Abstract:Large language model (LLM)-based agents have recently gained considerable attention due to the powerful reasoning capabilities of LLMs. Existing research predominantly focuses on enhancing the task performance of these agents in diverse scenarios. However, as LLM-based agents become increasingly integrated into real-world applications, significant concerns emerge regarding their accumulation of sensitive or outdated knowledge. Addressing these concerns requires the development of mechanisms that allow agents to selectively forget previously learned knowledge, giving rise to a new term LLM-based agent unlearning. This paper initiates research on unlearning in LLM-based agents. Specifically, we propose a novel and comprehensive framework that categorizes unlearning scenarios into three contexts: state unlearning (forgetting specific states or items), trajectory unlearning (forgetting sequences of actions) and environment unlearning (forgetting entire environments or categories of tasks). Within this framework, we introduce a natural language-based unlearning method that trains a conversion model to transform high-level unlearning requests into actionable unlearning prompts, guiding agents through a controlled forgetting process. Moreover, to evaluate the robustness of the proposed framework, we introduce an unlearning inference adversary capable of crafting prompts, querying agents, and observing their behaviors in an attempt to infer the forgotten knowledge. Experimental results show that our approach effectively enables agents to forget targeted knowledge while preserving performance on untargeted tasks, and prevents the adversary from inferring the forgotten knowledge.

[MA-13] Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

【Quick Read】: This paper addresses how AI agents and critics in a multi-agent system can be jointly optimized without direct inter-agent or inter-critic communication, completing multimodal tasks (e.g., fault detection in network telemetry, medical-image diagnostics) in a federated setting while preserving privacy. The key is an algorithm based on multi-time scale stochastic approximation in which AI agents and critics interact through feedback via a central server to minimize the overall system cost while keeping their cost functions (or their derivatives) private; moreover, the communication overhead scales only with the number of modalities $ m $, i.e., $ \mathcal{O}(m) $, independent of the number of agents and critics, improving scalability and practicality.

Link: https://arxiv.org/abs/2604.00319
Authors: Syed Eqbal Alam, Zhan Shu
Affiliations: University of Alberta; SheQAI Research
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract: We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity, and cause analysis in a network telemetry system, text-to-image generation, video generation, healthcare diagnostics from medical images and patient records, etcetera. The AI agents complete their tasks and send them to AI critics for evaluation. The critics then send feedback to agents to improve their responses. Collaboratively, they minimize the overall cost to the system with no inter-agent or inter-critic communication. AI agents and critics keep their cost functions or derivatives of cost functions private. Using multi-time scale stochastic approximation techniques, we provide convergence guarantees on the time-average active states of AI agents and critics. The communication overhead on the system is small, of the order of \mathcal{O}(m) for m modalities, and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity, and cause analysis in network telemetry and a thorough evaluation to check the algorithm's efficacy.
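The core analytical tool, multi-time scale stochastic approximation, can be illustrated in its simplest two-timescale form: a fast iterate tracks a noisy quantity (think of a critic's estimate) while a slow iterate descends using it (an agent's update), with step sizes chosen so the fast one equilibrates relative to the slow one. The cost function, noise model, and step-size exponents below are invented for illustration and are not the paper's algorithm.

```python
import random

def two_timescale(steps=20000, seed=1):
    """Toy two-timescale SA: minimize f(theta) = (theta - 2)^2 where the
    gradient is only observable through noise, tracked by a fast iterate."""
    rng = random.Random(seed)
    theta = 5.0        # slow variable (the "agent")
    v = 0.0            # fast variable: tracks the noisy gradient 2*(theta - 2)
    for n in range(1, steps + 1):
        g_noisy = 2.0 * (theta - 2.0) + rng.gauss(0.0, 1.0)
        a_n = 1.0 / n ** 0.6      # fast step size
        b_n = 1.0 / n             # slow step size; b_n / a_n -> 0
        v += a_n * (g_noisy - v)  # fast: running estimate of the gradient
        theta -= b_n * v          # slow: descend along the tracked estimate
    return theta

theta_star = two_timescale()      # should settle near the minimizer theta = 2
```

The timescale separation (b_n/a_n → 0) is what lets the slow update treat the fast iterate as already converged, the same structure that underlies the paper's convergence guarantees.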

[MA-14] Improvisational Games as a Benchmark for Social Intelligence of AI Agents: The Case of Connections

【Quick Read】: This paper addresses how to evaluate the social intelligence of language-model-based agents in complex social settings, beyond their own memory and deductive reasoning. The key is an improvisational wordplay game, Connections, which demands knowledge retrieval, summarization, and sensitivity to other agents' cognitive states; used as a benchmark, it lets researchers systematically assess the social awareness and intelligence AI agents display when collaborating with others under constrained communication.

Link: https://arxiv.org/abs/2604.00284
Authors: Gaurav Rajesh Parikh, Angikar Ghosal
Affiliations: Duke University; Stanford University
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: this https URL

Abstract: We formally introduce an improvisational wordplay game called Connections to explore the reasoning capabilities of AI agents. Playing Connections combines skills in knowledge retrieval, summarization and awareness of cognitive states of other agents. We show how the game serves as a good benchmark for social intelligence abilities of language model based agents that go beyond the agents' own memory and deductive reasoning and also involve gauging the understanding capabilities of other agents. Finally, we show how through communication with other agents in a constrained environment, AI agents must demonstrate social awareness and intelligence in games involving collaboration.

[MA-15] A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation

【Quick Read】: This paper addresses the limitations of single-agent large language model (LLM) systems in simultaneously supporting diverse conversational functions and ensuring safety in behavioral health communication. The key is a safety-aware, role-orchestrated multi-agent LLM framework that decomposes conversational responsibilities across specialized agents (empathy-focused, action-oriented, and supervisory roles), with a prompt-based controller that dynamically activates relevant agents and enforces continuous safety auditing, balancing structural quality, functional diversity, and controlled computational characteristics.

Link: https://arxiv.org/abs/2604.00249
Authors: Ha Na Cho
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Single-agent large language model (LLM) systems struggle to simultaneously support diverse conversational functions and maintain safety in behavioral health communication. We propose a safety-aware, role-orchestrated multi-agent LLM framework designed to simulate supportive behavioral health dialogue through coordinated, role-differentiated agents. Conversational responsibilities are decomposed across specialized agents, including empathy-focused, action-oriented, and supervisory roles, while a prompt-based controller dynamically activates relevant agents and enforces continuous safety auditing. Using semi-structured interview transcripts from the DAIC-WOZ corpus, we evaluate the framework with scalable proxy metrics capturing structural quality, functional diversity, and computational characteristics. Results illustrate clear role differentiation, coherent inter-agent coordination, and predictable trade-offs between modular orchestration, safety oversight, and response latency when compared to a single-agent baseline. This work emphasizes system design, interpretability, and safety, positioning the framework as a simulation and analysis tool for behavioral health informatics and decision-support research rather than a clinical intervention.

[MA-16] AI-Mediated Explainable Regulation for Justice

【速读】:该论文旨在解决当前监管决策过程中存在的诸多问题,包括监管规则静态化、缺乏解释性、易受强势利益集团不当影响,以及由此产生的合法性缺失等问题,这些问题可能导致不公正并严重损害社会与民主制度。其解决方案的关键在于引入分布式人工智能(Distributed Artificial Intelligence, Distributed AI),构建一个可解释且可自适应的监管推荐系统:该系统通过独立的偏好模型分别建模各利益相关方的价值诉求,并以价值敏感的方式聚合这些偏好,从而生成可随事实或价值观变化而动态更新的监管建议,同时确保推荐过程的透明性和可验证性,进而提升监管正义、合法性和合规性。

链接: https://arxiv.org/abs/2604.00237
作者: Thomas Hofweber,Andreas Sudmann,Evangelos Pournaras
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Present practice of deciding on regulation faces numerous problems that make adopted regulations static, unexplained, unduly influenced by powerful interest groups, and stained with a perception of illegitimacy. These well-known problems with the regulatory process can lead to injustice and have substantial negative effects on society and democracy. We discuss a new approach that utilizes distributed artificial intelligence (AI) to make a regulatory recommendation that is explainable and adaptable by design. We outline the main components of a system that can implement this approach and show how it would resolve the problems with the present regulatory system. This approach models and reasons about stakeholder preferences with separate preference models, while it aggregates these preferences in a value sensitive way. Such recommendations can be updated due to changes in facts or in values and are inherently explainable. We suggest how stakeholders can make their preferences known to the system and how they can verify whether they were properly considered in the regulatory decision. The resulting system promises to support regulatory justice, legitimacy, and compliance.
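A minimal sketch of the value-sensitive aggregation of separate stakeholder preference models that the abstract describes. The stakeholder names, weights, and scores below are invented for illustration; the real system would infer preferences with learned models rather than fixed dictionaries, but keeping a per-stakeholder contribution trace is what makes the recommendation explainable and verifiable.

```python
def aggregate(preferences, weights):
    """Combine per-stakeholder option scores with explicit value weights,
    keeping a per-stakeholder breakdown as an explainability audit trail."""
    options = next(iter(preferences.values())).keys()
    scores, trace = {}, {}
    for opt in options:
        contribs = {s: weights[s] * prefs[opt] for s, prefs in preferences.items()}
        scores[opt] = sum(contribs.values())
        trace[opt] = contribs                 # each stakeholder's contribution
    best = max(scores, key=scores.get)
    return best, scores, trace

prefs = {"residents": {"strict": 0.9, "lax": 0.2},
         "industry":  {"strict": 0.1, "lax": 0.8}}
best, scores, trace = aggregate(prefs, {"residents": 0.6, "industry": 0.4})
print(best, round(scores["strict"], 2))
```

Because facts and values enter only through `preferences` and `weights`, updating either and re-running the aggregation yields the adaptability the abstract emphasizes.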

[MA-17] One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在临床预测任务中因病例异质性导致的输出不一致性问题:简单病例预测稳定,而复杂病例对微小提示变化敏感,产生分歧。现有单代理策略受限于固定角色分布,多代理框架则采用静态角色与扁平多数投票机制,忽视了诊断分歧中的信号价值。其解决方案的核心是提出CAMP(Case-Adaptive Multi-agent Panel),通过由主治医师代理动态构建针对每个病例诊断不确定性的专科医生小组;每位专科医生采用三值投票(KEEP/REFUSE/NEUTRAL)机制实现专业范围内的审慎判断,并结合混合路由机制——优先强共识、次选主治医师裁决、最后基于证据的质量进行仲裁,从而在保证准确性的同时提升决策透明度与可审计性。

链接: https://arxiv.org/abs/2604.00085
作者: Yuxing Lu,Yushuhong Lin,Jason Zhang
机构: Georgia Institute of Technology (佐治亚理工学院); Peking University (北京大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language models applied to clinical prediction exhibit case-level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case-Adaptive Multi-agent Panel), where an attending-physician agent dynamically assembles a specialist panel tailored to each case’s diagnostic uncertainty. Each specialist evaluates candidates via three-valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one’s expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician’s judgment, or evidence-based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC-IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi-agent methods, with voting records and arbitration traces offering transparent decision audits.
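The three-valued voting and hybrid routing can be sketched as follows. The consensus threshold and the exact ordering of fallback versus arbitration are assumptions for illustration; the paper's arbitration additionally weighs argument quality, which is reduced here to deferring to the attending physician's pick.

```python
def route(votes, attending_pick, threshold=0.75):
    """Hybrid router sketch: KEEP/REFUSE count toward consensus, NEUTRAL
    abstains (principled abstention outside a specialist's expertise)."""
    cast = [v for v in votes if v != "NEUTRAL"]
    if not cast:
        return attending_pick, "fallback"     # nobody voted: attending decides
    keep_rate = cast.count("KEEP") / len(cast)
    if keep_rate >= threshold:
        return "accept", "consensus"          # strong consensus path
    if keep_rate <= 1 - threshold:
        return "reject", "consensus"
    return attending_pick, "arbitration"      # contested: weigh argument quality

decision, path = route(["KEEP", "KEEP", "NEUTRAL", "KEEP"], "accept")
print(decision, path)
```

Recording `(decision, path)` per candidate is what yields the transparent voting and arbitration traces the abstract highlights.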

自然语言处理

[NLP-0] Universal YOCO for Efficient Depth Scaling

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理时计算资源效率低下的问题,尤其是传统Transformer架构在测试时扩展(test-time scaling)中因循环策略带来的高计算开销以及KV缓存随模型深度增长而导致的内存膨胀。解决方案的关键在于提出Universal YOCO(YOCO-U),其结合了YOCO解码器-解码器架构与递归计算机制,通过参数共享实现多轮迭代,并将迭代过程限制在浅层高效注意力(efficient-attention)层内,从而在保持恒定全局KV缓存和线性预填充的同时,以有限开销提升表征深度,最终实现能力与效率的协同优化。

链接: https://arxiv.org/abs/2604.01220
作者: Yutao Sun,Li Dong,Tianzhu Ye,Shaohan Huang,Jianyong Wang,Furu Wei
机构: Microsoft Research (微软研究院); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.
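The recursion at the heart of YOCO-U can be illustrated with a toy numpy block: one layer's parameters are reused across iterations, so representational depth grows while the parameter count (and, in the real architecture, the KV cache) stays fixed. The tanh residual step below is a hypothetical stand-in for the shallow efficient-attention self-decoder, not the actual model.

```python
import numpy as np

def universal_block(x, W, iters=3):
    """Apply one shared-weight residual block `iters` times: depth scales
    with iterations while the parameter count stays constant."""
    for _ in range(iters):
        x = np.tanh(x @ W) + x               # same W every iteration
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))              # 4 tokens, width 8
W = rng.standard_normal((8, 8)) * 0.1        # the only parameters
deep = universal_block(x, W, iters=4)
print(deep.shape, W.size)                    # depth grew; parameters did not
```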

[NLP-1] YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期任务中维持战略一致性(strategic coherence)的问题,具体包括在不确定性下规划、从延迟反馈中学习以及在早期错误导致后果累积时进行适应性调整的能力。其解决方案的核心是提出一个名为 \texttt{YC-Bench} 的基准测试框架,该框架通过模拟一个为期一年的创业公司运营过程(跨越数百轮交互),要求代理在部分可观测环境中管理员工、选择任务合同并保持盈利,同时应对敌对客户和不断增长的薪资成本所带来的复杂挑战。实验表明,仅少数模型能持续超越初始资本(20万美元),且“草稿区使用率”(scratchpad usage)是成功的关键预测因子,而敌对客户识别失败则是破产的主要原因(占47%)。这揭示了当前前沿模型在长周期任务中仍存在显著能力差距,如过度并行化等独特失效模式。

链接: https://arxiv.org/abs/2604.01212
作者: Muyu He,Adit Jain,Anand Kumar,Vincent Tu,Soumyadeep Bakshi,Sachin Patro,Nazneen Rajani
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce YC-Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of $200K, with Claude Opus 4.6 achieving the highest average final funds at $1.27M, followed by GLM-5 at $1.21M at 11× lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for 47% of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. YC-Bench is open-source, reproducible, and configurable.
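The scratchpad mechanic that the abstract identifies as the strongest predictor of success can be illustrated with a toy episode loop. Everything here (the policy, the client names, the cash values, the truncation rule) is hypothetical; the sketch only shows how a note persisted across context truncation lets an agent avoid a repeat loss from a bad client.

```python
def run_episode(turns, policy, scratch_limit=200):
    """Per-turn context is dropped after each step; only the scratchpad
    string (capped at `scratch_limit` chars) persists across turns."""
    scratchpad, funds = "", 200_000
    for obs in turns:
        action, note = policy(obs, scratchpad)
        funds += action                       # signed cash effect
        scratchpad = (scratchpad + " " + note)[-scratch_limit:]
    return funds, scratchpad

def cautious_policy(obs, pad):
    # hypothetical policy: take contracts, but skip clients noted as bad
    if obs["client"] in pad:
        return 0, f"skipped {obs['client']}"
    gain = obs["value"]
    note = obs["client"] if gain < 0 else "ok"  # remember who burned us
    return gain, note

turns = [{"client": "acme", "value": -50_000},
         {"client": "acme", "value": -50_000},
         {"client": "beta", "value": 80_000}]
print(run_episode(turns, cautious_policy)[0])
```

Without the scratchpad note, the second "acme" contract would be taken again, ending the episode at $180K instead.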

[NLP-2] LLM REgression with a Latent Iterative State Head

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文本回归任务中精度不足与参数效率低下的问题。现有方法如自回归解码、回归感知推理和预测头方法存在性能瓶颈或高参数开销。解决方案的关键在于提出RELISH(REgression with a Latent Iterative State Head),其通过在冻结的LLM表示上引入一个轻量级的迭代潜状态头(latent iterative state head),利用跨注意力机制逐步精炼潜状态,并最终通过线性回归器映射为标量预测值。该设计在保持极低参数增量(仅3.4–3.7M trainable参数,占LLM总参数0.01–0.04%)的同时,显著优于多种主流回归范式。

链接: https://arxiv.org/abs/2604.01206
作者: Yiheng Su,Matthew Lease
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
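A minimal numpy sketch of a RELISH-style head, under assumed shapes and random weights: a learned latent state is refined by cross-attention over frozen token-level representations, then mapped to a scalar by a linear regressor. The real head is trained end-to-end; this sketch only shows the data flow.

```python
import numpy as np

def relish_head(H, q, Wk, Wv, w_out, iters=3):
    """Iteratively refine latent state `q` by cross-attending over frozen
    token states `H`, then map the final state to a point estimate."""
    K, V = H @ Wk, H @ Wv
    for _ in range(iters):
        s = K @ q
        a = np.exp(s - s.max())
        a /= a.sum()                          # attention over tokens
        q = q + a @ V                         # refine the latent state
    return float(w_out @ q)                   # linear regressor

rng = np.random.default_rng(1)
H = rng.standard_normal((5, 16))              # 5 frozen LLM token states
q0 = np.zeros(8)                              # initial latent state
Wk = rng.standard_normal((16, 8)) * 0.1
Wv = rng.standard_normal((16, 8)) * 0.1
w_out = rng.standard_normal(8) * 0.1
print(round(relish_head(H, q0, Wk, Wv, w_out), 4))
```

The trainable parameters are only `Wk`, `Wv`, `w_out`, and the latent-state machinery, which is why the overhead stays in the millions of parameters regardless of backbone size.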

[NLP-3] Embarrassingly Simple Self-Distillation Improves Code Generation

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在代码生成任务中如何仅通过自身输出进行优化的问题,而不依赖验证器、教师模型或强化学习等外部监督机制。其核心解决方案是提出一种简单自蒸馏(Simple Self-Distillation, SSD)方法:首先以特定温度和截断配置从模型中采样代码解题方案,随后使用这些样本通过标准监督微调(Supervised Fine-Tuning, SFT)对模型进行再训练。关键在于,SSD能有效重塑上下文相关的词元分布,在需要高精度的区域抑制冗余尾部(distractor tails),同时保留探索性需求下的有用多样性,从而显著提升模型在复杂问题上的表现,如Qwen3-30B-Instruct在LiveCodeBench v6上的通过率(pass@1)从42.4%提升至55.3%,且该方法具有跨模型架构与规模(4B–30B)的泛化能力。

链接: https://arxiv.org/abs/2604.01193
作者: Ruixiang Zhang,Richard He Bai,Huangjie Zheng,Navdeep Jaitly,Ronan Collobert,Yizhe Zhang
机构: Apple(苹果)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
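The "certain temperature and truncation configurations" SSD samples with can be illustrated by standard temperature scaling plus nucleus (top-p) truncation, one plausible reading of the paper's setup, with illustrative values: the low-probability distractor tail is removed before sampling, which is exactly the reshaping the abstract says fine-tuning then bakes in.

```python
import numpy as np

def truncate_and_temper(p, temperature=0.7, top_p=0.9):
    """Sharpen a token distribution with temperature, then drop the
    low-probability tail via nucleus truncation and renormalize."""
    logits = np.log(p) / temperature
    q = np.exp(logits - logits.max())
    q /= q.sum()
    order = np.argsort(q)[::-1]
    keep = np.cumsum(q[order]) <= top_p
    keep[0] = True                            # always keep the top token
    out = np.zeros_like(q)
    out[order[keep]] = q[order[keep]]
    return out / out.sum()

p = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
q = truncate_and_temper(p)
print(q.round(3))                             # tail tokens are zeroed out
```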

[NLP-4] Screening Is Enough

【速读】: 该论文旨在解决标准Softmax注意力机制中缺乏绝对查询-键相关性(absolute query–key relevance)的问题:传统注意力机制通过相对得分重新分配单位质量来计算权重,导致相关性仅相对于竞争键定义,无法显式排除无关键。为此,作者提出Multiscreen架构,其核心创新是引入“筛选”(screening)机制——该机制对每个键进行显式阈值判断,剔除不相关的键并聚合剩余键,从而消除键之间的全局竞争。这一设计实现了绝对相关性的建模,在保持性能的同时显著减少参数量(约40%)、提升训练稳定性(支持更大学习率)、增强长上下文建模能力,并大幅降低推理延迟(10万上下文长度下最高达3.2倍加速)。

链接: https://arxiv.org/abs/2604.01178
作者: Ken M. Nakanishi
机构: Center for Emergent Matter Science (CEMS), RIKEN; Graduate School of Science, The University of Tokyo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 13 figures

点击查看摘要

Abstract:A core limitation of standard softmax attention is that it does not define a notion of absolute query–key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query–key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2× at 100K context length.
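The contrast between softmax's relative relevance and screening's absolute threshold can be shown in a few numpy lines. The scores and the threshold value are illustrative; in Multiscreen the mechanism is learned and built into the architecture, whereas here it is a plain score cutoff.

```python
import numpy as np

def softmax_attn(scores, V):
    """Standard attention: unit mass is redistributed over ALL keys."""
    a = np.exp(scores - scores.max())
    return (a / a.sum()) @ V

def screened_attn(scores, V, tau=0.0):
    """Screening sketch: keys below the absolute threshold `tau` are
    discarded outright instead of merely down-weighted."""
    keep = scores > tau
    if not keep.any():
        return np.zeros(V.shape[1])           # every key rejected
    w = np.exp(scores[keep] - scores[keep].max())
    return (w / w.sum()) @ V[keep]

scores = np.array([2.0, 0.0, -1.0])
V = np.eye(3)
print(softmax_attn(scores, V).round(3))       # all keys get some mass
print(screened_attn(scores, V, tau=1.0).round(3))  # weak keys get exactly zero
```

Note that when all keys fail the threshold the output is zero, something softmax can never produce; that is the "absolute relevance" the abstract refers to.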

[NLP-5] Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

【速读】: 该论文旨在解决大语言模型在测试时缩放(test-time scaling)过程中因后训练语言模型校准不足及常用采样技术缺乏校准而导致的计算效率低下问题。其核心解决方案是提出在线推理校准(Online Reasoning Calibration, ORCA),该框架结合了符合性预测(conformal prediction)与测试时训练(test-time training),引入一种元学习机制以动态更新每个输入对应的校准模块,从而在分布偏移场景下(如推理阶段不同思维模式或部署前后提示分布差异)提供有效的置信度估计。ORCA不仅在理论上保证了符合性风险的控制,还在实证中显著提升了多种推理任务下的效率和泛化能力,例如在风险水平 δ=0.1 下,相较于静态校准基线,Qwen2.5-32B 模型在分布内任务中计算资源节省高达 47.5%(使用监督标签)和 40.7%(使用自洽标签),并在零样本跨域设置下将 MATH-500 的资源节省从 24.8% 提升至 67.0%,同时保持较低的实际误差率。

链接: https://arxiv.org/abs/2604.01170
作者: Cai Zhou,Zekai Wang,Menghua Wu,Qianyu Julie Zhu,Flora C. Shi,Chenyu Wang,Ashia Wilson,Tommi Jaakkola,Stephen Bates
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP); Machine Learning (stat.ML)
备注: 20 pages

点击查看摘要

Abstract:While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level δ = 0.1, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at this https URL.
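ORCA builds on conformal prediction; the standard split-conformal quantile it extends can be sketched as follows. This is the textbook recipe, not ORCA's online test-time-training variant, and the calibration scores are toy values.

```python
import numpy as np

def conformal_threshold(cal_scores, delta=0.1):
    """Split conformal prediction: choose the score threshold whose risk
    of a wrong decision is below `delta`, using the conservative
    ceil((n+1)(1-delta)) quantile of held-out calibration scores."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - delta)))
    return float(np.sort(cal_scores)[min(k, n) - 1])

cal = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
t = conformal_threshold(cal, delta=0.1)
print(t)
```

ORCA's contribution is making this threshold adaptive: the calibration module is meta-learned and updated per input, so validity holds even as the score distribution drifts across reasoning stages.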

[NLP-6] S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

【速读】: 该论文旨在解决在监督信号稀缺条件下,如何高效地对混合语言模型(hybrid language models)进行参数高效微调(Parameter Efficient Fine-Tuning, PEFT)的问题。其核心挑战在于,在保持模型权重冻结的前提下,如何通过极少量的训练数据实现显著性能提升,同时避免推理时的额外开销。解决方案的关键在于提出一种名为S0 tuning的新方法:仅针对每个循环层(recurrent layer)优化一个初始状态矩阵(initial state matrix),而冻结所有其他模型参数,从而在无需权重合并或模型重载的情况下实现零推理开销的性能增强。实验表明,该方法在HumanEval等基准上显著优于LoRA,并在多个下游任务中展现出良好的跨域迁移能力,验证了递归状态初始化作为PEFT表面的强大潜力。

链接: https://arxiv.org/abs/2604.01168
作者: Jack Young
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages (10 main + 5 appendix), 3 figures, code at this https URL

点击查看摘要

Abstract:Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 ± 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% ± 1.3 and LoRA reaches 71.4% ± 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: this https URL.
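The trajectory-steering mechanism can be seen in a toy linear recurrence: with frozen dynamics, changing only the initial state shifts the final state by a predictable amount, at zero inference overhead. The recurrence below is an illustrative stand-in for the hybrid model's recurrent layers, not the GatedDeltaNet or Mamba-2 update rule.

```python
import numpy as np

def run_recurrent(x_seq, A, B, s0):
    """Roll a linear recurrence with FROZEN weights A, B; only the
    initial state s0 is the tunable surface."""
    s = s0
    for x in x_seq:
        s = A @ s + B @ x                     # frozen dynamics
    return s

rng = np.random.default_rng(2)
A = np.eye(4) * 0.5                           # frozen state transition
B = rng.standard_normal((4, 3)) * 0.1         # frozen input map
x_seq = rng.standard_normal((6, 3))
base = run_recurrent(x_seq, A, B, np.zeros(4))   # default initial state
tuned = run_recurrent(x_seq, A, B, np.ones(4))   # "S0-tuned" initial state
print(np.round(tuned - base, 4))              # s0 steers the trajectory
```

By linearity the shift after T steps is exactly A^T applied to the change in s0, which is why a tuned initial state can steer behavior without touching any weight, and why tasks whose demands lie outside that reachable subspace (the Spider result) see no transfer.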

[NLP-7] Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续多领域微调过程中面临的灾难性遗忘(catastrophic forgetting)与模块化扩展效率低的问题。其核心挑战在于如何在不重新训练基础模型的前提下,高效整合多个领域的知识,并保持各领域性能的稳定性和跨域组合能力。解决方案的关键在于提出 Brainstacks 架构,该架构通过冻结的适配器堆栈(adapter stacks)以加法方式叠加于共享冻结的基础模型之上,结合五项互锁组件:(1) MoE-LoRA 结合 QLoRA 4-bit 量化与 rsLoRA 缩放策略实现高效参数稀疏更新;(2) 内循环残差增强机制通过冻结已有堆栈并添加新堆栈突破单堆栈性能上限;(3) 外循环课程学习顺序训练领域专属堆栈;(4) 基于随机 SVD 的零空间投影确保新增堆栈正交于历史方向,实现孤立场景下的零遗忘;(5) 基于任务结果的 sigmoid 元路由器(meta-router)学习堆栈权重分配,支持跨域组合推理。实验表明,该方法不仅显著加速收敛(2.5倍于单LoRA),还揭示了领域堆栈编码的是可迁移的认知原语(如指令遵循清晰度、数值推理等),而非特定领域知识,从而实现了高效、稳定且具备泛化能力的多领域持续学习。

链接: https://arxiv.org/abs/2604.01152
作者: Mohammad R. Abu Ayyash
机构: Brains Build Research(Brains Build Research)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages, 13 figures, 4 tables

点击查看摘要

Abstract:We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.
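The null-space projection that yields zero forgetting in isolation can be sketched with plain SVD (the paper uses randomized SVD for scale). The prior stack directions and the new update vector below are toy values chosen so the projection is easy to verify.

```python
import numpy as np

def project_to_nullspace(delta, prior_dirs):
    """Remove from a new stack's update `delta` any component lying in the
    row space spanned by prior stacks' directions, so the new stack cannot
    interfere with what earlier stacks learned."""
    U, S, Vt = np.linalg.svd(prior_dirs, full_matrices=False)
    basis = Vt[S > 1e-10]                     # orthonormal basis of prior rows
    return delta - basis.T @ (basis @ delta)  # orthogonal projection

prior = np.array([[1.0, 0.0, 0.0],           # directions used by stack 1
                  [0.0, 1.0, 0.0]])          # directions used by stack 2
delta = np.array([3.0, 4.0, 5.0])            # candidate update for stack 3
print(project_to_nullspace(delta, prior))    # only the orthogonal part survives
```

The projected update is, by construction, orthogonal to every prior direction, which is the formal sense in which adding a stack cannot overwrite earlier ones.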

[NLP-8] Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

【速读】: 该论文旨在解决当前对由现代编码代理(coding agents)撰写的学术论文的质量与风险缺乏系统性评估的问题。现有研究在评估AI生成论文的可靠性方面存在局限,且尚未形成统一的认知框架。其解决方案的关键在于提出Paper Reconstruction Evaluation (PaperRecon) 框架,该框架通过从原始论文中提取概要后让AI代理基于此概要和有限资源重建全文,并将生成结果与原论文进行对比,从而将评估解耦为两个正交维度:呈现质量(Presentation)与幻觉程度(Hallucination)。其中,呈现质量采用评分量表评估,幻觉则通过基于原文来源的代理式评估来量化,从而实现对AI写作论文的多维、可复现的评测。

链接: https://arxiv.org/abs/2604.01128
作者: Atsuyuki Miyai,Mashiro Toyooka,Zaiying Zhao,Kenta Watanabe,Toshihiko Yamasaki,Kiyoharu Aizawa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (this http URL) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.

[NLP-9] CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在面对内部不一致证据时性能下降的问题,尤其是在重症监护室(Intensive Care Unit, ICU)中患者自述症状与客观体征存在矛盾的情境下,现有LLM方法难以有效整合冲突信息以做出准确的器官功能恶化预测。其解决方案的关键在于提出CARE框架——一个分阶段的隐私合规代理推理系统:远程LLM通过生成结构化的分类和状态转移规则来提供指导,而不接触敏感患者数据;本地LLM则基于这些结构化信息进行证据获取与最终决策,从而在保障隐私的前提下提升对冲突临床证据的鲁棒性处理能力。

链接: https://arxiv.org/abs/2604.01113
作者: Haochen Liu,Weien Li,Rui Song,Zeyu Li,Chun Jason Xue,Xiao-Yang Liu,Sam Nallaperuma,Xue Liu,Ye Yuan
机构: University of Cambridge (剑桥大学); McGill University (麦吉尔大学); MBZUAI - Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Columbia University (哥伦比亚大学); Mila - Quebec AI Institute (魁北克人工智能研究所)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. Such a scenario exists in real-world healthcare settings, with patient-reported symptoms contradicting medical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon organ dysfunction worsening prediction in the intensive care unit (ICU) setting. We derive this dataset from the widely recognized MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases in which discordance between signs and symptoms exists. This setting poses a substantial challenge for existing LLM-based approaches, with single-pass LLMs and agentic pipelines often struggling to reconcile such conflicting signals. To address this problem, we propose CARE: a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics compared to multiple baseline settings, showing that CARE can more robustly handle conflicting clinical evidence while preserving privacy.

[NLP-10] Temporal Dependencies in In-Context Learning: The Role of Induction Heads

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文学习(in-context learning)中如何追踪和检索信息这一关键问题,尤其是其是否具备类似人类自由回忆(free recall)的序列记忆行为。研究发现,多个开源LLM表现出明显的“+1滞后偏差”(+1 lag bias),即模型倾向于预测紧接在重复token之后的下一个token,这种模式与序列回忆(serial recall)行为一致。解决方案的关键在于识别并验证“归纳头”(induction heads)的作用:这些是Transformer架构中专门关注当前token前一出现位置后紧跟token的注意力头。通过系统性消融实验表明,移除高归纳分数的注意力头显著削弱了+1滞后偏差,并且比随机消融更严重地损害了模型在少样本提示下的序列回忆能力,从而揭示了归纳头在时序上下文处理中的机制特异性作用。

链接: https://arxiv.org/abs/2604.01094
作者: Anooshka Bajaj,Deven Mahesh Mistry,Sahaj Singh Maini,Yash Aggarwal,Billy Dickson,Zoran Tiganj
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong in-context learning capabilities, but how they track and retrieve information from context remains underexplored. Drawing on the free recall paradigm in cognitive science (where participants recall list items in any order), we show that several open-source LLMs consistently display a serial-recall-like pattern, assigning peak probability to tokens that immediately follow a repeated token in the input sequence. Through systematic ablation experiments, we show that induction heads, specialized attention heads that attend to the token following a previous occurrence of the current token, play an important role in this phenomenon. Removing heads with a high induction score substantially reduces the +1 lag bias, whereas ablating random heads does not reproduce the same reduction. We also show that removing heads with high induction scores impairs the performance of models prompted to do serial recall using few-shot learning to a larger extent than removing random heads. Our findings highlight a mechanistically specific connection between induction heads and temporal context processing in transformers, suggesting that these heads are especially important for ordered retrieval and serial-recall-like behavior during in-context learning.
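The induction score the ablations rely on is, in its common form, the attention mass a head places on the token immediately following an earlier occurrence of the current token. Here is a sketch of that diagnostic on a toy attention map; this is the standard definition, not necessarily the paper's exact implementation.

```python
import numpy as np

def induction_score(attn, tokens):
    """For each query position, sum the attention on positions p+1 where
    tokens[p] == tokens[query]; average over queries that have a match."""
    n = len(tokens)
    score, count = 0.0, 0
    for q in range(1, n):
        targets = [p + 1 for p in range(q)
                   if tokens[p] == tokens[q] and p + 1 < q]
        if targets:
            score += attn[q, targets].sum()
            count += 1
    return score / count if count else 0.0

tokens = ["A", "B", "C", "A", "B"]
attn = np.zeros((5, 5))
attn[3, 1] = 1.0                              # at the 2nd "A", attend to "B"
attn[4, 2] = 1.0                              # at the 2nd "B", attend to "C"
print(induction_score(attn, tokens))          # a perfect induction head
```

A head with high score on such repeated sequences is precisely the kind ablated in the paper, and removing it is what collapses the +1 lag bias.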

[NLP-11] Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

【速读】: 该论文试图解决的问题是:多大语言模型(Large Language Model, LLM)复核流水线(multi-LLM revision pipelines)中,第二阶段模型带来的性能提升是否主要源于真正的错误修正(genuine error correction),抑或可能由其他因素(如重新求解、结构引导或内容改进)驱动。为解答这一问题,作者设计了一个受控分解实验,通过四个匹配条件将复核收益拆解为三个可加成分:重解(re-solving)、结构支架(scaffold)和内容改进(content)。关键在于,该方案揭示了复核效果并非统一机制,而是依赖于任务结构、初稿质量及初稿信息类型:在知识密集型多项选择题(MCQ)任务中,优势主要来自强模型的直接重解;而在代码生成任务中,即使初稿语义为空,其结构信息仍具价值,此时两阶段提示依然有效。因此,论文提出应根据任务特性动态调整复核策略,而非采用通用的多模型复核流程。

链接: https://arxiv.org/abs/2604.01029
作者: Jingjie Ning,Xueqi Li,Chengyu Yu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.
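The additive decomposition into re-solving, scaffold, and content components can be sketched as differences between the four matched conditions. The mapping of condition names and the accuracy numbers below are illustrative assumptions; the paper's exact condition definitions may differ.

```python
def decompose_gains(weak_only, strong_only, scaffold_revise, full_revise):
    """Split the second-pass gain into three additive components:
    re-solving = strong model alone minus weak draft alone,
    scaffold   = revising a content-stripped draft minus strong alone,
    content    = revising the real draft minus the stripped one."""
    return {
        "re_solving": strong_only - weak_only,
        "scaffold": scaffold_revise - strong_only,
        "content": full_revise - scaffold_revise,
    }

parts = decompose_gains(weak_only=0.40, strong_only=0.55,
                        scaffold_revise=0.60, full_revise=0.58)
print(parts, round(sum(parts.values()), 2))   # components telescope to the total
```

A negative `content` term, as in this made-up example, is the code-generation pattern the abstract describes: the draft's structure helps while its content hurts.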

[NLP-12] Uncertainty-Aware Variational Reward Factorization via Probabilistic Preference Bases for LLM Personalization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)个性化过程中因用户奖励数据稀疏且估计方式为确定性点估计而导致的偏好推断不准确与不可靠问题。现有方法在孤立环境下对用户权重进行点估计,忽视了不确定性,从而限制了个性化效果和泛化能力。解决方案的关键在于提出变分奖励分解(Variational Reward Factorization, VRF),其核心创新包括:将每个用户的偏好建模为共享偏好空间中的变分分布,通过变分编码器推断用户分布;利用Wasserstein距离匹配机制实现用户权重与共享概率基函数的一致性对齐;并通过方差衰减损失函数对不确定估计进行降权,从而提升整体推理的鲁棒性和准确性。

链接: https://arxiv.org/abs/2604.00997
作者: Gyuseok Lee,Wonbin Kweon,Zhenrui Yue,SeongKu Kang,Jiawei Han,Dong Wang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Korea University (韩国科学技术院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reward factorization personalizes large language models (LLMs) by decomposing rewards into shared basis functions and user-specific weights. Yet, existing methods estimate user weights from scarce data in isolation and as deterministic points, leading to inaccurate and unreliable inference. We introduce Variational Reward Factorization (VRF), an uncertainty-aware framework that represents each user’s preferences as a variational distribution in a shared preference space. VRF infers user distributions via a variational encoder, derives weights through Wasserstein distance matching with shared probabilistic bases, and downweights uncertain estimates through a variance-attenuated loss. On three benchmarks, VRF outperforms all baselines across seen and unseen users, few-shot scenarios, and varying uncertainty levels, with gains extending to downstream alignment.
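The Wasserstein matching between an inferred user distribution and shared probabilistic bases can be sketched with the closed-form 2-Wasserstein distance between diagonal Gaussians. Turning distances into weights with an exponential kernel is an illustrative assumption here, not necessarily VRF's exact rule.

```python
import numpy as np

def w2_diag_gauss(mu1, var1, mu2, var2):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians:
    W2^2 = ||mu1 - mu2||^2 + sum (sqrt(var1) - sqrt(var2))^2."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    var_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return float(np.sqrt(mean_term + var_term))

mu_u, var_u = np.array([1.0, 0.0]), np.array([0.25, 0.25])   # inferred user
bases = [(np.array([1.0, 0.0]), np.array([1.0, 1.0])),       # shared basis 0
         (np.array([0.0, 1.0]), np.array([0.25, 0.25]))]     # shared basis 1
dists = [w2_diag_gauss(mu_u, var_u, m, v) for m, v in bases]
weights = np.exp(-np.array(dists))
weights /= weights.sum()                      # closer basis gets more weight
print(np.round(weights, 3))
```

Because the user is represented as a distribution rather than a point, a wide (uncertain) variance moves the distance and hence the weights, which is the hook for the variance-attenuated down-weighting the abstract describes.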

[NLP-13] Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts

【速读】: 该论文旨在解决短视频平台(如YouTube Shorts)中地缘政治事件报道的表征研究不足的问题,特别是针对以色列-哈马斯战争的短视频新闻内容进行系统性分析。其解决方案的关键在于构建一个多模态处理流程,整合自动语音转录、基于方面的情感分析(Aspect-Based Sentiment Analysis, ABSA)与语义场景分类技术,从而实现对文本情感倾向和视觉内容特征的协同解析。实验表明,该方法能有效捕捉不同国家媒体在不同时期对冲突特定方面的态度差异,并且小规模领域适配模型在情感分析任务上优于大型预训练Transformer模型,凸显了资源高效型方法在人文社科研究中的价值。

链接: https://arxiv.org/abs/2604.00994
作者: Daniel Miehling,Sandra Kuebler
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:YouTube Shorts have become central to news consumption on the platform, yet research on how geopolitical events are represented in this format remains limited. To address this gap, we present a multimodal pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. The pipeline is first assessed for feasibility and then applied to analyze short-form coverage of the Israel-Hamas war by state-funded outlets. Using over 2,300 conflict-related Shorts and more than 94,000 visual frames, we systematically examine war reporting across major international broadcasters. Our findings reveal that the sentiment expressed in transcripts regarding specific aspects differs across outlets and over time, whereas scene-type classifications reflect visual cues consistent with real-world events. Notably, smaller domain-adapted models outperform large transformers and even LLMs for sentiment analysis, underscoring the value of resource-efficient approaches for humanities research. The pipeline serves as a template for other short-form platforms, such as TikTok and Instagram, and demonstrates how multimodal methods, combined with qualitative interpretation, can characterize sentiment patterns and visual cues in algorithmically driven video environments.

[NLP-14] Do Phone-Use Agents Respect Your Privacy?

[Quick Read]: This paper asks whether phone-use agents respect privacy while completing benign mobile tasks, a question that has long been hard to quantify because privacy-compliant behavior is not clearly operationalized for agents, and ordinary apps cannot reveal exactly what data an agent types into which form entries during execution. The authors propose MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. Its key innovation is a minimal privacy contract (iMy) that operationalizes privacy-compliant behavior as permissioned access, minimal disclosure, and user-controlled memory, paired with instrumented mock apps and rule-based auditing that make violations such as unnecessary permission requests, deceptive re-disclosure, and redundant form filling observable and reproducible. Experiments show that task success, privacy-compliant task completion, and later-session use of saved preferences are three distinct capabilities that no single model dominates, and that success-only evaluation overestimates the deployment readiness of current phone-use agents.

Link: https://arxiv.org/abs/2604.00986
Authors: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
Affiliations: The Chinese University of Hong Kong, Shenzhen; The Chinese University of Hong Kong; The University of Hong Kong; The Hong Kong University of Science and Technology; Shanghai Jiao Tong University; Hunyuan Team, Tencent
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: work in progress

Abstract:We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at this https URL.

[NLP-15] Dual Optimal: Make Your LLM Peer-like with Dignity

[Quick Read]: This paper targets the "Evasive Servant" failure mode of current aligned language models: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers, lacking both trustworthiness and accountability. The proposed Dignified Peer framework counters servility with anti-sycophancy and trustworthiness while reducing evasiveness through empathy and creativity. The key technical components are the PersonaKnob dataset, which features a compositional partial-order structure over multiple persona preferences; a tolerant constrained Lagrangian DPO algorithm that dynamically balances persona dimensions to prevent behavioral collapse; and a psychometrically calibrated evaluation protocol based on Item Response Theory (IRT) that disentangles latent persona capability from confounders such as judge bias.

Link: https://arxiv.org/abs/2604.00979
Authors: Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang
Affiliations: University of Notre Dame, Department of Computer Science and Engineering
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset, which features a compositional partial-order structure over multiple persona preferences. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully builds an LLM agent that is both dignified and peer-like.

[NLP-16] Phase transition on a context-sensitive random language model with short range interactions

[Quick Read]: This paper addresses the open question of whether phase transitions in language models stem from intrinsic properties of language itself rather than merely from long-range interactions. Earlier work demonstrated a Berezinskii-Kosterlitz-Thouless (BKT) type transition in language models with long-range interactions between symbols, but such transitions also arise in conventional spin models, making it unclear whether they are specific to language. The authors construct a random language model with short-range interactions, belonging to the context-sensitive grammars in the Chomsky hierarchy and allowing explicit reference to context, and observe a finite-temperature phase transition even when the context length stays constant with respect to sentence length. This indicates that phase transitions in language models are genuinely induced by the intrinsic nature of language rather than by long-range interactions, revealing an intrinsic complexity in linguistic systems that connects to the mechanisms of phase transitions in statistical physics.

Link: https://arxiv.org/abs/2604.00947
Authors: Yuma Toji, Jun Takahashi, Vwani Roychowdhury, Hideyuki Miyahara
Affiliations: Hokkaido University; The University of Tokyo; University of California, Los Angeles
Subjects: Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)
Comments:

Abstract:Since the random language model was proposed by E. DeGiuli [Phys. Rev. Lett. 122, 128301], language models have been investigated intensively from the viewpoint of statistical mechanics. Recently, the existence of a Berezinskii–Kosterlitz–Thouless transition was numerically demonstrated in models with long-range interactions between symbols. In statistical mechanics, it has long been known that long-range interactions can induce phase transitions. Therefore, it has remained unclear whether phase transitions observed in language models originate from genuinely linguistic properties that are absent in conventional spin models. In this study, we construct a random language model with short-range interactions and numerically investigate its statistical properties. Our model belongs to the class of context-sensitive grammars in the Chomsky hierarchy and allows explicit reference to contexts. We find that a phase transition occurs even when the model refers only to contexts whose length remains constant with respect to the sentence length. This result indicates that finite-temperature phase transitions in language models are genuinely induced by the intrinsic nature of language, rather than by long-range interactions.

[NLP-17] Positional Cognitive Specialization: Where Do LLMs Learn To Comprehend and Speak Your Language? AAAI26

[Quick Read]: This paper tackles the high cost and opacity of adapting large language models (LLMs) to new languages by investigating the mechanisms through which models acquire multilingual ability during training. Using layer ablation sweeps, the authors identify the layers responsible for language perception (input comprehension) and production (output generation). Based on the observed specialization patterns, they propose CogSym, a layer-wise heuristic: fine-tuning only the 25% outermost early and late layers achieves downstream task performance within 2-3% of full fine-tuning, substantially reducing compute and improving the efficiency of multilingual adaptation.

Link: https://arxiv.org/abs/2604.00923
Authors: Luis Frentzen Salim, Lun-Wei Ku, Hsing-Kuo Kenneth Pao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to AAAI26 Main

Abstract:Adapting large language models (LLMs) to new languages is an expensive and opaque process. Understanding how language models acquire new languages and multilingual abilities is key to achieve efficient adaptation. Prior work on multilingual interpretability research focuses primarily on how trained models process multilingual instructions, leaving unexplored the mechanisms through which they acquire new languages during training. We investigate these training dynamics on decoder-only transformers through the lens of two functional cognitive specializations: language perception (input comprehension) and production (output generation). Through experiments on low-resource languages, we demonstrate how perceptual and productive specialization emerges in different regions of a language model by running layer ablation sweeps from the model’s input and output directions. Based on the observed specialization patterns, we propose CogSym, a layer-wise heuristic that enables effective adaptation by exclusively fine-tuning a few early and late layers. We show that tuning only the 25% outermost layers achieves downstream task performance within 2-3% deviation from the full fine-tuning baseline. CogSym yields consistent performance with adapter methods such as LoRA, showcasing generalization beyond full fine-tuning. These findings provide insights to better understand how LLMs learn new languages and push toward accessible and inclusive language modeling.
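The layer-selection heuristic described above is easy to sketch. The snippet below is a minimal illustration, assuming the 25% of tuned layers are split evenly between the early (input-side) and late (output-side) ends of the stack; the function name and the exact split are our assumptions, not details from the paper.

```python
def cogsym_trainable_layers(num_layers, fraction=0.25):
    """Indices of the outermost `fraction` of layers, split evenly between
    the input (early) and output (late) ends of the stack; everything in
    between stays frozen during fine-tuning."""
    k = max(1, round(num_layers * fraction / 2))  # layers per end
    early = list(range(k))
    late = list(range(num_layers - k, num_layers))
    return early + late

# For a 32-layer decoder, tuning the outermost 25% means 8 layers:
layers = cogsym_trainable_layers(32)
# -> [0, 1, 2, 3, 28, 29, 30, 31]
```

In a real fine-tuning loop, one would set `requires_grad = False` on every parameter whose layer index is not in this list.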

[NLP-18] GPT-NL Public Corpus: A Permissively Licensed Dutch-First Dataset for LLM Pre-training LREC2026

[Quick Read]: This paper addresses the lack of high-quality, legally compliant, and commercially usable Dutch resources in current LLM training data. The key to the solution is the construction and public release of the GPT-NL Public Corpus, a corpus of 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus, complemented by roughly 207B English, 232B code, and 48B German/Danish tokens, all curated for compliance. The corpus combines curated data from existing large corpora such as Common Crawl with newly created Dutch-specific collections (partly contributed by partner organisations or synthetically augmented); all data comes from permissively licensed sources and is redistributed under a CC-BY license, providing a reliable data foundation for building lawful, useful, and non-harmful language models.

Link: https://arxiv.org/abs/2604.00920
Authors: Jesse van Oort, Frank Brinkkemper, Erik de Graaf, Bram Vanroy, Saskia Lensink
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026

Abstract:We present the GPT-NL Public Corpus, the biggest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B Code, and 48B German/Danish tokens taken from existing sets which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus and Common Crawl, as well as newly created Dutch-specific collections. Most newly created Dutch collections consist of content collected in collaboration with organisations or synthetically augmented content. All data is collected and evaluated with the aim of facilitating the creation of (commercial) language models that are lawful, useful and non-harmful. All data included in the GPT-NL Public Corpus is sourced from datasets with permissive licensing and is curated and redistributed under a CC-BY license. The full dataset is publicly available on the Hugging Face Hub.

[NLP-19] Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

[Quick Read]: This paper addresses the difficulty of aligning 2D assembly diagrams with real video frames in mixed reality (MR) settings, a prerequisite for intelligent assistants that monitor assembly progress, detect errors, and provide step-by-step guidance. The core challenge is the pronounced "depiction gap" between diagrams and video frames, which share few visual features and therefore defeat existing vision language models (VLMs). The key contribution is IKEA-Bench, a benchmark of 1,623 questions over 6 task types on 29 IKEA furniture products, used to systematically evaluate 19 VLMs (2B-38B parameters) under three alignment strategies. The experiments show that visual encoding is the primary target for improving cross-depiction robustness and that architecture family predicts alignment accuracy more strongly than parameter count; mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning, weakening the original visual alignment.

Link: https://arxiv.org/abs/2604.00913
Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao
Affiliations: Aalto University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: this https URL

[NLP-20] When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

[Quick Read]: This paper addresses how large language model (LLM) agents executing long-horizon, environmentally grounded tasks such as web navigation should handle mid-task user interruptions (adding requirements, revising goals, or retracting instructions). Existing benchmarks mostly assume uninterrupted execution or study interruptions only in short text-only tasks, failing to reflect the dynamic interaction needs of realistic deployment. The key contribution is a systematic formalization of interruption types and InterruptBench, the first benchmark for this setting, built on WebArena-Lite with high-quality interruption scenarios synthesized under strict semantic constraints. Using a unified interruption simulation framework, the authors evaluate several strong LLMs under single- and multi-turn interruptions, measuring both intent adaptation and recovery efficiency, and show that handling interruptions in long-horizon tasks remains challenging even for powerful LLMs.

Link: https://arxiv.org/abs/2604.00892
Authors: Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu
Affiliations: University of Illinois Chicago; McGill University; MBZUAI; University of California Santa Barbara; University of Southern California
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging for powerful large-scale LLMs. Code and dataset are available at this https URL.

[NLP-21] Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models

[Quick Read]: This paper addresses the underdeveloped logical inference in geometric problem solving (GPS): existing models handle diagram understanding and symbolic manipulation but typically follow a single chain of thought, lacking multi-path reasoning and numerical verification. The key to the proposed MARS-GPS framework is to generate multiple parallel reasoning rollouts, verify them numerically via Python code execution, rank them using token-level entropy as a confidence signal, and aggregate answers through a multi-stage voting and self-verification pipeline, raising accuracy on Geometry3K to 88.8%, a nearly 11% improvement over the prior state of the art.

Link: https://arxiv.org/abs/2604.00890
Authors: Md. Abu Bakor Siddique, Shahrin Hossain, Sadman Ahmed Siam, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review, 4 figures, 7 tables

Abstract:Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have either taken a neural, symbolic or neuro-symbolic approach. But this solves only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in hitherto existing models, this paper proposes MARS-GPS, that generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on ablation subset). We provide our code and data in an anonymous repository: this https URL.
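The rollout-voting idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: it treats the mean entropy contribution of each rollout's sampled-token probabilities as an inverse confidence signal, and all names and numbers are hypothetical.

```python
import math
from collections import defaultdict

def entropy_weighted_vote(rollouts):
    """Aggregate parallel reasoning rollouts. Each rollout is a pair
    (answer, token_probs), where token_probs are the probabilities the
    model assigned to its sampled tokens. Lower mean token entropy is
    treated as higher confidence; answers are ranked by the summed
    confidence of the rollouts that produced them."""
    scores = defaultdict(float)
    for answer, probs in rollouts:
        # Mean per-token entropy contribution of the sampled tokens.
        entropy = -sum(p * math.log(p) for p in probs) / len(probs)
        scores[answer] += 1.0 / (1.0 + entropy)  # confidence weight
    return max(scores, key=scores.get)

rollouts = [
    ("12", [0.9, 0.95, 0.8]),   # confident, consistent answer
    ("12", [0.85, 0.9, 0.9]),
    ("15", [0.4, 0.5, 0.3]),    # low-confidence outlier
]
best = entropy_weighted_vote(rollouts)   # the two confident "12" rollouts win
```

In the paper's full pipeline, answers would additionally be checked by executing generated Python code before voting; the sketch keeps only the ranking-and-voting step.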

[NLP-22] PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

[Quick Read]: This paper addresses the heavy computational cost that high-resolution inputs impose on vision-language models (VLMs) for document understanding and graphical user interface (GUI) interaction, where fine-grained text and small UI elements demand tens of thousands of visual tokens and inflate both inference and training time. The key to the proposed PixelPrune is to exploit pixel-level redundancy via predictive coding, pruning duplicate image patches before the Vision Transformer (ViT) encoder. Because it operates directly in pixel space before any neural computation, PixelPrune requires no learnable parameters and no training, supports lossless (τ = 0) and controlled lossy (τ > 0) compression, and accelerates the full pipeline including the ViT encoder and the downstream LLM; experiments show up to 4.2x inference speedup and 1.9x training acceleration while maintaining task accuracy.

Link: https://arxiv.org/abs/2604.00886
Authors: Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu
Affiliations: OPPO AI Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful – across document and GUI benchmarks, only 22–71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression (τ = 0) as well as controlled lossy compression (τ > 0). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2× inference speedup and 1.9× training acceleration. Code is available at this https URL.
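The pixel-level deduplication underlying the lossless (τ = 0) setting can be sketched in a few lines: hash each non-overlapping patch's raw bytes and keep only first occurrences, recording an index map so the original layout can be restored. This is an illustrative sketch, not the paper's implementation; the patch size and helper names are assumptions.

```python
import numpy as np

def prune_duplicate_patches(image, patch=14):
    """Split an image into non-overlapping patches and keep only the first
    occurrence of each pixel-identical patch (the lossless, tau = 0 case).
    Returns the kept patches plus, for every patch position in row-major
    order, the index of the kept patch it maps to."""
    h, w, _ = image.shape
    kept, index_map, seen = [], [], {}
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            key = image[y:y + patch, x:x + patch].tobytes()  # exact pixel identity
            if key not in seen:
                seen[key] = len(kept)
                kept.append(image[y:y + patch, x:x + patch])
            index_map.append(seen[key])
    return np.stack(kept), index_map

# Toy image whose right half duplicates its left half: 8 patch positions,
# but only 4 pixel-unique patches survive pruning.
tile = np.random.default_rng(0).integers(0, 255, (28, 28, 3), dtype=np.uint8)
img = np.concatenate([tile, tile], axis=1)        # shape (28, 56, 3)
kept, idx = prune_duplicate_patches(img, patch=14)
```

Only the kept patches would be fed to the ViT encoder; the index map lets positional information be reconstructed for the duplicated positions.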

[NLP-23] KUET at StanceNakba Shared Task: StanceMoE: Mixture-of-Experts Architecture for Stance Detection LREC’26

[Quick Read]: This paper addresses the limitation of transformer-based stance detection models whose unified representations may not sufficiently capture heterogeneous linguistic signals such as contrastive discourse structures, framing cues, and salient lexical indicators. The key to the proposed StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built on a fine-tuned BERT encoder, is to integrate six specialized expert modules that capture complementary linguistic signals and a context-aware gating mechanism that dynamically weights expert contributions, enabling adaptive routing based on input characteristics and more accurate detection of stances toward actors that are only implicit in the text.

Link: https://arxiv.org/abs/2604.00878
Authors: Abdullah Al Shafi, Md. Milon Islam, Sk. Imran Hossain, K. M. Azharul Hasan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for workshop proceedings of the 15th International Conference on Language Resources and Evaluation (LREC’26)

Abstract:Actor-level stance detection aims to determine an author’s expressed position toward specific geopolitical actors mentioned or implicated in a text. Although transformer-based models have achieved relatively good performance in stance classification, they typically rely on unified representations that may not sufficiently capture heterogeneous linguistic signals, such as contrastive discourse structures, framing cues, and salient lexical indicators. This motivates the need for adaptive architectures that explicitly model diverse stance-expressive patterns. In this paper, we propose StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built upon a fine-tuned BERT encoder for actor-level stance detection. Our model integrates six expert modules designed to capture complementary linguistic signals, including global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts. A context-aware gating mechanism dynamically weights expert contributions, enabling adaptive routing based on input characteristics. Experiments are conducted on the StanceNakba 2026 Subtask A dataset, comprising 1,401 annotated English texts where the target actor is implicit in the text. StanceMoE achieves a macro-F1 score of 94.26%, outperforming traditional baselines and alternative BERT-based variants.

[NLP-24] Agentic Tool Use in Large Language Models

[Quick Read]: This paper addresses the fragmentation of current tool-use methods for large language models (LLMs) deployed as autonomous agents, which lack a unified framework and thereby limit real-world effectiveness. The key to the solution is to organize existing methods into three paradigms: prompting as plug-and-play, supervised tool learning, and reward-driven tool policy learning, and to analyze the strengths and failure modes of each, review the evaluation landscape, and highlight key challenges, providing a structured evolutionary view that consolidates and advances research on agentic tool use.

Link: https://arxiv.org/abs/2604.00835
Authors: Jinchao Hu (1), Meizhi Zhong (2), Kehai Chen (1), Xuefeng Bai (1), Min Zhang (1) ((1) Harbin Institute of Technology, Shenzhen, China; (2) TikTok Inc, Beijing, China)
Affiliations: Harbin Institute of Technology, Shenzhen; TikTok Inc
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models are increasingly being deployed as autonomous agents, yet their real-world effectiveness depends on reliable tools for information retrieval, computation, and external action. Existing studies remain fragmented across tasks, tool types, and training settings, lacking a unified view of how tool-use methods differ and evolve. This paper organizes the literature into three paradigms: prompting as plug-and-play, supervised tool learning, and reward-driven tool policy learning; analyzes their methods, strengths, and failure modes; reviews the evaluation landscape; and highlights key challenges, aiming to address this fragmentation and provide a more structured evolutionary view of agentic tool use.

[NLP-25] LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

[Quick Read]: This paper addresses the degradation of native linguistic capability when a pretrained language model (LM) is adapted into a vision-language model (VLM), caused by cross-modal interference and representation shift. Prior recovery approaches typically add extra modules as intermediate alignment layers to isolate modality-specific subspaces, which increases architectural complexity, adds inference-time parameter overhead, and limits flexibility across models and settings. The key innovation of the proposed adapter-free distillation method, LinguDistill, is layer-wise KV-cache sharing, which lets the frozen original LM act as a teacher that receives the student's multimodal representations and provides linguistic supervision without modifying either model's architecture. On this basis, the teacher's linguistic strengths are selectively distilled on language-intensive data, recovering about 10% of the performance lost on language and knowledge benchmarks while preserving performance on vision-heavy tasks.

Link: https://arxiv.org/abs/2604.00829
Authors: Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student’s multimodal representations without modifying the architecture of either model. We then selectively distill the teacher’s strong linguistic signal on language-intensive data to recover language capability, while preserving the student’s visual grounding on multimodal tasks. As a result, LinguDistill recovers about 10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.

[NLP-26] Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

[Quick Read]: This paper addresses the limited multi-dimensional emotion reasoning of existing emotion understanding benchmarks, which rely on short texts and predefined labels and ignore the structured dependencies among emotions arising from context, interpersonal relations, and situational cues. The key to the solution is EmoScene, a dataset of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector grounded in Plutchik's basic emotions, together with a Bayesian inference framework that incorporates emotion co-occurrence statistics: joint posterior inference improves the structural consistency of predictions and strengthens context-aware multi-label emotion prediction.

Link: https://arxiv.org/abs/2604.00819
Authors: Hemanth Kotaprolu, Kishan Maharaj, Raey Zhao, Abhijit Mishra, Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay, Mumbai, India; University of Texas at Austin, Texas, United States; IBM Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages in total, 8 figures, 2 tables

Abstract:Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik’s basic emotions. We evaluate six instruction-tuned large language models in a zero-shot setting and observe modest performance, with the best model achieving a Macro F1 of 0.501, highlighting the difficulty of context-aware multi-label emotion prediction. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing improves structural consistency of predictions and yields notable gains for weaker models (e.g., +0.051 Macro F1 for Qwen2.5-7B). EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.
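As a rough illustration of entanglement-aware post-processing, the sketch below interpolates a model's independent per-emotion probabilities with the support each emotion receives from co-occurring ones. The update rule, the `alpha` interpolation, and all numbers are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]  # Plutchik's 8

def cooccurrence_posterior(p, C, alpha=0.5):
    """One-step entanglement-aware update: interpolate each emotion's
    independent probability p[i] with the support it receives from
    co-occurring emotions, where C[j, i] estimates P(emotion i present |
    emotion j present) from training co-occurrence counts. `alpha`
    controls how much the co-occurrence prior is trusted."""
    support = p @ C                      # expected support from other labels
    post = (1 - alpha) * p + alpha * p * support
    return post / post.max()             # rescale so thresholds stay comparable

# Hypothetical model outputs: joy and anticipation likely, trust borderline.
p = np.array([0.7, 0.35, 0.1, 0.2, 0.05, 0.05, 0.05, 0.6])
C = np.full((8, 8), 0.1)                 # weak background co-occurrence
C[0, 1] = C[1, 0] = 0.8                  # joy and trust co-occur strongly
C[0, 7] = C[7, 0] = 0.7                  # joy and anticipation co-occur strongly
post = cooccurrence_posterior(p, C)
# Borderline "trust" is pulled up relative to "surprise" by its strong
# co-occurrence with the high-probability "joy".
```

The point of the sketch is only the mechanism: emotions that frequently co-occur with confidently predicted ones receive a boost, which is how co-occurrence statistics can repair structurally inconsistent independent predictions.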

[NLP-27] Routing-Free Mixture-of-Experts

[Quick Read]: This paper addresses the rigid inductive biases introduced by the centralized routing mechanisms of standard Mixture-of-Experts (MoE) models, which typically involve an external router, Softmax activation, Top-K selection, and load balancing, limiting flexibility and scalability. The key to the proposed Routing-Free MoE is to eliminate all hard-coded centralized components, encapsulating all activation functionality within individual experts and optimizing directly through continuous gradient flow so that each expert determines its own activation; a unified adaptive load-balancing framework jointly optimizes expert-level and token-level balancing objectives through a configurable interpolation, enabling flexible and customizable resource allocation. Experiments show better scalability and robustness than baseline models.

Link: https://arxiv.org/abs/2604.00801
Authors: Yilun Liu, Jinru Han, Sikuan Yan, Volker Tresp, Yunpu Ma
Affiliations: Ludwig Maximilian University of Munich; University of California, Los Angeles; Munich Center for Machine Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL

Abstract:Standard Mixture-of-Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing-Free MoE, which eliminates any hard-coded centralized designs, including external routers, Softmax, Top-K, and load balancing, instead encapsulating all activation functionalities within individual experts and optimizing them directly through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load-balancing framework to simultaneously optimize both expert-balancing and token-balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing-Free MoE can consistently outperform baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design and optimization.
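A minimal sketch of the routing-free idea: each expert computes its own scalar activation gate internally, so no external router, Softmax over experts, or Top-K selection is needed; expert outputs are simply summed. The gate parameterization and dimensions below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SelfGatedExpert:
    """An expert that decides its own activation: a scalar gate is computed
    from the input inside the expert, with no centralized router."""
    def __init__(self, d_in, d_hidden):
        self.w1 = rng.normal(0, 0.02, (d_in, d_hidden))
        self.w2 = rng.normal(0, 0.02, (d_hidden, d_in))
        self.gate_w = rng.normal(0, 0.02, d_in)

    def __call__(self, x):
        g = sigmoid(x @ self.gate_w)        # expert's own activation strength
        h = np.maximum(x @ self.w1, 0.0)    # ReLU feed-forward
        return g * (h @ self.w2), g

def routing_free_moe(x, experts):
    # No Softmax, no Top-K: gated outputs are summed, and the per-expert
    # gates are trained end-to-end by ordinary gradient flow.
    outs, gates = zip(*(e(x) for e in experts))
    return sum(outs), np.array(gates)

x = rng.normal(0.0, 1.0, 16)
experts = [SelfGatedExpert(16, 32) for _ in range(4)]
y, gates = routing_free_moe(x, experts)
```

Because every gate is a smooth sigmoid rather than a discrete Top-K choice, the whole mixture stays differentiable, which is what "continuous gradient flow" refers to in the abstract.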

[NLP-28] Multimodal Language Models Cannot Spot Spatial Inconsistencies

[Quick Read]: This paper addresses the insufficient ability of multimodal large language models (MLLMs) to reason about 3D geometry across views, in particular to identify the object that violates 3D motion consistency between two views of the same scene. The key to the solution is a simple, scalable method for generating realistic but spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Experiments show that state-of-the-art MLLMs significantly underperform human observers and vary substantially across scene attributes, revealing a fragile and incomplete understanding of 3D structure.

Link: https://arxiv.org/abs/2604.00799
Authors: Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash
Affiliations: University of California, Davis; Shell
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.

[NLP-29] Valency Classification of Mapudungun Verbal Roots. Established by the Language's Own Morphotactics

[Quick Read]: This paper addresses the classification of the syntactic valency of verbal roots in Mapudungun (the Mapuche language) in order to improve the accuracy of the morphological analyser Dungupeyum. The key to the solution is to rely on the language's own morphotactics: by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in actual verb forms, the roots confirmed as belonging to the verbal category are systematically classified by valency.

Link: https://arxiv.org/abs/2604.00789
Authors: Andrés Chandía
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In the previous work, a lexical (re)categorisation – or confirmation of the given category – of roots identified as verbal was undertaken to determine their original category accurately. Building on this, the present paper offers an account of the valency classification of those Mapudungun roots confirmed to be verbal, using the language’s own morphotactics; specifically, by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work conducted thus far, the results presented here aim to improve the morphological analyser (Dungupeyum) with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of issues related to the valency of Mapuche verb forms.

[NLP-30] From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

[Quick Read]: This paper addresses the frequent failure of large language models (LLMs) on elementary symbolic tasks such as counting characters in a word, despite their strong performance on complex benchmarks. The study finds that these failures stem not from missing representations or insufficient scale but from structured interference within the model's computation graph: early and mid layers correctly encode character information, yet a small set of "negative circuits" in later layers (especially the penultimate and final-layer MLPs) suppresses the correct signal in favor of higher-probability but incorrect predictions. The key contribution is identifying and characterizing these negative circuits, showing that the LLM forward pass implements a form of competitive decoding in which correct and incorrect hypotheses coexist and are dynamically reweighted, with the final output determined as much by suppression as by amplification, providing grounding and intervention targets for interpretability and robustness.

Link: https://arxiv.org/abs/2604.00778
Authors: Ayan Datta, Mounika Marreddy, Alexander Mehler, Zhixue Zhao, Radhika Mamidi
Affiliations: IIIT Hyderabad; Goethe University, Frankfurt am Main, Germany; University of Sheffield, UK
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., “How many p’s are in apple?”) as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model’s computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring need for design strategies that ensure information is encoded and reliably used. 

[NLP-31] From Baselines to Preferences: A Comparative Study of LoRA/QLoRA and Preference Optimization for Mental Health Text Classification

【Quick Read】: This paper tackles the lack of clear guidance on which optimization strategy to use, when, and why, for mental-health text classification. The key is a systematic comparative study that moves from strong baselines through parameter-efficient fine-tuning (LoRA/QLoRA) to preference-based optimization (DPO, ORPO, and KTO), analyzing how performance changes with objective formulation, adapter choice, optimizer behavior, context-window length, and class-balance interventions. The results show that optimization effects are highly method- and configuration-dependent, with preference optimization especially sensitive to the objective. The paper therefore advocates starting from transparent baselines, applying controlled tuning, and using preference optimization only where its gains are demonstrable, yielding a reproducible, practically grounded framework for choosing training strategies.

Link: https://arxiv.org/abs/2604.00773
Authors: Mihael Arcan
Affiliations: Home Lab
Subjects: Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:Mental health text classification has rapidly adopted modern adaptation methods, yet practical guidance on which optimization strategy to use, when, and why remains limited. This paper presents a systematic comparative study of optimization pathways for a joint mental-health classification task, moving from strong vanilla baselines to progressively more specialized techniques. We first establish classical and encoder references, then examine parameter-efficient supervised fine-tuning with LoRA/QLoRA under multiple objective and optimization settings, and finally evaluate preference-based optimization with DPO, ORPO, and KTO, including class-rebalanced training. Rather than emphasizing a single headline score, we focus on methodological insight: how performance changes with objective formulation, adapter choice, optimizer behavior, context windowing, and class-balance intervention. The results show that optimization effects are highly method-dependent: some approaches deliver stable, transferable gains, while others are sensitive to configuration and data balance. Preference optimization, in particular, exhibits large variation across objectives, indicating that method selection is more consequential than simply adding a preference-training stage. The central contribution is a clear optimization narrative for mental health NLP: start from transparent baselines, apply controlled tuning, and use preference optimization selectively where its gains are demonstrable. This provides a reproducible and practically grounded framework for choosing effective training strategies beyond architecture choice alone.

[NLP-32] Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

【Quick Read】: This paper addresses the limited expressivity of efficient attention mechanisms such as sliding-window attention (SWA) in long-sequence modeling: a fixed local window cannot mediate global information exchange, which caps model performance. The core solution, Stochastic Attention (SA), randomly permutes the token sequence before each layer's windowed attention and restores the original order afterward, turning the fixed local receptive field into a stochastic global one within the same O(nw) per-layer budget as SWA. Because each layer samples its permutation independently, receptive fields grow exponentially with depth, covering the full sequence in O(log_w n) layers versus the O(n/w) layers SWA requires. The design is inspired by the stochastic long-range "shortcuts" observed in the fruit-fly whole-brain connectome and serves as a practical drop-in enhancement for efficient attention.

Link: https://arxiv.org/abs/2604.00754
Authors: Zehao Jin, Yanan Sui
Affiliations: Tsinghua University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

View Abstract

Abstract:The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network’s long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same O(nw) per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in O(log_w n) layers versus O(n/w) for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
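The permute, attend, then un-permute recipe described above can be sketched in a few lines. This is a minimal NumPy illustration under simplifying assumptions (single head, no query/key/value projections, a symmetric window), not the authors' implementation:

```python
import numpy as np

def sliding_window_attention(x, w):
    """Naive single-head local attention: token i attends to a window of ~w tokens around it."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - w // 2), min(n, i + w // 2 + 1)
        scores = x[lo:hi] @ x[i] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ x[lo:hi]
    return out

def stochastic_attention(x, w, rng):
    """SA: shuffle tokens, run the same windowed attention, restore the original order."""
    perm = rng.permutation(x.shape[0])
    inv = np.argsort(perm)
    return sliding_window_attention(x[perm], w)[inv]
```

A useful sanity check on the sketch: when the window covers the whole sequence, windowed attention is full attention, and the permutation followed by its inverse leaves the output unchanged; for smaller windows, the permutation is what makes each token's window a random global sample.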

[NLP-33] LangMARL: Natural Language Multi-Agent Reinforcement Learning

【Quick Read】: This paper targets the difficulty large language model (LLM) agents have in autonomously evolving coordination strategies in dynamic environments, where coarse global outcomes obscure the causal signals needed for local policy refinement. The key idea of LangMARL is to bring credit assignment and policy-gradient evolution from classical multi-agent reinforcement learning (MARL) into the language space: agent-level language credit assignment provides fine-grained feedback, policy improvement is driven by a first realization of gradient-style evolution in language space, and task-relevant causal relations summarized from replayed trajectories supply dense feedback that improves convergence and generalization under sparse rewards.

Link: https://arxiv.org/abs/2604.00722
Authors: Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen, Hua Wei
Affiliations: Arizona State University; Cisco Research; University of North Carolina at Chapel Hill
Subjects: Computation and Language (cs.CL)
Comments: 20 pages, 12 figures

View Abstract

Abstract:Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi-agent credit assignment problem, which has long been studied in classical multi-agent reinforcement learning (MARL) but remains underaddressed in LLM-based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent-level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi-agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.

[NLP-34] To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

【速读】: 该论文旨在解决生成式 AI(Generative AI)中参数化知识(parametric knowledge)与非参数化检索知识(non-parametric knowledge)之间的协同关系不明确的问题,尤其是在固定数据预算下如何平衡预训练语料规模与检索库大小。其解决方案的关键在于系统性地研究不同模型规模(30M–3B参数)和数据规模(最多100B tokens)下的性能表现,提出一个三维缩放框架(three-dimensional scaling framework),将模型性能建模为模型尺寸、预训练token数量和检索语料规模的函数,从而量化评估在特定任务类型和预训练饱和度条件下,检索对性能提升的边际效用,并据此优化数据资源分配策略。

链接: https://arxiv.org/abs/2604.00715
作者: Karan Singh,Michael Yu,Varun Gangal,Zhuofu Tao,Sachin Kumar,Emmy Liu,Steven Y. Feng
机构: Stanford University (斯坦福大学); Independent Researcher (独立研究员); Patronus AI; The Ohio State University (俄亥俄州立大学); Carnegie Mellon University (卡内基梅隆大学); DegenAI Labs (DegenAI实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code and data at this https URL

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
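A scaling surface over model size N, pretraining tokens D, and retrieval-store size R, of the kind the abstract describes, can be illustrated with a Chinchilla-style additive power law extended by a retrieval term. The functional form and every coefficient below are illustrative assumptions, not the paper's fitted law:

```python
def scaling_loss(N, D, R, E=1.0, a=400.0, alpha=0.34, b=400.0, beta=0.28, c=50.0, gamma=0.10):
    """Hypothetical loss surface: irreducible term E plus power-law terms in
    model size N, pretraining tokens D, and retrieval corpus tokens R."""
    return E + a / N**alpha + b / D**beta + c / R**gamma

def best_split(N, B, steps=99):
    """Grid-search the fraction of a fixed data budget B to spend on pretraining
    (the remainder goes to the retrieval store), mimicking the paper's
    budget-allocation question on this toy surface."""
    fractions = [i / (steps + 1) for i in range(1, steps + 1)]
    return min(fractions, key=lambda f: scaling_loss(N, f * B, (1 - f) * B))
```

On such a surface the marginal utility of retrieval shrinks as D grows (pretraining saturation), which is the qualitative behavior the paper's real, fitted manifold is used to quantify.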

[NLP-35] AfrIFact: Cultural Information Retrieval, Evidence Extraction, and Fact Checking for African Languages

【Quick Read】: This paper addresses the difficulty of verifying the veracity of online information in low-resource language settings, where limited access to information amplifies the spread of health- and culture-related misinformation in African languages. The core solution is the AfrIFact dataset, which covers the three key steps of automatic fact-checking (information retrieval, evidence extraction, and fact checking) across ten African languages and English. The evaluations show that current embedding models lack cross-lingual retrieval capabilities and that healthcare-domain documents are harder to retrieve than cultural and news documents; LLMs likewise lack robust multilingual fact-verification in African languages, but few-shot prompting improves performance by up to 43%, and task-specific fine-tuning raises fact-checking accuracy by up to a further 26%. These findings and the released dataset provide benchmarks and improvement paths for low-resource retrieval and fact checking.

Link: https://arxiv.org/abs/2604.00706
Authors: Israel Abebe Azime, Jesujoba Oluwadara Alabi, Crystina Zhang, Iffat Maab, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Folasade Peace Alabi, Salomey Osei, Saminu Mohammad Aliyu, Nkechinyere Faith Aguobi, Bontu Fufa Balcha, Blessing Kudzaishe Sibanda, Davis David, Mouhamadane Mboup, Daud Abolade, Neo Putini, Philipp Slusallek, David Ifeoluwa Adelani, Dietrich Klakow
Affiliations: Saarland University, Saarland Informatics Campus, Germany; University of Waterloo, Canada; National Institute of Informatics, Japan; University College London, England; Instituto Politécnico Nacional, Mexico; University of Ilorin, Nigeria; Universidad de Deusto, Spain; Bayero University, Nigeria; University of Lagos, Nigeria; Black Swan; Addis Ababa University, Ethiopia; Université Alioune Diop, Senegal; McGill University, Mila-Quebec AI Institute, Canada CIFAR AI Chair; Masakhane NLP
Subjects: Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at communities with limited access to information and the content concerns issues such as healthcare and culture, the consequences intensify, especially in low-resource languages. In this work, we introduce AfrIFact, a dataset that covers the necessary steps for automatic fact-checking (i.e., information retrieval, evidence extraction, and fact checking), in ten African languages and English. Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single documents. We show that LLMs lack robust multilingual fact-verification capabilities in African languages, while few-shot prompting improves performance by up to 43% in AfriqueQwen-14B, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%. These findings, along with our release of the AfrIFact dataset, encourage work on low-resource information retrieval, evidence retrieval, and fact checking.

[NLP-36] Learning to Hint for Reinforcement Learning

【Quick Read】: This paper addresses advantage collapse in Group Relative Policy Optimization (GRPO): when every rollout in a group receives the same reward (e.g., all zeros), no relative advantage can be computed and the policy receives no learning signal. The proposed Hint Learning for Reinforcement Learning (HiLL) framework jointly trains a hinter policy, which generates hints online conditioned on the current reasoner's incorrect rollouts and thereby adapts to its evolving errors, together with the reasoner policy itself. Its key innovation is "hint reliance," which quantifies how strongly correct trajectories depend on the hint and drives a transfer-weighted reward, so that hint generation not only recovers informative GRPO groups but also improves the no-hint policy used at test time.

Link: https://arxiv.org/abs/2604.00698
Authors: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
Affiliations: University of California, San Diego; Snowflake AI Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner’s incorrect rollout, allowing hint generation to adapt to the reasoner’s evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at this https URL.
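The advantage-collapse failure mode falls straight out of the group-normalized advantage commonly used in GRPO. A minimal sketch (formulations vary in details such as the epsilon added to the standard deviation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each rollout's reward standardized within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in the group ties, e.g. all incorrect on a hard question, every advantage is exactly zero and the update vanishes; that is precisely the case in which HiLL's hinter injects a hint to recover a mixed-outcome group.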

[NLP-37] OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

【Quick Read】: This paper addresses the difficulty of balancing broad language coverage with speech quality in multilingual zero-shot text-to-speech (TTS), where conventional two-stage (text-to-semantic-to-acoustic) discrete non-autoregressive (NAR) models hit performance bottlenecks due to their complex pipelines. The key solution is OmniVoice, a novel diffusion-language-model-style discrete NAR architecture that maps text directly to multi-codebook acoustic tokens, simplifying the generation pipeline. Two technical innovations underpin it: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained large language model (LLM) to ensure intelligibility. The design scales to over 600 languages and achieves state-of-the-art performance on Chinese, English, and diverse multilingual benchmarks.

Link: https://arxiv.org/abs/2604.00688
Authors: Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey
Affiliations: Xiaomi Corp.
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

View Abstract

Abstract:We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at this https URL.
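One plausible reading of the full-codebook random masking strategy (an assumption for illustration; the paper's exact scheme may differ) is that masked time steps are shared jointly across all codebooks of the acoustic token matrix:

```python
import numpy as np

def full_codebook_random_mask(tokens, mask_id, p, rng):
    """tokens: (T, C) array of acoustic tokens over T frames and C codebooks.
    Mask a Bernoulli(p) subset of frames jointly across every codebook
    (assumed interpretation of 'full-codebook random masking')."""
    masked = tokens.copy()
    pos = rng.random(tokens.shape[0]) < p
    masked[pos, :] = mask_id
    return masked, pos
```

Under this reading, the training target is to reconstruct all C codebook entries of each masked frame at once, which is what lets a single NAR pass predict the full multi-codebook output.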

[NLP-38] TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models

【Quick Read】: This paper addresses the inefficiency of diffusion language models (DLMs) caused by suboptimal decoding trajectories: standard training gives no explicit supervision over the order in which tokens are revealed, creating a train-inference mismatch that degrades generation efficiency. The key solution is Trajectory-Ranked Instruction Masked Supervision (TRIMS), a lightweight trajectory-guided supervised fine-tuning framework that uses inexpensive signals from an autoregressive teacher to guide a trajectory-aware masking strategy, teaching the model better parallel decoding orders. Without resorting to costly DLM-based distillation, TRIMS significantly improves the accuracy-parallelism trade-off at a training cost far below existing distillation-based approaches.

Link: https://arxiv.org/abs/2604.00666
Authors: Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong
Affiliations: UIUC
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 7 figures, 1 algorithm

View Abstract

Abstract:Diffusion language models (DLMs) offer a promising path toward low-latency generation through parallel decoding, but their practical efficiency depends heavily on the decoding trajectory. In practice, this advantage often fails to fully materialize because standard training does not provide explicit supervision over token reveal order, creating a train-inference mismatch that leads to suboptimal decoding behavior. We propose Trajectory-Ranked Instruction Masked Supervision (TRIMS), a simple trajectory-guided supervised fine-tuning framework that injects trajectory supervision into standard Masked Diffusion Language Model (MDLM) training with minimal overhead. Instead of relying on costly DLM-based distillation, TRIMS uses lightweight signals from an autoregressive teacher to guide a trajectory-aware masking strategy, encouraging the model to learn more effective decoding orders. Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive performance with prior distillation-based approaches at substantially lower training cost. Further analysis shows that TRIMS leads to better decoding trajectories, validating the effectiveness of trajectory-guided supervision for DLMs.

[NLP-39] A Survey of On-Policy Distillation for Large Language Models

【Quick Read】: This survey addresses the problem of compounding prediction errors caused by "off-policy" training in knowledge distillation: students learn only from static teacher-generated data and never encounter their own mistakes during training, so exposure bias at inference time lets errors propagate through autoregressive generation. The key remedy it covers is On-Policy Distillation (OPD), in which the student generates its own trajectories and receives teacher feedback on them, grounding distillation in interactive imitation learning and aligning the training process with the actual inference setting.

Link: https://arxiv.org/abs/2604.00626
Authors: Mingyang Song, Mao Zheng
Affiliations: Tencent
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains *off-policy*: students train on static teacher-generated data and never encounter their own errors during learning. This train–test mismatch, an instance of *exposure bias*, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified f-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: *feedback signal* (logit-based, outcome-based, or self-play), *teacher access* (white-box, black-box, or teacher-free), and *loss granularity* (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
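The divergence-minimization branch of on-policy distillation is easy to state concretely: score token positions of a student-sampled trajectory under both models and minimize a reverse KL. The sketch below illustrates the general pattern on toy distributions (it is not any specific method from the survey):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def opd_token_loss(student_probs, teacher_probs):
    """Reverse KL, KL(student || teacher): mode-seeking, and evaluated at a
    position drawn from the student's own rollout rather than from teacher data."""
    return kl(student_probs, teacher_probs)
```

The asymmetry of KL is why the forward/reverse choice matters: forward KL is mass-covering, reverse KL is mode-seeking, and both are members of the f-divergence family the survey uses to unify these objectives.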

[NLP-40] English to Central Kurdish Speech Translation: Corpus Creation Evaluation and Orthographic Standardization

【Quick Read】: This paper targets the performance bottleneck in speech-to-text translation (S2TT) into Central Kurdish, focusing on the damage done by orthographic variation: nonstandard spellings significantly degrade translation quality and yield inconsistent outputs. The key solution is a systematic text standardization approach that unifies spelling and makes model inputs consistent, producing substantial gains: a fine-tuned Seamless model reaches 15.18 BLEU on a held-out test set and improves on the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. The paper also validates an end-to-end model trained from scratch and a cascaded system combining Seamless ASR with NLLB MT, offering a reusable recipe for S2TT in low-resource languages.

Link: https://arxiv.org/abs/2604.00613
Authors: Mohammad Mohammadamini, Daban Q. Jaff, Josep Crego, Marie Tahon, Antoine Laurent
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, including 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set separated from TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).

[NLP-41] Speech LLMs are Contextual Reasoning Transcribers

【Quick Read】: This paper addresses how to effectively exploit the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR), where the conventional direct speech-to-text mapping leaves their generative strengths underused. The key solution, chain-of-thought ASR (CoT-ASR), builds a reasoning chain in which the LLM first analyzes the input speech and produces a contextual analysis, then performs better-informed recognition, completing both reasoning and transcription in a single pass. To narrow the modality gap, a CTC-guided Modality Adapter weights LLM embeddings by CTC non-blank token probabilities, efficiently aligning speech-encoder outputs with the LLM's textual latent space and markedly improving accuracy. Compared with standard LLM-based ASR, CoT-ASR achieves relative reductions of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).

Link: https://arxiv.org/abs/2604.00610
Authors: Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li
Affiliations: Microsoft Core AI
Subjects: Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM’s textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
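The weighting idea behind the CTC-guided Modality Adapter can be sketched as follows. This is an assumed reading (scale each frame's embedding by its CTC non-blank probability), not the paper's exact adapter design:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ctc_guided_adapter(speech_feats, ctc_logits, blank_id=0):
    """Down-weight frames the CTC head considers blank before mapping
    speech-encoder outputs toward the LLM's embedding space.
    speech_feats: (T, d) frame embeddings; ctc_logits: (T, V) CTC logits."""
    nonblank = 1.0 - softmax(ctc_logits)[:, blank_id]
    return speech_feats * nonblank[:, None]
```

The intuition is that blank-dominated frames carry little linguistic content, so suppressing them concentrates the LLM's input on frames likely to emit tokens.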

[NLP-42] More Human, More Efficient: Aligning Annotations with Quantized SLMs

【Quick Read】: This paper addresses the difficulty of relying on proprietary large language models (LLMs) for automatic text evaluation and annotation, given their systematic biases, poor reproducibility, and data-privacy concerns. The key solution is to fine-tune a quantized 1.7B-parameter small language model on limited human-annotated data, combined with a custom multi-dimensional rubric framework and simple augmentation and regularization techniques, yielding a highly aligned, deterministic annotator and evaluator. Experiments show a 0.23-point gain in Krippendorff's α over the best-performing proprietary LLM and good generalization to a separate emotion classification task, offering an efficient, reproducible open-source alternative to closed models for evaluation and annotation.

Link: https://arxiv.org/abs/2604.00586
Authors: Jiayu Wang, Junyoung Lee
Affiliations: Home Team Science and Technology Agency
Subjects: Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and annotation. However, proprietary LLMs often exhibit systematic biases that diverge from human expert consensus, lack reproducibility, and raise data privacy concerns. Our work examines the viability of finetuning a quantized Small Language Model of 1.7B parameter size on limited human-annotated data to serve as a highly aligned, deterministic evaluator and annotator. By implementing a custom, multi-dimensional rubric framework and simple augmentation and regularization techniques, the proposed approach achieves higher inter-annotator agreement (0.23 points increase in Krippendorff’s α) than the best performing state-of-the-art proprietary LLM. We also demonstrate the generalizability of the proposed training pipeline on a separate emotion classification task. The results show that task-specific alignment and efficient 4-bit quantized fine-tuning provide a superior open-source alternative to using proprietary models for evaluation and annotation. Our finetuning approach is publicly available at this https URL.

[NLP-43] A Japanese Benchmark for Evaluating Social Bias in Reasoning Based on Attribution Theory

【Quick Read】: This paper addresses the inadequacy of existing Japanese social-bias benchmarks for large language models (LLMs): most are translated from English data and so fail to capture biases specific to Japanese culture, and they evaluate bias only in conclusions while overlooking group-attribution bias within the reasoning itself. The key solution, grounded in attribution theory from social psychology, is the new dataset JUBAKU-v2, which fixes the conclusion as a controlled variable and evaluates culture-specific bias in how behaviors are attributed to in-groups versus out-groups during reasoning. The dataset contains 216 examples reflecting biases specific to Japanese culture, and experiments confirm it detects performance differences across models more sensitively than existing benchmarks.

Link: https://arxiv.org/abs/2604.00568
Authors: Taihei Shiotani, Masahiro Kaneko, Naoaki Okazaki
Affiliations: Institute of Science Tokyo; MBZUAI; AIST; NII LLMC
Subjects: Computation and Language (cs.CL)
Comments:

View Abstract

Abstract:In enhancing the fairness of Large Language Models (LLMs), evaluating social biases rooted in the cultural contexts of specific linguistic regions is essential. However, most existing Japanese benchmarks heavily rely on translating English data, which does not necessarily provide an evaluation suitable for Japanese culture. Furthermore, they only evaluate bias in the conclusion, failing to capture biases lurking in the reasoning. In this study, based on attribution theory in social psychology, we constructed a new dataset, "JUBAKU-v2," which evaluates the bias in attributing behaviors to in-groups and out-groups within reasoning while fixing the conclusion. This dataset consists of 216 examples reflecting cultural biases specific to Japan. Experimental results verified that it can detect performance differences across models more sensitively than existing benchmarks.

[NLP-44] Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

【Quick Read】: This paper tackles three core obstacles to enterprise adoption of large language models (LLMs): hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. The key solution is a neurosymbolic architecture, implemented in the Foundation AgenticOS (FAOS) platform, that exerts structured control over LLM behavior through ontology-constrained neural reasoning. A three-layer ontological framework (Role, Domain, and Interaction ontologies) gives LLM-based enterprise agents a formal semantic grounding, and the paper formalizes "asymmetric neurosymbolic coupling": symbolic ontological knowledge constrains the input side (context assembly, tool discovery, governance thresholds), with proposed mechanisms for extending the coupling to the output side (response validation, reasoning verification, compliance checking). Empirical results show significant gains in metric accuracy, regulatory compliance, and role consistency, with the largest improvements in Vietnam-localized domains where the LLM's parametric knowledge is weakest, supporting an inverse relationship between the value of ontological grounding and the LLM's training-data coverage.

Link: https://arxiv.org/abs/2604.00555
Authors: Thanh Luong Tuan
Affiliations: Golden Gate University; Foundation AgenticOS (FAOS)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 23 pages, 7 tables, 4 figures, 33 references. Empirical evaluation: 600 runs across 5 regulated industries including Vietnamese-language domains

View Abstract

Abstract:Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework (Role, Domain, and Interaction ontologies) that provides formal semantic grounding for LLM-based enterprise agents. We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with improvements greatest where LLM parametric knowledge is weakest, particularly in Vietnam-localized domains. Our contributions include: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect that ontological grounding value is inversely proportional to LLM training data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.

[NLP-45] Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

【速读】: 该论文旨在解决生成式 AI(Generative AI)在知识密集型领域(如人文社科、医学、法律和金融)中高质量监督微调(Supervised Fine-Tuning, SFT)数据稀缺的问题。传统方法依赖人工设计的评判标准(rubric)生成合成数据,但此类规则具有领域依赖性、迁移能力差且优化过程依赖经验性试错,缺乏对合成数据训练效用的定量反馈。论文的关键解决方案是:通过引入基于梯度的优化器感知影响估计(optimizer-aware influence estimation),量化每个合成样本对目标模型在特定任务上的学习贡献,并将其作为强化学习中的奖励信号,驱动一个专门用于生成任务条件化评判标准的模型自适应优化rubric。该方法实现了无需任务特异性调参的跨域泛化性能提升,显著增强了合成数据的训练实用性。

链接: https://arxiv.org/abs/2604.00536
作者: Zhiting Fan,Ruizhe Chen,Tianxiang Hu,Ru Peng,Zenan Huang,Haokai Xu,Yixin Chen,Jian Wu,Junbo Zhao,Zuozhu Liu
机构: Zhejiang University (浙江大学); Inclusion AI; Ant Group (蚂蚁集团); Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence (浙江省医学影像人工智能重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample’s contribution to a target model’s objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.
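
摘要中的“优化器感知影响估计”可以用一阶近似来理解:一个合成样本的影响力约等于其梯度与目标任务验证梯度的内积。下面给出一个极简示意(假设性实现:假定梯度已抽取为普通向量,忽略优化器状态与模型细节,非论文原始代码):

```python
# 极简示意:用梯度内积近似单个合成样本对目标任务的训练效用
# (假设性实现:sample_grads / val_grad 均为预先抽取的梯度向量)

def dot(u, v):
    """向量内积。"""
    return sum(a * b for a, b in zip(u, v))

def influence_scores(sample_grads, val_grad, lr=1.0):
    """一阶影响力近似:score_i ≈ lr * <g_i, g_val>。
    分数为正表示该样本的梯度方向有助于降低目标任务损失。"""
    return [lr * dot(g, val_grad) for g in sample_grads]

def select_top_k(samples, scores, k):
    """按影响力分数从高到低挑选 k 个合成样本。"""
    ranked = sorted(zip(samples, scores), key=lambda p: -p[1])
    return [s for s, _ in ranked[:k]]
```

论文进一步把该分数作为强化学习奖励去优化 rubric 生成器;此处仅演示打分与筛选这一步。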

[NLP-46] MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference

【速读】: 该论文旨在解决模型在实际部署中因硬件支持或运行时约束而需动态选择数值精度的问题,传统量化感知训练(Quantization-aware training, QAT)通常仅针对单一目标数值格式进行优化,导致模型难以灵活适配多种低精度格式。其解决方案的关键在于提出多格式量化感知训练(multi-format QAT),使单个模型在训练阶段即具备对多种量化格式的鲁棒性,且能在未见过的格式上保持性能;进一步设计了Slice-and-Scale转换流程,支持MXINT和MXFP格式间无需重训练即可实现高精度到低精度的实时转换,从而构建一个可弹性扩展精度的推理管道:先以多格式QAT训练模型并存储锚点格式检查点(如MXINT8/MXFP8),再在推理时按需动态转换至目标低精度格式,实现跨设备部署的灵活性与精度一致性。

链接: https://arxiv.org/abs/2604.00529
作者: Zifei Xu,Sayeh Sharify,Hesham Mostafa
机构: d-Matrix
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Quantization-aware training (QAT) is typically performed for a single target numeric format, while practical deployments often need to choose numerical precision at inference time based on hardware support or runtime constraints. We study multi-format QAT, where a single model is trained to be robust across multiple quantization formats. We find that multi-format QAT can match single-format QAT at each target precision, yielding one model that performs well overall across different formats, even formats that were not seen during training. To enable practical deployment, we propose the Slice-and-Scale conversion procedure for both MXINT and MXFP that converts a high-precision representation into lower-precision formats without re-training. Building on this, we introduce a pipeline that (i) trains a model with multi-format QAT, (ii) stores a single anchor format checkpoint (MXINT8/MXFP8), and (iii) allows on-the-fly conversion to lower MXINT or MXFP formats at runtime with negligible or no additional accuracy degradation. Together, these components provide a practical path to elastic precision scaling and allow selecting the runtime format at inference time across diverse deployment targets.
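
Slice-and-Scale 的直觉可以用整数格式粗略示意:在同一组数共享缩放因子的前提下,丢弃低位并相应放大 scale,即可把 8-bit 表示降到 4-bit 而无需重训练。以下为假设性草图(真实流程针对 MXINT/MXFP 块格式,位宽与函数均为示意,细节以论文为准):

```python
# 极简示意:把共享 scale 的 8-bit 整数表示“切片”到 4-bit
# (假设性实现,非论文中针对 MX 块格式的原始转换流程)

def quantize_int8(xs):
    """对一组数用共享 scale 量化到有符号 8-bit。"""
    scale = max(abs(x) for x in xs) / 127 or 1.0
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return q, scale

def slice_to_int4(q8, scale8):
    """丢弃低 4 位并把 scale 放大 16 倍,得到 4-bit 表示,无需重训练。"""
    q4 = [max(-8, min(7, v >> 4)) for v in q8]
    return q4, scale8 * 16

def dequantize(q, scale):
    """反量化回浮点近似值。"""
    return [v * scale for v in q]
```

多格式 QAT 的意义正在于让模型对这类降位宽转换保持鲁棒,从而部署时可按硬件约束现场选格式。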

[NLP-47] Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

【速读】: 该论文旨在解决预训练文本大语言模型(Large Language Models, LLMs)在通过持续预训练(continual pretraining)迁移至语音任务时,导致原始文本能力显著退化的问题。解决方案的关键在于提出“多模态深度上扩增”(Multimodal Depth Upscaling)策略:在冻结的文本LLM中插入新的Transformer层,并仅对新增层进行语音数据训练,从而实现语音建模性能接近全量微调(full fine-tuning),同时大幅减少文本能力退化。进一步地,采用专为语音识别设计的E-Branchformer架构作为插入层,在更大模型上实现了与全量微调相当甚至更优的自动语音识别(ASR)性能,且文本退化降低超过75%,同时参数量减少60%。

链接: https://arxiv.org/abs/2604.00489
作者: Kazuki Yano,Jun Suzuki,Shinji Watanabe
机构: Tohoku University (东北大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.

[NLP-48] First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的对象幻觉(object hallucination)问题,即模型在生成答案时错误地引入不存在的物体。尽管已有方法如微调或外部 grounding 技术可缓解此问题,但它们通常存在数据成本高或结构复杂的问题;而免训练(training-free)方法如对比解码(Contrastive Decoding, CD)虽具成本优势,却面临长期衰减(long-term decay)现象——即随着生成推进,视觉信息逐渐弱化,语言先验占据主导,导致幻觉加剧。论文提出了一种名为首个 logit 增强(First Logit Boosting, FLB)的简单而有效的免训练(training-free)解决方案,其核心在于:存储首个生成 token 的 logits,并将其加到后续所有 token 的预测中,从而稳定保留初始视觉信息并抑制幻觉词汇(特别是借助“The”这一 token 的稳定作用),显著降低多任务、多基准和多骨干模型下的对象幻觉,且推理开销可忽略不计。

链接: https://arxiv.org/abs/2604.00455
作者: Jiwoo Ha,Jongwoo Baek,Jinhyun So
机构: DGIST EECS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 13 figures

点击查看摘要

Abstract:Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination – the generation of nonexistent objects in answers – remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the “The” token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at this https URL
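
FLB 的机制本身很直接:缓存首个生成步的 logits,并按某个系数加到后续各步的 logits 上。下面是一个假设性的纯 Python 示意(alpha 为示意超参,论文中的具体加法形式可能不同):

```python
# 极简示意:First Logit Boosting——缓存首步 logits 并加到后续各步
# (假设性实现,仅演示机制;step_logits 为每个解码步的 logits 列表)

def flb_decode(step_logits, alpha=0.5):
    """返回每步经 FLB 增强后贪心选出的 token id 序列。"""
    first = step_logits[0]          # 缓存首步 logits(视觉信息最充分的一步)
    out = []
    for t, logits in enumerate(step_logits):
        if t == 0:
            boosted = logits
        else:
            # 后续每步都叠加首步 logits,抵抗视觉信息的长期衰减
            boosted = [l + alpha * f for l, f in zip(logits, first)]
        out.append(max(range(len(boosted)), key=boosted.__getitem__))
    return out
```

当 alpha=0 时退化为普通贪心解码;alpha 越大,首步的视觉证据对后续选词的约束越强。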

[NLP-49] Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中不确定性估计(Uncertainty Estimation, UE)指标在不同配置下表现不稳定的问题,这种不稳定性源于多数UE指标基于模型行为而非输出事实正确性的显式建模,导致其在低信息量场景下失去区分能力。解决方案的关键在于提出一种后验校准方法——Truth AnChoring (TAC),通过将原始不确定性得分映射为与事实对齐的校准分数,即使在噪声和少量监督的情况下也能学习到具有良好校准性能的不确定性估计,从而提升LLM输出的可靠性。

链接: https://arxiv.org/abs/2604.00445
作者: Ponhvoan Srey,Quang Minh Nguyen,Xiaobao Wu,Anh Tuan Luu
机构: Nanyang Technological University, Singapore; KAIST, South Korea; Shanghai Jiao Tong University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Uncertainty estimation (UE) aims to detect hallucinated outputs of large language models (LLMs) to improve their reliability. However, UE metrics often exhibit unstable performance across configurations, which significantly limits their applicability. In this work, we formalise this phenomenon as proxy failure, since most UE metrics originate from model behaviour, rather than being explicitly grounded in the factual correctness of LLM outputs. With this, we show that UE metrics become non-discriminative precisely in low-information regimes. To alleviate this, we propose Truth AnChoring (TAC), a post-hoc calibration method to remedy UE metrics, by mapping the raw scores to truth-aligned scores. Even with noisy and few-shot supervision, our TAC can support the learning of well-calibrated uncertainty estimates, and presents a practical calibration protocol. Our findings highlight the limitations of treating heuristic UE metrics as direct indicators of truth uncertainty, and position our TAC as a necessary step toward more reliable uncertainty estimation for LLMs. The code repository is available at this https URL.
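
TAC 的核心是用少量带真值标注的样本,把原始不确定性分数后验地映射为与事实对齐的分数。下面用最简单的分桶经验频率示意这一“校准映射”思想(假设性实现,论文中 TAC 的映射形式更精细):

```python
# 极简示意:用少量标注样本做后验校准,把原始分数映射为真值对齐分数
# (假设性实现:分桶经验频率,非论文原始算法)

def fit_bins(scores, labels, n_bins=4):
    """scores: 原始不确定性分数(0~1);labels: 1 表示输出事实正确。
    返回每个分桶内的经验正确率,作为校准映射表。"""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        i = min(int(s * n_bins), n_bins - 1)
        bins[i].append(y)
    return [sum(b) / len(b) if b else None for b in bins]

def calibrate(score, bin_rates, n_bins=4):
    """查表得到校准后的分数;空桶退回原始分数。"""
    i = min(int(score * n_bins), n_bins - 1)
    r = bin_rates[i]
    return score if r is None else r
```

这类映射只需少量(可含噪)监督即可拟合,对应论文所说的 few-shot 校准协议。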

[NLP-50] Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics

【速读】: 该论文旨在解决神经网络中概念重叠现象的成因问题,即当同一神经元同时激活于不同语义概念(如“lender”和“riverside”)时,传统度量指标常将其归因于超位置(superposition)——认为神经元在压缩两个无关概念。然而,本文提出并验证了另一种可能:这种重叠可能是由词汇混淆(lexical confound)引起的,即神经元响应的是共享词形(如“bank”)而非两个独立语义。其解决方案的关键在于采用2×2因子分解方法,系统区分“仅词汇一致”(相同词形、不同含义)与“仅语义一致”(不同词形、相同含义)两种情况,结果表明词汇混淆在各规模模型(110M–70B参数)中均显著高于语义重叠;进一步发现该混淆现象存在于稀疏自编码器中约18–36%的特征中,且占据不到1%的激活维度,但会损害下游任务性能,剔除此类混淆可提升词义消歧准确率并增强知识编辑的选择性(p = 0.002)。

链接: https://arxiv.org/abs/2604.00443
作者: Iyad Ait Hou,Rebecca Hwa
机构: George Washington University (乔治华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages

点击查看摘要

Abstract:If the same neuron activates for both “lender” and “riverside,” standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as “bank”) rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in ≤1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).

[NLP-51] Execution-Verified Reinforcement Learning for Optimization Modeling

【速读】: 该论文旨在解决自动化优化建模中因依赖封闭源大语言模型(Large Language Models, LLMs)导致的高推理延迟,以及基于过程监督微调小模型时易过拟合单一求解器API的问题。其解决方案的关键在于提出执行验证优化建模(Execution-Verified Optimization Modeling, EVOM)框架,该框架将数学规划求解器视为确定性交互式验证器,通过生成目标求解器特定代码、在沙箱环境中执行并将其结果转化为标量奖励,利用GRPO和DAPO算法在“生成-执行-反馈-更新”的闭环过程中进行优化。此仅基于结果的建模范式无需过程级监督,并可通过切换验证环境实现跨求解器泛化,从而支持零样本求解器迁移和低成本的目标求解器适配。

链接: https://arxiv.org/abs/2604.00442
作者: Runda Guan,Xiangqing Shen,Jiajun Zhang,Yifan Zhang,Jian Cheng,Rui Xia
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision, and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.
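
EVOM 把求解器的执行结果直接转成标量奖励。下面用子进程运行一段 Python 代码来示意“执行并转奖励”这一环节(假设性实现:真实系统在沙箱中调用 Gurobi/OR-Tools/COPT 等后端,奖励的具体取值也仅为示意):

```python
# 极简示意:在子进程中执行生成的代码,并把执行结果转成标量奖励
# (假设性实现:成功且结果匹配 → 1.0;能运行但结果不符 → 0.1;
#  报错或超时 → 0.0。真实系统使用沙箱与具体求解器后端。)
import subprocess
import sys

def execution_reward(code, expected_stdout, timeout=5):
    """运行一段 Python 代码,按执行结果给出奖励。"""
    try:
        r = subprocess.run([sys.executable, "-c", code],
                           capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0
    if r.returncode != 0:
        return 0.0
    return 1.0 if r.stdout.strip() == expected_stdout else 0.1
```

这一确定性验证器正是“仅基于结果”的奖励来源:换一个求解器后端,只需替换被执行的代码与验证环境,无需过程级监督。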

[NLP-52] TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

【速读】: 该论文旨在解决上下文强化学习(In-Context Reinforcement Learning, ICRL)中因推理阶段缺乏真实奖励信号而导致的奖励估计难题。现有ICRL方法在实际应用时无法获取真实标签(ground-truth),从而限制了模型通过外部奖励进行在线优化的能力。为此,作者提出测试时重思考机制(Test-Time Rethinking for In-Context Reinforcement Learning, TR-ICRL),其核心创新在于:首先从无标签评估集中检索与查询最相关的实例,随后在每轮ICRL迭代中生成候选答案集,并通过多数投票机制构建伪标签(pseudo-label)作为代理奖励信号,进而生成形成性反馈以引导模型迭代优化;最终将合成的上下文信息与原始查询融合,形成完整提示并执行最终多数投票得出答案。该方案有效缓解了ICRL对真实奖励的依赖,显著提升了模型在推理和知识密集型任务上的性能表现。

链接: https://arxiv.org/abs/2604.00438
作者: Wenxuan Jiang,Yuxin Zuo,Zijian Zhang,Xuecheng Wu,Zining Fan,Wenxuan Liu,Li Chen,Xiaoyu Li,Xuezhi Cao,Xiaolong Jin,Ninghao Liu
机构: The Hong Kong Polytechnic University(香港理工大学); Meituan-M17; Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所); Xi’an Jiaotong University(西安交通大学); East China Normal University(华东师范大学); Northeastern University(东北大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truths during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at this https URL.
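
TR-ICRL 用候选答案的多数投票构造伪标签,再以伪标签给出代理奖励。其最小骨架大致如下(假设性实现,检索与形成性反馈文本的生成部分从略):

```python
# 极简示意:TR-ICRL 中的伪标签与代理奖励构造
# (假设性实现,仅覆盖“多数投票 → 代理奖励”这一步)
from collections import Counter

def pseudo_label(candidates):
    """对候选答案做多数投票得到伪标签;平票时取先出现者。"""
    return Counter(candidates).most_common(1)[0][0]

def proxy_reward(answer, candidates):
    """答案与伪标签一致记 1.0,否则记 0.0,作为缺乏真值时的代理奖励。"""
    return 1.0 if answer == pseudo_label(candidates) else 0.0
```

每轮迭代中,该代理奖励驱动模型修正低分候选;最终答案再经一轮多数投票产生。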

[NLP-53] Locally Confident Globally Stuck: The Quality-Exploration Dilemma in Diffusion Language Models

【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在随机解码过程中面临的质量-探索权衡困境(quality–exploration dilemma):低置信度重掩码(low-confidence remasking)虽能提升单样本生成质量(如Pass@1),但会抑制序列分布的熵,限制多样本采样的增益(如Pass@k)。解决方案的关键在于提出一个统一理论框架,证明低置信度重掩码仅优化局部质量代理指标,同时约束了序列分布的熵;进而定义了一个显式平衡质量与探索能力的最优分布,并设计了一种基于独立马尔可夫链蒙特卡洛(Independent Metropolis–Hastings)的采样器,在解码过程中近似地逼近该分布,从而在多个推理基准测试(如MATH500、AIME24/25、HumanEval和MBPP)上实现更优的质量-探索权衡。

链接: https://arxiv.org/abs/2604.00375
作者: Liancheng Fang,Aiwei Liu,Henry Peng Zou,Yankai Chen,Enze Ma,Leyi Pan,Chunyu Miao,Wei-Chieh Huang,Xue Liu,Philip S. Yu
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Tsinghua University (清华大学); MBZUAI; McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of reasoning paths than autoregressive (AR) LLMs. In practice, however, random-order decoding often hurts generation quality. To mitigate this, low-confidence remasking improves single-sample quality (e.g., Pass@1) by prioritizing confident tokens, but it also suppresses exploration and limits multi-sample gains (e.g., Pass@k), creating a fundamental quality–exploration dilemma. In this paper, we provide a unified explanation of this dilemma. We show that low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution. To overcome this limitation, we characterize the optimal distribution that explicitly balances quality and exploration, and develop a simple Independent Metropolis–Hastings sampler that approximately targets this distribution during decoding. Experiments across a range of reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that our approach yields a better exploration-quality tradeoff than both random and low-confidence remasking.
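
独立 Metropolis–Hastings 采样器的接受率只依赖目标分布 p 与独立提议分布 q 在新旧状态上的比值。下面给出通用的接受步骤示意(假设性实现,与论文在 dLLM 解码中的具体嵌入方式无关):

```python
# 极简示意:独立 Metropolis–Hastings 的接受步骤
# 接受率 α = min(1, p(x')q(x) / (p(x)q(x'))),其中 q 与当前状态无关
# (假设性实现;论文中 p 为平衡质量与探索的目标分布)

def imh_accept_prob(p_cur, p_new, q_cur, q_new):
    """p_*: 目标分布下的(可未归一化)概率;q_*: 独立提议分布下的概率。"""
    return min(1.0, (p_new * q_cur) / (p_cur * q_new))

def imh_step(cur, proposal, p, q, u):
    """u 为 [0,1) 均匀随机数;接受则转移到 proposal,否则停留在 cur。"""
    if u < imh_accept_prob(p(cur), p(proposal), q(cur), q(proposal)):
        return proposal
    return cur
```

直觉上,提议分布(如低置信度重掩码解码)负责给出高质量候选,而接受率校正则把样本拉回到保留更多熵的目标分布。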

[NLP-54] Signals: Trajectory Sampling and Triage for Agentic Interactions

【速读】: 该论文旨在解决大规模部署的基于大语言模型的智能体(Agent)系统在后部署阶段难以高效优化的问题。由于智能体交互轨迹具有体积庞大且非确定性的特点,传统的人工或辅助大模型逐条审查方式成本高、效率低。其解决方案的关键在于提出一种轻量级、基于信号(signal-based)的轨迹筛选框架:通过计算无需调用模型即可快速获取的通用信号(如交互错位、停滞、脱离、满意度、执行失败、循环、环境耗尽等),将这些信号作为结构化属性附加到轨迹中,从而实现对潜在高信息量轨迹的高效识别与采样,而不会影响在线智能体行为。实验表明,该方法在τ-bench基准上相比启发式过滤和随机采样能显著提升信息密度(82% vs. 74% 和 54%),并带来1.52倍的每条有效轨迹效率增益,验证了信号机制可作为实用的采样基础设施支持智能体系统的持续优化与偏好数据构建。

链接: https://arxiv.org/abs/2604.00356
作者: Shuguang Chen,Adil Hafeez,Salman Paracha
机构: DigitalOcean Holdings, Inc.
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Agentic applications based on large language models increasingly rely on multi-step interaction loops involving planning, action execution, and environment feedback. While such systems are now deployed at scale, improving them post-deployment remains challenging. Agent trajectories are voluminous and non-deterministic, and reviewing each one, whether through human review or auxiliary LLMs, is slow and cost-prohibitive. We propose a lightweight, signal-based framework for triaging agentic interaction trajectories. Our approach computes cheap, broadly applicable signals from live interactions and attaches them as structured attributes for trajectory triage, identifying interactions likely to be informative without affecting online agent behavior. We organize signals into a coarse-grained taxonomy spanning interaction (misalignment, stagnation, disengagement, satisfaction), execution (failure, loop), and environment (exhaustion), designed for computation without model calls. In a controlled annotation study on τ-bench, a widely used benchmark for tool-augmented agent evaluation, we show that signal-based sampling achieves an 82% informativeness rate compared to 74% for heuristic filtering and 54% for random sampling, with a 1.52x efficiency gain per informative trajectory. The advantage is robust across reward strata and task domains, confirming that signals provide genuine per-trajectory informativeness gains rather than merely oversampling obvious failures. These results show that lightweight signals can serve as practical sampling infrastructure for agentic systems, and suggest a path toward preference data construction and post-deployment optimization.
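
这类信号的关键在于无需调用模型即可计算。下面以 loop(连续重复动作)与 stagnation(状态停滞)两个信号为例给出假设性示意(信号定义、阈值与轨迹字段均为示意,非论文原始实现):

```python
# 极简示意:从轨迹中计算廉价信号并作为结构化属性附加
# (假设性实现:trajectory 含 "actions" 与 "states" 两个字段)

def loop_signal(items, window=2):
    """检测序列中是否出现连续 window 个完全相同的元素。"""
    run = 1
    for a, b in zip(items, items[1:]):
        run = run + 1 if a == b else 1
        if run >= window:
            return True
    return False

def stagnation_signal(states, window=3):
    """环境状态连续 window 步未变化,视为停滞。"""
    return loop_signal(states, window)

def triage(trajectory):
    """把信号附加为结构化属性,供后续采样阶段按信息量筛选。"""
    return {**trajectory,
            "loop": loop_signal(trajectory["actions"]),
            "stagnation": stagnation_signal(trajectory["states"])}
```

采样时只需按这些布尔/数值属性过滤,即可优先送审“可能有信息量”的轨迹,而不影响线上智能体行为。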

[NLP-55] Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

【速读】: 该论文旨在解决多智能体协作中如何有效选择和连接智能体以完成复杂任务的问题,即“拓扑选择”(topology selection)问题。其核心挑战在于设计一种可学习的、去中心化的通信结构,使多个智能体在执行过程中能够动态调整彼此间的交互关系,从而提升任务准确率并降低资源消耗。解决方案的关键在于提出Agent Q-Mix框架,该框架将拓扑选择建模为一个合作式多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)问题,采用QMIX值分解机制实现去中心化决策,并结合拓扑感知的图神经网络(GNN)编码器、GRU记忆模块及每个智能体独立的Q-head,在集中训练、分散执行(Centralized Training with Decentralized Execution, CTDE)范式下优化一个兼顾任务准确率与Token成本的奖励函数,从而实现高效且鲁棒的多智能体协同推理。

链接: https://arxiv.org/abs/2604.00344
作者: Eric Hanchen Jiang,Levina Li,Rui Sun,Xiao Liang,Yubei Li,Yuchen Wu,Haozheng Luo,Hengli Li,Zhi Zhang,Zhaolu Kang,Kai-Wei Chang,Ying Nian Wu
机构: University of California Los Angeles (加州大学洛杉矶分校); University of Washington (华盛顿大学); Northwestern University (西北大学)
类目: Computation and Language (cs.CL); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity’s Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
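
QMIX 值分解的要点是:混合函数对每个智能体的 Q 值单调不减,使集中训练得到的 Q_tot 与分散执行时各智能体独立 argmax 的动作保持一致。以下为去掉神经网络细节后的假设性骨架(真实 QMIX 的混合权重由以全局状态为输入的超网络生成):

```python
# 极简示意:QMIX 式单调值分解
# 权重取绝对值保证 ∂Q_tot/∂Q_i ≥ 0,即混合对每个 Q_i 单调不减
# (假设性实现,省略超网络与训练细节)

def qmix_total(agent_qs, weights, bias=0.0):
    """Q_tot = Σ |w_i| * Q_i + b。"""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias

def greedy_actions(per_agent_qs):
    """分散执行:每个智能体独立对自己的动作 Q 值取 argmax
    (在 Agent Q-Mix 中对应选择通信动作)。"""
    return [max(range(len(qs)), key=qs.__getitem__) for qs in per_agent_qs]
```

单调性保证了“各自贪心”与“联合最优”的一致,这正是 CTDE 范式下分散选通信拓扑仍可行的原因。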

[NLP-56] Large Language Models in the Abuse Detection Pipeline

【速读】: 该论文旨在解决在线滥用行为(online abuse)检测中传统机器学习方法在应对复杂、动态威胁模式和政策要求时的局限性问题,例如静态分类器难以适应新型恶意行为、人工标注成本高且效率低等。其核心解决方案是引入大语言模型(Large Language Models, LLMs),利用其强大的上下文推理能力、政策理解能力、解释生成能力和跨模态理解能力,系统性地赋能滥用检测生命周期(Abuse Detection Lifecycle, ADL)的四个关键阶段:标签与特征生成、检测、审查与申诉、审计与治理。LLMs通过提供更灵活、可解释且具备泛化能力的分析机制,显著提升了安全系统的响应速度与准确性,同时推动了从技术实现到治理规范的闭环演进。

链接: https://arxiv.org/abs/2604.00323
作者: Suraj Kath,Sanket Badhe,Preet Shah,Ashwin Sampathkumar,Shivani Gupta
机构: Google LLC(谷歌)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Online abuse has grown increasingly complex, spanning toxic language, harassment, manipulation, and fraudulent behavior. Traditional machine-learning approaches dependent on static classifiers and labor-intensive labeling struggle to keep pace with evolving threat patterns and nuanced policy requirements. Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems. This survey provides a lifecycle-oriented analysis of how LLMs are being integrated into the Abuse Detection Lifecycle (ADL), which we define across four stages: (I) Label & Feature Generation, (II) Detection, (III) Review & Appeals, and (IV) Auditing & Governance. For each stage, we synthesize emerging research and industry practices, highlight architectural considerations for production deployment, and examine the strengths and limitations of LLM-driven approaches. We conclude by outlining key challenges including latency, cost-efficiency, determinism, adversarial robustness, and fairness, and discuss future research directions needed to operationalize LLMs as reliable, accountable components of large-scale abuse-detection and governance systems.

[NLP-57] Asymmetric Actor-Critic for Multi-turn LLM Agents

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话中难以保证可靠行为的问题,尤其是在无法重试的一次性任务场景下。现有方法或依赖于反思(reflection)或事后评估(post-hoc evaluation),需额外尝试;或假设模型可完全训练,无法利用现成的专有大模型。其解决方案的关键在于提出一种不对称的演员-评论家(actor-critic)框架:由一个强大的专有LLM作为固定演员(actor)负责生成响应,同时引入一个轻量级开源评论家(critic)在运行时提供监督,实时监控并干预当前交互轨迹中的动作。该设计利用了“生成-验证不对称性”——高质量生成需要大规模模型,而有效监督可用小型模型实现,并通过无需修改演员的自动化数据生成管道为评论家提供训练信号,从而在不改变演员的前提下显著提升对话可靠性与任务成功率。

链接: https://arxiv.org/abs/2604.00304
作者: Shuli Jiang,Zhaoyang Zhang,Yi Zhang,Shuo Yang,Wei Xia,Stefano Soatto
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor’s actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on τ-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.
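
该框架在运行时的交互模式可概括为:actor 提议动作,critic 打分,分数低于阈值则在同一轨迹内介入要求重写。下面是这一控制流的假设性示意(propose/critique/revise 与阈值均为示意接口,非论文原始实现):

```python
# 极简示意:不对称 actor-critic 的运行时监督
# (假设性实现:固定的大模型 actor 提议,轻量 critic 评分并按需介入)

def supervised_step(propose, critique, revise, obs, threshold=0.5):
    """propose/critique/revise 为可调用对象,分别对应 actor 生成、
    critic 评分、带 critic 反馈的重写;返回最终采用的动作。"""
    action = propose(obs)
    score = critique(obs, action)
    if score < threshold:
        # critic 介入:在同一轨迹内修正,而非事后重试
        action = revise(obs, action)
    return action
```

与反思或事后评估不同,这里的干预发生在一次性交互内部,因而适用于不允许重试的场景。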

[NLP-58] Frege in the Flesh: Biolinguistics and the Neural Enforcement of Syntactic Structures

【速读】: 该论文试图解决的问题是:如何从生物学角度理解人类语言的本质及其演化机制,即语言作为人类内在认知能力(而非文化习得产物)的生物基础。其解决方案的关键在于提出并论证“生成式语法中的MERGE操作”作为语言计算系统的核心机制,这一数学代数模型不仅揭示了语言结构的自然本质,还为生物学家、遗传学家和神经科学家提供了可操作的研究路径,通过明确语法的形式化特征来约束神经机制的候选模型,并推动神经计算研究将理论约束转化为可检验的假设。

链接: https://arxiv.org/abs/2604.00291
作者: Elliot Murphy
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Biolinguistics is the interdisciplinary scientific study of the biological foundations, evolution, and genetic basis of human language. It treats language as an innate biological organ or faculty of the mind, rather than a cultural tool, and it challenges a behaviorist conception of human language acquisition as being based on stimulus-response associations. Extracting its most essential component, it takes seriously the idea that mathematical, algebraic models of language capture something natural about the world. The syntactic structure-building operation of MERGE is thought to offer the scientific community a “real joint of nature”, “a (new) aspect of nature” (Mukherji 2010), not merely a formal artefact. This mathematical theory of language is then seen as being able to offer biologists, geneticists and neuroscientists clearer instructions for how to explore language. The argument of this chapter proceeds in four steps. First, I clarify the object of inquiry for biolinguistics: not speech, communication, or generic sequence processing, but the internal computational system that generates hierarchically structured expressions. Second, I argue that this formal characterization matters for evolutionary explanation, because different conceptions of syntax imply different standards of what must be explained. Third, I suggest that a sufficiently explicit algebraic account of syntax places non-trivial constraints on candidate neural mechanisms. Finally, I consider how recent neurocomputational work begins to transform these constraints into empirically tractable hypotheses, while also noting the speculative and revisable character of the present program.

[NLP-59] Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

【速读】: 该论文旨在解决生成式 AI(Generative AI)在医疗问答(Medical QA)任务中,通过自我反思(self-reflective)提示是否能有效提升模型推理的准确性和可靠性这一关键问题。其解决方案的关键在于设计并实施一种迭代式自我反思机制,对比标准链式思维(Chain-of-Thought, CoT)提示与包含多轮自我批判与修正的自省循环在三个主流医学QA基准(MedQA、HeadQA 和 PubMedQA)上的表现,从而系统评估自我反思对错误修正、错误持续或新错误引入的影响。研究发现,自我反思并非普适有效的改进策略,其效果高度依赖于数据集和模型类型,且增加反思步骤并不必然提升性能,揭示了推理透明性与推理正确性之间的显著差距。

链接: https://arxiv.org/abs/2604.00261
作者: Zaifu Zhan,Mengyuan Cui,Rui Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.
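
论文按反思步骤追踪预测演化,并把每个样本的变化归为纠错、引入新错误或错误持续。下面是这一归类口径的极简示意(假设性实现,仅用于说明分析逻辑):

```python
# 极简示意:比较反思前后的预测与正确答案,归类自我反思的效果
# (假设性实现,类别命名为示意)

def classify_transition(initial, final, gold):
    """initial/final: 反思前后的预测;gold: 正确答案。"""
    if initial != gold and final == gold:
        return "error_corrected"      # 反思纠正了错误
    if initial == gold and final != gold:
        return "error_introduced"     # 反思引入了新错误
    if initial != gold and final != gold:
        return "error_persisted"      # 错误持续
    return "correct_throughout"       # 始终正确
```

对数据集逐样本统计这四类的占比,即可判断自我反思是净收益还是净损失,这正是论文结论依赖的分析方式。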

[NLP-60] LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias

【速读】: 该论文旨在解决生成式 AI(Generative AI)在教育评估中与人类评分一致性不足的问题,特别是针对指令微调的大语言模型(instruction-tuned LLMs)在整体评分(holistic scoring)和分析性评分(analytic scoring)上的表现差异。研究发现,尽管模型在整体评分上能达到中等到较高的一致性(如Quadratic Weighted Kappa约为0.6),但在分析性评分中存在显著且稳定的负向偏差(negative directional bias),尤其体现在低阶关注点(Lower-Order Concern, LOC)如语法和标点等方面,即模型倾向于比人类评分者更严格地打分。解决方案的关键在于:通过小规模人工标注的偏差估计集(bias-estimation sets)来量化并校正系统性偏差,而非直接使用零样本输出;这种方法无需大规模微调即可实现有效校准,从而提升模型在教育场景中的可信赖性和实用性。

链接: https://arxiv.org/abs/2604.00259
作者: Filip J. Kucia,Anirban Chakraborty,Anna Wróblewska
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring. We analyze agreement with human consensus scores, directional bias, and the stability of bias estimates. Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring. In particular, we observe large and stable negative directional bias on Lower-Order Concern (LOC) traits, such as Grammar and Conventions, meaning that models often score these traits more harshly than human raters. We also find that concise keyword-based prompts generally outperform longer rubric-style prompts in multi-trait analytic scoring. To quantify the amount of data needed to detect these systematic deviations, we compute the minimum sample size at which a 95% bootstrap confidence interval for the mean bias excludes zero. This analysis shows that LOC bias is often detectable with very small validation sets, whereas Higher-Order Concern (HOC) traits typically require much larger samples. These findings support a bias-correction-first deployment strategy: instead of relying on raw zero-shot scores, systematic score offsets can be estimated and corrected using small human-labeled bias-estimation sets, without requiring large-scale fine-tuning.
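
文中建议的“先校偏再部署”策略,核心只是用小规模人工标注集估计方向性偏差,并在部署时从模型分数中扣除。以下为假设性示意(论文还用 bootstrap 置信区间判断偏差在给定样本量下是否可检出,此处从略):

```python
# 极简示意:用小规模标注集估计 LLM 评分的方向性偏差并校正
# (假设性实现;负偏差表示模型比人类评分者打分更严)

def mean_bias(llm_scores, human_scores):
    """方向性偏差 = mean(模型分数 - 人类共识分数)。"""
    diffs = [m - h for m, h in zip(llm_scores, human_scores)]
    return sum(diffs) / len(diffs)

def correct(llm_score, bias):
    """部署时从原始模型分数中减去估计偏差。"""
    return llm_score - bias
```

按论文结论,LOC 维度(如 Grammar、Conventions)的负偏差稳定且用很小的验证集即可估出,因此这种轻量校正比大规模微调更划算。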

[NLP-61] REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context

【Quick Read】: This paper addresses the limitation that current automated peer-review systems rely mainly on manuscript text, underusing visual elements such as figures as well as external scholarly signals, which hurts the contextual grounding and quality of generated reviews. The key is REM-CTX, a reinforcement-learning framework that integrates auxiliary context (figures, external literature) into review generation through correspondence-aware reward functions. Concretely, an 8B-parameter language model is trained with Group Relative Policy Optimization (GRPO), combining a multi-aspect quality reward with two dedicated correspondence rewards that align the review with the auxiliary context. Experiments across several scientific domains show the method outperforms six baselines, surpassing substantially larger commercial models on contextual-grounding metrics in particular.

Link: https://arxiv.org/abs/2604.00248
Authors: Pawin Taechoyotin, Daniel E. Acuna
Institutions: University of Colorado Boulder
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures

Abstract:Most automated peer review systems rely on textual manuscript content alone, leaving visual elements such as figures and external scholarly signals underutilized. We introduce REM-CTX, a reinforcement-learning system that incorporates auxiliary context into the review generation process via correspondence-aware reward functions. REM-CTX trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO) and combines a multi-aspect quality reward with two correspondence rewards that explicitly encourage alignment with auxiliary context. Experiments on manuscripts across Computer, Biological, and Physical Sciences show that REM-CTX achieves the highest overall review quality among six baselines, outperforming other systems with substantially larger commercial models, and surpassing the next-best RL baseline across both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary: each selectively improves its targeted correspondence reward while preserving all quality dimensions, and the full model outperforms all partial variants. Analysis of training dynamics reveals that the criticism aspect is negatively correlated with other metrics during training, suggesting that future studies should group multi-dimension rewards for review generation.

[NLP-62] A Taxonomy of Programming Languages for Code Generation

【Quick Read】: This paper targets the uneven distribution of resources across programming languages (PLs): most LLM training corpora are heavily concentrated in a few high-resource PLs, while many low-resource PLs lack meaningful support. To fill this gap, the authors present the first reproducible PL resource taxonomy, grouping 646 languages into four tiers (Tier 0-3). Token shares across seven major code corpora reveal extreme, systematic inequality: just 1.9% of languages (Tier 3, High) account for 74.6% of tokens, while 71.7% (Tier 0, Scarce) contribute only 1.0%. The key contribution is a structured, empirically grounded resource-tiering framework that provides an actionable basis for tier-aware evaluation of multilingual LLMs and for dataset curation.

Link: https://arxiv.org/abs/2604.00239
Authors: Nishat Raihan, Christian Newman, Marcos Zampieri
Institutions: George Mason University, USA; Rochester Institute of Technology, USA
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The world’s 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.

[NLP-63] Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries

【Quick Read】: This paper asks whether large language models (LLMs) can accurately predict their own refusal behavior before responding to potentially harmful requests, i.e., whether they possess introspective awareness of their safety decisions. In a systematic design, models first predict whether they would refuse a given request and then actually respond in a fresh context, so their introspective sensitivity can be measured. The key is quantifying prediction accuracy with signal detection theory (SDT): all tested models show high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries. Crucially, confidence provides an actionable signal: restricting to high-confidence predictions yields 98.3% accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.

Link: https://arxiv.org/abs/2604.00228
Authors: Tanay Gondil
Institutions: Purdue University
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 5 figures

Abstract:Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models first predict their refusal behavior, then respond in a fresh context. Across 3754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d’ = 2.4-3.5), but sensitivity drops substantially at safety boundaries. We observe generational improvement within Claude (Sonnet 4.5: 95.7 percent accuracy vs Sonnet 4: 93.0 percent), while GPT-5.2 shows lower accuracy (88.9 percent) with more variable behavior. Llama 405B achieves high sensitivity but exhibits strong refusal bias and poor calibration, resulting in lower overall accuracy (80.0 percent). Topic-wise analysis reveals weapons-related queries are consistently hardest for introspection. Critically, confidence scores provide actionable signal: restricting to high-confidence predictions yields 98.3 percent accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.
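
The sensitivity index d' from signal detection theory, which the abstract uses to quantify introspective accuracy, is computable from hit and false-alarm counts. A minimal stdlib sketch; the counts are hypothetical, not the paper's data.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    The log-linear correction (add 0.5 per cell) keeps the inverse
    normal CDF finite when a rate would otherwise be exactly 0 or 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts: the model correctly predicts 95 of 100 refusals
# and false-alarms on 5 of 100 non-refusals.
print(round(d_prime(95, 5, 5, 95), 2))  # high sensitivity, in the d' range the paper reports
```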

[NLP-64] Do LLM s Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

【Quick Read】: This paper addresses the fact that LLMs frequently violate contextual privacy in high-stakes settings, disclosing private information where discretion is expected, even though they may internally encode contextual privacy norms. Probing shows that the three norm-determining parameters of contextual integrity (CI) theory, namely information type, recipient, and transmission principle, are encoded as linearly separable and functionally independent directions in activation space, so models do carry structured representations of contextual privacy. Violations persist because of a gap between these representations and actual behavior. The key solution is CI-parametric steering, which intervenes independently along each CI dimension; this structured control reduces privacy violations more effectively and predictably than monolithic steering, showing that privacy failures stem from representation-behavior misalignment rather than missing awareness and pointing to ways of improving contextual privacy understanding in LLMs.

Link: https://arxiv.org/abs/2604.00209
Authors: Haoran Wang, Li Xiong, Kai Shu
Institutions: Emory University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvement of contextual privacy understanding in LLMs.
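
Steering along independent concept directions in activation space, as the abstract describes for the three CI parameters, amounts to adding scaled direction vectors to a hidden state. A toy sketch; the directions and activation values below are illustrative, not learned from any model.

```python
def steer(activation, directions, alphas):
    """h' = h + sum_i alpha_i * v_i: shift an activation along each
    concept direction independently (one alpha per CI parameter)."""
    out = list(activation)
    for v, a in zip(directions, alphas):
        out = [h + a * vi for h, vi in zip(out, v)]
    return out

h = [0.2, -0.1, 0.5]                # toy activation vector
ci_directions = [                   # hypothetical unit directions
    [1.0, 0.0, 0.0],                # information type
    [0.0, 1.0, 0.0],                # recipient
    [0.0, 0.0, 1.0],                # transmission principle
]
# Strengthen "information type", leave "recipient", dampen "transmission".
print(steer(h, ci_directions, [0.5, 0.0, -0.2]))
```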

[NLP-65] Polish phonology and morphology through the lens of distributional semantics

【Quick Read】: This paper investigates whether the phonological and morphological structure of Polish words is mirrored in their semantics, asking in particular whether words containing consonant clusters reflect their form properties in semantic space. Using distributional semantics together with statistical and computational techniques such as t-SNE, Linear Discriminant Analysis, and Linear Discriminative Learning, the study finds that semantic vectors encode not only rich morphosyntactic information (tense, number, case, etc.) but also sub-lexical units such as phoneme strings. It further shows that a discriminative lexicon model built on semantic embeddings yields highly accurate predictions for comprehension and production, precisely because semantic space contains extensive structure that is largely isomorphic with structure in the form space.

Link: https://arxiv.org/abs/2604.00174
Authors: Paula Orzechowska, R. Harald Baayen
Institutions: Adam Mickiewicz University; University of Tübingen
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the form properties of words containing consonant clusters and their meanings. Is the phonological and morphonological structure of complex words mirrored in semantic space? We address these questions for Polish, a language characterized by non-trivial morphology and an impressive inventory of morphologically-motivated consonant clusters. We use statistical and computational techniques, such as t-SNE, Linear Discriminant Analysis and Linear Discriminative Learning, and demonstrate that – apart from encoding rich morphosyntactic information (e.g. tense, number, case) – semantic vectors capture information on sub-lexical linguistic units such as phoneme strings. First, phonotactic complexity, morphotactic transparency, and a wide range of morphosyntactic categories available in Polish (case, gender, aspect, tense, number) can be predicted from embeddings without requiring any information about the forms of words. Second, we argue that computational modelling with the discriminative lexicon model using embeddings can provide highly accurate predictions for comprehension and production, exactly because of the existence of extensive information in semantic space that is to a considerable extent isomorphic with structure in the form space.

[NLP-66] ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

【Quick Read】: This paper addresses dynamic cost-quality trade-offs in multi-model routing for production LLM serving, where static routing struggles under non-stationarity: providers revise prices, model quality can regress silently, and new models must be hot-swapped in without downtime. The key is ParetoBandit, an adaptive router built on cost-aware contextual bandits with three mechanisms: (1) an online primal-dual budget pacer that enforces a per-request dollar cost ceiling through closed-loop control; (2) geometric forgetting on sufficient statistics, enabling rapid adaptation to price and quality shifts while bootstrapping from offline priors for cold starts; and (3) a hot-swap registry that lets models be added or removed at runtime, with a brief forced-exploration phase to locate each newcomer's quality-cost niche on the Pareto frontier, so models can be rotated in without interruption. Experiments show that mean per-request cost stays within 0.4% of the target across a range of budget ceilings, with routing adapting automatically when conditions shift, at low inference-time overhead.

Link: https://arxiv.org/abs/2604.00136
Authors: Annette Taberner-Miller
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 27 pages, 15 figures, 13 tables. Code available at this https URL

Abstract:Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8ms on CPU – less than 0.4% of typical inference time – with the routing decision itself taking just 22.5us.
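
The online primal-dual budget pacer can be illustrated with a toy routing loop: a dual price lam penalizes cost in the routing objective and is adjusted after each request depending on whether it overspent the per-request budget. A minimal sketch in the spirit of the abstract; the model names, prices, quality estimates, and step size are all hypothetical.

```python
def route(models, lam):
    """Pick the model maximizing estimated quality minus lam * cost."""
    return max(models, key=lambda m: m["quality"] - lam * m["cost"])

def pace(models, budget, n_requests, eta=0.05):
    """Online primal-dual pacing: raise the dual price lam after a
    request that overspends the per-request budget, relax it otherwise."""
    lam, spent = 0.0, 0.0
    for _ in range(n_requests):
        m = route(models, lam)
        spent += m["cost"]
        lam = max(0.0, lam + eta * (m["cost"] - budget))
    return spent / n_requests   # realized mean cost per request

MODELS = [  # hypothetical portfolio, not the paper's models or prices
    {"name": "small", "cost": 0.1, "quality": 0.6},
    {"name": "large", "cost": 3.0, "quality": 0.9},
]
print(pace(MODELS, budget=0.5, n_requests=1000))  # settles near the 0.5 budget
```

The dual update makes the router alternate between the cheap and expensive model at a mix whose long-run mean cost tracks the budget, which is the closed-loop behavior the abstract describes.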

[NLP-67] Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation

【Quick Read】: This paper targets the high interference and latency caused by "always-on" retrieval over flat memory storage in long-horizon LLM agents, which lack the human-like ability to optimize memory through selective forgetting. The key idea of Oblivion is to cast forgetting as decay-driven reductions in accessibility rather than explicit deletion, and to decouple memory control into a read path and a write path: the read path decides when to consult memory based on agent uncertainty and memory-buffer sufficiency, avoiding redundant queries, while the write path strengthens memories that contributed to the generated response. Together these yield a hierarchical memory organization that keeps high-level strategies persistent while loading details on demand, letting the agent balance learning and forgetting under shifting contexts and markedly improving long-horizon reasoning efficiency.

Link: https://arxiv.org/abs/2604.00131
Authors: Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence
Institutions: NEC Laboratories Europe; GWDG, Georg-August-Universität Göttingen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, and 4 tables

Abstract:Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory-augmented LLM agents rely on “always-on” retrieval and “flat” memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay-driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always-on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high-level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long-horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM-agentic reasoning. The source code is available at this https URL.
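
Decay-driven accessibility with reinforcement, the core of Oblivion's forgetting model, can be sketched as a score that decays each step and is restored for memories that contributed to a response, with nothing ever deleted. The decay rate, threshold, and memory names below are illustrative, not the paper's parameters.

```python
class MemoryStore:
    """Toy decay-driven memory: accessibility decays every step and is
    reinforced for memories that contributed to the response; items are
    never deleted, low-accessibility ones just stop being read."""

    def __init__(self, decay=0.9, threshold=0.3):
        self.decay, self.threshold = decay, threshold
        self.access = {}   # memory id -> accessibility in [0, 1]

    def write(self, mem_id):
        self.access[mem_id] = 1.0

    def step(self, reinforced=()):
        for mid in self.access:
            self.access[mid] *= self.decay
        for mid in reinforced:     # write path: strengthen useful memories
            self.access[mid] += 1.0 - self.access[mid]

    def readable(self):            # read path: only accessible memories
        return {m for m, s in self.access.items() if s >= self.threshold}

store = MemoryStore()
store.write("strategy"); store.write("detail")
for _ in range(15):
    store.step(reinforced=["strategy"])   # the strategy keeps being useful
print(sorted(store.readable()))           # the unused detail has faded out
```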

[NLP-68] Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency

【Quick Read】: This paper addresses the redundancy and suboptimal performance of conventional Chain-of-Thought (CoT) prompting, whose unstructured, flat reasoning chains handle complex multi-step reasoning poorly. The key is Hierarchical Chain-of-Thought (Hi-CoT) prompting, which alternates between instructional planning and step-by-step execution to decompose reasoning into hierarchical substeps, helping LLMs manage long reasoning horizons and stay logically coherent. Across multiple LLMs and mathematical reasoning benchmarks, Hi-CoT improves average accuracy by 6.2% (up to 61.4% on some models and tasks) while shortening reasoning traces by 13.9% relative to CoT, with accuracy and efficiency maximized when models strictly adhere to the hierarchical structure.

Link: https://arxiv.org/abs/2604.00130
Authors: Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi
Institutions: Huawei Technologies Canada; Huawei Noah's Ark Lab
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain-of-Thought (Hi-CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi-step reasoning. Hi-CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step-by-step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at this https URL.
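
The alternation of instructional planning and step-by-step execution can be sketched as a prompt-assembly helper. This is a hypothetical template in the spirit of Hi-CoT, not the paper's actual prompt format.

```python
def hi_cot_prompt(question, plan_steps):
    """Build a hierarchical prompt: first an instructional plan,
    then an execution scaffold with one slot per substep."""
    lines = [f"Question: {question}", "", "Plan:"]
    lines += [f"  {i}. {s}" for i, s in enumerate(plan_steps, 1)]
    lines += ["", "Now execute the plan one substep at a time:"]
    lines += [f"Step {i} ({s}):" for i, s in enumerate(plan_steps, 1)]
    return "\n".join(lines)

prompt = hi_cot_prompt(
    "What is 23 * 17?",
    ["decompose the product", "compute partial products", "sum the parts"],
)
print(prompt)
```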

[NLP-69] Hierarchical Pre-Training of Vision Encoders with Large Language Models CVPR

【Quick Read】: This paper addresses the limitation that existing vision-language models integrate the vision encoder and the large language model (LLM) as independent modules, which restricts the fusion of hierarchical visual features and hurts contextual quality. The key is HIVE (Hierarchical Pre-Training of Vision Encoders), which introduces hierarchical cross-attention between the vision encoder and the LLM so that features from multiple layers are fused in a structured way rather than flattened, improving gradient flow and representation learning; a three-stage training strategy progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion.

Link: https://arxiv.org/abs/2604.00086
Authors: Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee
Institutions: University of Cincinnati; National Yang Ming Chiao Tung University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 14 figures, accepted to Computer Vision and Pattern Recognition Conference (CVPR) Workshops 2026. 5th MMFM Workshop: What is Next in Multimodal Foundation Models?

Abstract:The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.

[NLP-70] Terminal Agents Suffice for Enterprise Automation

【Quick Read】: This paper asks whether complex agentic systems, such as GUI-based web agents or tool-augmented agents built on abstractions like MCP, are actually necessary for enterprise automation given their cost and operational overhead, or whether simpler architectures suffice. The key finding is that a coding agent equipped only with a terminal and a filesystem, interacting directly with platform APIs, matches or outperforms more complex agent architectures across diverse real-world enterprise systems, suggesting that strong foundation models combined with low-level programmatic interfaces are sufficient for practical enterprise automation.

Link: https://arxiv.org/abs/2604.00073
Authors: Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar
Institutions: ServiceNow
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Pre-print. Under review for COLM 2026

Abstract:There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.

[NLP-71] Multi-lingual Multi-institutional Electronic Health Record based Predictive Model

【Quick Read】: This paper tackles the barrier that heterogeneous schemas and code systems, compounded by language differences, pose to cross-institutional learning from multinational intensive care unit (ICU) electronic health records (EHRs). Within a text-based harmonization framework, two strategies for the language barrier are compared: modeling multilingual records directly with multilingual encoders, and using LLM-based word-level translation to convert non-English records into English. Across seven public ICU datasets and ten clinical tasks, translation-based lingual alignment proves the more stable and reliable strategy, clearly outperforming multilingual encoders; the pooled multi-institutional model also beats strong baselines that require manual feature selection and harmonization, as well as single-dataset training. The result is EHR prediction that spans countries and languages without manual standardization, offering a scalable path for future global multi-center EHR research.

Link: https://arxiv.org/abs/2604.00027
Authors: Kyunghoon Hur, Heeyoung Kwak, Jinsu Jang, Nakhwan Kim, Edward Choi
Institutions: Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology; NAVER Digital Healthcare LAB; Department of Health and Medical Information, Ansan University; Institute of Human Behavior and Genetics, Korea University College of Medicine
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under revision; 10 main pages, 3 supplementary pages

Abstract:Large-scale EHR prediction across institutions is hindered by substantial heterogeneity in schemas and code systems. Although Common Data Models (CDMs) can standardize records for multi-institutional learning, the manual harmonization and vocabulary mapping are costly and difficult to scale. Text-based harmonization provides an alternative by converting raw EHR into a unified textual form, enabling pooled learning without explicit standardization. However, applying this paradigm to multi-national datasets introduces an additional layer of heterogeneity, which is “language” that must be addressed for truly scalable EHRs learning. In this work, we investigate multilingual multi-institutional learning for EHR prediction, aiming to enable pooled training across multinational ICU datasets without manual standardization. We compare two practical strategies for handling language barriers: (i) directly modeling multilingual records with multilingual encoders, and (ii) translating non-English records into English via LLM-based word-level translation. Across seven public ICU datasets, ten clinical tasks with multiple prediction windows, translation-based lingual alignment yields more reliable cross-dataset performance than multilingual encoders. The multi-institutional learning model consistently outperforms strong baselines that require manual feature selection and harmonization, and also surpasses single-dataset training. We further demonstrate that text-based framework with lingual alignment effectively performs transfer learning via few-shot fine-tuning, with additional gains. To our knowledge, this is the first study to aggregate multilingual multinational ICU EHR datasets into one predictive model, providing a scalable path toward language-agnostic clinical prediction and future global multi-institutional EHR research.

[NLP-72] “Who Am I and Who Else Is Here?” Behavioral Differentiation Without Role Assignment in Multi-Agent LLM Systems

【Quick Read】: This paper investigates whether multiple LLMs interacting in a shared conversation develop differentiated social roles or converge toward uniform behavior. The core finding is that LLM groups spontaneously develop structured behavioral diversity, driven systematically by architectural heterogeneity, group context, and prompt scaffolding rather than arising at random. The key is a controlled experimental platform that orchestrates multi-turn discussions among 7 heterogeneous LLMs on a unified inference backend while systematically varying group composition, naming conventions, and prompt structure: 208 runs across 12 experimental series (13,786 coded messages in total), with each message independently coded by two LLM judges from distinct model families and validated by humans, achieving reliable behavioral coding (mean Cohen's kappa of 0.73-0.78), which confirms the decisive role of the interaction environment in LLM behavioral differentiation.

Link: https://arxiv.org/abs/2604.00026
Authors: Houssam EL Kandoussi
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 11 figures, 5 tables

Abstract:When multiple large language models interact in a shared conversation, do they develop differentiated social roles or converge toward uniform behavior? We present a controlled experimental platform that orchestrates simultaneous multi-agent discussions among 7 heterogeneous LLMs on a unified inference backend, systematically varying group composition, naming conventions, and prompt structure across 12 experimental series (208 runs, 13,786 coded messages). Each message is independently coded on six behavioral flags by two LLM judges from distinct model families (Gemini 3.1 Pro and Claude Sonnet 4.6), achieving mean Cohen’s kappa = 0.78 with conservative intersection-based adjudication. Human validation on 609 randomly stratified messages confirmed coding reliability (mean kappa = 0.73 vs. Gemini). We find that (1) heterogeneous groups exhibit significantly richer behavioral differentiation than homogeneous groups (cosine similarity 0.56 vs. 0.85; p < 10^-5, r = 0.70); (2) groups spontaneously exhibit compensatory response patterns when an agent crashes; (3) revealing real model names significantly increases behavioral convergence (cosine 0.56 to 0.77, p = 0.001); and (4) removing all prompt scaffolding converges profiles to homogeneous-level similarity (p < 0.001). Critically, these behaviors are absent when agents operate in isolation, confirming that behavioral diversity is a structured, reproducible phenomenon driven by the interaction of architectural heterogeneity, group context, and prompt-level scaffolding.

[NLP-73] Brevity Constraints Reverse Performance Hierarchies in Language Models

【Quick Read】: This paper addresses the counterintuitive performance degradation that appears under standard evaluation protocols, where larger LLMs underperform smaller ones on a subset of benchmark problems. The study traces the phenomenon to spontaneous scale-dependent verbosity: larger models introduce errors through overelaboration. The key remedy is scale-aware prompt engineering: constraining large models to brief responses improves accuracy by 26 percentage points on average and reverses the performance hierarchy outright, with large models gaining 7.7-15.9 percentage-point advantages on mathematical reasoning and scientific knowledge benchmarks, demonstrating that universal prompting masks their superior latent capabilities.

Link: https://arxiv.org/abs/2604.00025
Authors: MD Azizul Hakim
Institutions: Bangladesh Sweden Polytechnic Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models – direct inversions of the original gaps. These reversals prove large models possess superior latent capabilities that universal prompting masks. We validate findings through three independent contamination tests and demonstrate inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.

[NLP-74] WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women's Health Topics

【Quick Read】: This paper addresses the under-evaluation of women's health in current LLM benchmarks, where clinically critical failure modes such as outdated guidelines, unsafe omissions, dosing errors, and equity-related blind spots go undetected. The key is the Women's Health Benchmark (WHBench), a dedicated evaluation suite of 47 expert-crafted scenarios across 10 women's health topics, scored with a 23-criterion rubric spanning clinical accuracy, safety, communication quality, guideline adherence, and more, with safety-weighted penalties and server-side score recalculation to keep evaluation reliable and clinically meaningful. Across 22 models, overall performance remains limited (best model: 72.1%), underscoring the need for expert oversight and systematic improvement in women's health tasks.

Link: https://arxiv.org/abs/2604.00024
Authors: Sneha Maurya, Pragya Saboo, Girish Kumar
Institutions: Columbia University; Rubric AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large language models are increasingly used for medical guidance, but women’s health remains under-evaluated in benchmark design. We present the Women’s Health Benchmark (WHBench), a targeted evaluation suite of 47 expert-crafted scenarios across 10 women’s health topics, designed to expose clinically meaningful failure modes including outdated guidelines, unsafe omissions, dosing errors, and equity-related blind spots. We evaluate 22 models using a 23-criterion rubric spanning clinical accuracy, completeness, safety, communication quality, instruction following, equity, uncertainty handling, and guideline adherence, with safety-weighted penalties and server-side score recalculation. Across 3,102 attempted responses (3,100 scored), no model mean performance exceeds 75 percent; the best model reaches 72.1 percent. Even top models show low fully correct rates and substantial variation in harm rates. Inter-rater reliability is moderate at the response label level but high for model ranking, supporting WHBench utility for comparative system evaluation while highlighting the need for expert oversight in clinical deployment. WHBench provides a public, failure-mode-aware benchmark to track safer and more equitable progress in women's health AI.

[NLP-75] Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon

【Quick Read】: This paper asks whether the many non-conforming basic-vocabulary forms in Sulawesi Austronesian languages derive from a pre-Austronesian substrate or represent independent innovation, a distinction traditional methods struggle to make. The key is combining rule-based cognate subtraction with a machine-learning classifier (XGBoost) trained on 26 phonological features. Non-mainstream forms show a distinct phonological fingerprint: longer forms, more consonant clusters, higher glottal-stop rates, and fewer Austronesian prefixes; cross-method consensus (Cohen's kappa = 0.61) yields 266 high-confidence candidates. However, clustering finds no coherent word families (silhouette = 0.114), providing no evidence for a single pre-Austronesian substrate language. The approach usefully complements traditional comparative linguistics while cautioning against equating phonological non-conformity with a shared substrate.

Link: https://arxiv.org/abs/2604.00023
Authors: Mukhlis Amien, Go Frendi Gunawan
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 31 pages, 4 figures, 5 tables. Submitted to Oceanic Linguistics

Abstract:Basic vocabulary in many Sulawesi Austronesian languages includes forms resisting reconstruction to any proto-form with phonological patterns inconsistent with inherited roots, but whether this non-conforming vocabulary represents pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes. Cross-method consensus (Cohen’s kappa=0.61) identifies 266 high-confidence non-mainstream candidates. However, clustering yields no coherent word families (silhouette=0.114; cross-linguistic cognate test p=0.569), providing no evidence for a single pre-Austronesian language layer. Application to 16 additional languages confirms geographic patterning: Sulawesi languages show higher predicted non-mainstream rates (mean P_sub=0.606) than Western Indonesian languages (0.393). This study demonstrates that phonological machine learning can complement traditional comparative methods in detecting non-mainstream lexical layers, while cautioning against interpreting phonological non-conformity as evidence for a shared substrate language.
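
The kind of surface phonological features the abstract's classifier consumes (form length, consonant-cluster count, glottal stops) can be extracted with simple string processing. A simplified sketch; the segment inventory and example form are illustrative, not the paper's 26-feature set.

```python
VOWELS = set("aeiou")

def phon_features(form):
    """Count a few toy phonological features of an orthographic form."""
    clusters = 0
    run = 0
    for ch in form:
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            if run == 2:      # count each maximal consonant cluster once
                clusters += 1
        else:
            run = 0
    return {
        "length": len(form),
        "clusters": clusters,
        "glottal_stops": form.count("'"),  # apostrophe as glottal stop
    }

print(phon_features("mbara'a"))  # hypothetical form, not from the dataset
```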

[NLP-76] Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce

【速读】: 该论文旨在解决多维评分量表(multi-dimensional rubric)在对话评估中缺乏准则效度(criterion validity)的问题,即现有评分维度是否真正与下游业务结果(如用户转化率)相关联。其解决方案的关键在于通过两阶段实证研究验证一个7维评分量表(由大语言模型作为评判者实现)与真实业务转化之间的关联性,并揭示不同维度对转化的异质性影响:其中“需求挖掘”(Need Elicitation, D1)和“节奏策略”(Pacing Strategy, D3)显著正相关,而“情境记忆”(Contextual Memory, D5)无显著关联;由此导致等权重综合评分存在“复合稀释效应”,通过基于转化数据重新加权可部分纠正该问题(相关系数从0.272提升至0.351),并进一步发现AI代理因缺乏信任构建行为而导致销售效果不佳,最终提出三层评估架构以推动准则效度测试成为对话评估的标准实践。

链接: https://arxiv.org/abs/2604.00022
作者: Liang Chen,Qi Liu,Wenhuan Lin,Feng Liang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity – whether quality scores are associated with the downstream outcomes they are meant to serve – remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory (D5: rho=0.018, n.s.) shows no detectable association. This heterogeneity causes the equal-weighted composite (rho=0.272) to underperform its best dimensions – a composite dilution effect that conversion-informed reweighting partially corrects (rho=0.351). Logistic regression controlling for conversation length confirms D3’s association strengthens (OR=3.18, p=0.006), ruling out a length confound. An initial pilot (n=14) mixing human and AI conversations had produced a misleading “evaluation-outcome paradox,” which Phase 2 revealed as an agent-type confound artifact. Behavioral analysis of 130 conversations through a Trust-Funnel framework identifies a candidate mechanism: AI agents execute sales behaviors without building user trust. We operationalize these findings in a three-layer evaluation architecture and advocate criterion validity testing as standard practice in applied dialogue evaluation.
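文中各维度得分与转化结果的关联即 Spearman 秩相关。下面给出一个自包含的实现示意(含并列值的平均秩处理,数据为虚构):

```python
def average_ranks(xs):
    # 并列值取平均秩
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # 对秩做 Pearson 相关,即得到 Spearman 秩相关系数
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

按此思路,可将各维度对转化的 rho 作为重新加权综合评分的依据,即文中的 conversion-informed reweighting。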

[NLP-77] How Do Language Models Process Ethical Instructions? Deliberation Consistency and Other-Recognition Across Four Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)中伦理指令如何被模型内部处理的问题,即在缺乏可解释机制的情况下,如何理解语言模型对道德规范的响应是否真正内化。其核心问题是:伦理指令是否能引发模型深层的道德推理,还是仅表现为表面合规?解决方案的关键在于提出并验证三个新指标——Deliberation Depth(DD,审议深度)、Value Consistency Across Dilemmas(VCAD,跨困境价值一致性)和Other-Recognition Index(ORI,他人识别指数),通过多模型、多语言、多指令格式的600余次模拟实验,识别出四种不同的伦理处理类型(Output Filter、Defensive Repetition、Critical Internalization、Principled Consistency),并揭示处理能力(DD)与指令格式之间的交互效应:低DD模型不受指令形式影响,而高DD模型中,“reasoned norm”与“virtue framing”产生相反效果。这一发现表明,表层合规性(lexical compliance)与真实伦理处理过程几乎无关(r = -0.161 至 +0.256, p > 0.22),从而为构建真正安全且具道德内化的AI系统提供了结构化认知框架。

链接: https://arxiv.org/abs/2604.00021
作者: Hiroki Fukui
机构: Kyoto University (京都大学); Meta (Meta); OpenAI (OpenAI); Alibaba Cloud (阿里巴巴云); Anthropic (Anthropic)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 34 pages, 7 figures, 4 tables. Preprint. OSF pre-registration: this http URL . Companion paper: arXiv:2603.04904

点击查看摘要

Abstract:Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study (BF_10 > 10 for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics – Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) – revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level (r = -0.161 to +0.256, all p > .22; N = 24; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.

[NLP-78] Detecting Abnormal User Feedback Patterns through Temporal Sentiment Aggregation

【速读】: 该论文旨在解决实时用户反馈中异常事件(如恶意评论攻击或用户满意度骤降)的早期检测问题,传统情感分析方法因无法捕捉短文本中的集体行为变化而效果有限。解决方案的关键在于提出一种时间情感聚合框架(temporal sentiment aggregation framework),利用预训练的Transformer语言模型(如RoBERTa)提取每条评论的情感信号,并将其聚合为时间窗口级别的情感得分,通过显著下降趋势识别潜在异常模式,从而实现对用户反馈动态变化的有效监测与解释。

链接: https://arxiv.org/abs/2604.00020
作者: Yalun Qi,Sichen Zhao,Zhiming Xue,Xianling Zeng,Zihan Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In many real-world applications, such as customer feedback monitoring, brand reputation management, and product health tracking, understanding the temporal dynamics of user sentiment is crucial for early detection of anomalous events such as malicious review campaigns or sudden declines in user satisfaction. Traditional sentiment analysis methods focus on individual text classification, which is insufficient to capture collective behavioral shifts over time due to inherent noise and class imbalance in short user comments. In this work, we propose a temporal sentiment aggregation framework that leverages pretrained transformer-based language models to extract per-comment sentiment signals and aggregates them into time-window-level scores. Significant downward shifts in these aggregated scores are interpreted as potential anomalies in user feedback patterns. We adopt RoBERTa as our core semantic feature extractor and demonstrate, through empirical evaluation on real social media data, that the aggregated sentiment scores reveal meaningful trends and support effective anomaly detection. Experiments on real-world social media data demonstrate that our method successfully identifies statistically significant sentiment drops that correspond to coherent complaint patterns, providing an effective and interpretable solution for feedback anomaly monitoring.
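窗口级情感聚合与下降检测的核心步骤可以用如下示意代码说明。窗口大小、阈值与分数均为假设,实际的逐条情感分数来自 RoBERTa 等模型:

```python
from statistics import mean

def window_scores(comments, window_size):
    # comments: (时间戳秒, 情感分数) 列表,分数假设取值于 [-1, 1]
    buckets = {}
    for ts, score in comments:
        buckets.setdefault(ts // window_size, []).append(score)
    return [mean(buckets[k]) for k in sorted(buckets)]

def flag_drops(scores, threshold=0.5):
    # 当某窗口均值比此前窗口的累计均值低出 threshold,视为潜在异常
    flags = []
    for i in range(1, len(scores)):
        baseline = mean(scores[:i])
        flags.append(baseline - scores[i] > threshold)
    return flags

# 示例:三个 60 秒窗口,最后一个窗口出现显著下降
scores = window_scores(
    [(0, 0.8), (30, 0.6), (70, 0.7), (130, -0.9), (150, -0.7)], 60)
```

实际系统中,阈值可替换为基于历史分布的统计显著性检验(如 z 检验),以对应文中"统计显著的情感下降"。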

[NLP-79] The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation LREC2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本生成中事实性(factuality)评估不足的问题,尤其是针对多语言场景下缺乏系统性、可控的实体数据集来验证LLMs生成内容的真实性。其解决方案的关键在于构建一个可配置的流水线(configurable pipeline),基于Wikipedia和Wikidata数据自动生成具有指定特征(如领域、地理位置和流行度)的多语言实体集合——以RiDiC数据集为例,该数据集包含河流、自然灾害和汽车型号三个领域的3000个实体,每个实体附带中英文名称、地理位置及对应维基百科内容,用于驱动LLMs生成并由第三方事实核查工具进行评估。实验表明,即使前沿模型在这些实体上也出现幻觉现象,证明该方法能有效识别LLMs在多语言长文本生成中的事实错误。

链接: https://arxiv.org/abs/2604.00019
作者: Pavel Braslavski,Dmitrii Iarosh,Nikita Sushko,Andrey Sakhovskiy,Vasily Konovalov,Elena Tutubalina,Alexander Panchenko
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to LREC 2026

点击查看摘要

Abstract:We present a configurable pipeline for generating multilingual sets of entities with specified characteristics, such as domain, geographical location and popularity, using data from Wikipedia and Wikidata. These datasets are intended for evaluating the factuality of LLMs’ long-form generation, thereby complementing evaluation based on short-form QA datasets. We present the RiDiC dataset as an example of this approach. RiDiC contains 3,000 entities from three domains – rivers, natural disasters, and car models – spanning different popularity tiers. Each entity is accompanied by its geographical location, English and Chinese names (if available) and relevant English and Chinese Wikipedia content, which is used to evaluate LLMs’ responses. Generations about RiDiC entities were obtained from three LLMs in English and Chinese. These were then evaluated using a third-party factuality checker, which showed that entities from our dataset caused even frontier models to hallucinate. To facilitate the evaluation of LLMs’ long-form factuality in multiple languages, the code, data, and generation/evaluation scripts have been released.

[NLP-80] Think Twice Before You Write – an Entropy-based Decoding Strategy to Enhance LLM Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因传统解码策略(如贪婪解码、束搜索和采样方法)导致的误差传播与计算效率低下的问题。现有方法要么缺乏鲁棒性,要么难以平衡准确性与资源消耗。其解决方案的关键在于提出一种基于熵引导的自适应解码框架:通过在每一步计算词元分布的熵值,识别高不确定性位置并仅在这些脆弱点进行分支扩展,从而动态维护部分推理路径池,在不确定区域集中计算资源,避免在高置信度区域冗余探索;同时引入 rollout-level Entropy After </think> (EAT) 停止准则,于完整推理轨迹后评估熵值以实现高效终止,显著提升小模型上的推理性能,使其在准确率上媲美GPT-5,但成本仅为后者的一小部分。

链接: https://arxiv.org/abs/2604.00018
作者: Jiashu He,Meizhu Liu,Olaitan P Olaleye,Amit Agarwal,M. Avendi,Yassi Abbasi,Matthew Rowe,Hitesh Laxmichand Patel,Paul Li,Tao Sheng,Sujith Ravi,Dan Roth
机构: Oracle AI Science; University of Pennsylvania
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decoding strategies play a central role in shaping the reasoning ability of large language models (LLMs). Traditional methods such as greedy decoding and beam search often suffer from error propagation, while sampling-based approaches introduce randomness without adequate robustness. Self-consistency improves reliability by aggregating multiple rollouts, but incurs significant computational overhead. We propose an entropy-guided decoding framework that introduces token-level adaptivity into generation. At each step, the model computes the entropy of the token distribution, identifies high-uncertainty positions, and selectively branches on these vulnerable points. A dynamic pool of partial rollouts is maintained and expanded until solutions are completed, concentrating computation where uncertainty is greatest and avoiding unnecessary exploration in confident regions. To enable efficient termination, we apply a rollout-level Entropy After </think> (EAT) stopping criterion by performing entropy evaluation after the full reasoning trace, rather than incrementally at every step. Experiments on GSM8K, AMC2023, and their perturbed variants demonstrate that our method achieves consistently strong accuracy. Notably, on smaller LLMs, performance is comparable to GPT-5 while operating at a fraction of the cost.
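文中逐步熵计算与分支判定的核心可以概括为如下示意(分布与阈值均为演示假设):

```python
import math

def token_entropy(probs):
    # 香农熵:-Σ p·log p(自然对数)
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_branch(probs, threshold=1.0):
    # 熵超过阈值即视为高不确定性位置,触发分支扩展
    return token_entropy(probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]   # 低熵:不分支
uncertain = [0.25, 0.25, 0.25, 0.25]   # 高熵:在此处分支
```

实际框架中,该判定在每个解码步作用于模型的 softmax 分布,并与 rollout 池的维护、EAT 终止准则配合使用。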

[NLP-81] Semantic Shifts of Psychological Concepts in Scientific and Popular Media Discourse: A Distributional Semantics Analysis of Russian-Language Corpora

【速读】: 该论文旨在解决心理概念在科学文献与大众媒体话语之间是否存在语义演变的问题,特别是探讨这些概念如何因传播语境的不同而发生意义变迁。其解决方案的关键在于运用分布语义学(distributional semantics)方法,对俄语语料库进行分析:构建了包含约30万词元的科学语料库和约120万词元的大众科普语料库,通过预处理、频率分析、聚类及语义关联识别等技术手段,量化并比较了关键心理概念(如“倦怠”和“抑郁”)在两类语境中的语义结构差异,从而揭示从专业术语向日常经验化表述的语义迁移现象。

链接: https://arxiv.org/abs/2604.00017
作者: Orlova Anastasia
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This article examines semantic shifts in psychological concepts across scientific and popular media discourse using methods of distributional semantics applied to Russian-language corpora. Two corpora were compiled: a scientific corpus of approximately 300 research articles from the journals Psychology. Journal of the Higher School of Economics and Vestnik of Saint Petersburg University. Psychology (767,543 tokens) and a popular science corpus consisting of texts from the online psychology platforms Yasno and Chistye kogntsii (1,199,150 tokens). After preprocessing (OCR recognition, lemmatization, removal of stop words and non-informative characters), the corpora were analyzed through frequency analysis, clustering, and the identification of semantic associations. The results reveal significant differences in vocabulary and conceptual framing between the two discourse types: scientific texts emphasize methodological and clinical terminology, while popular science materials foreground everyday experience and therapeutic practice. A comparison of semantic associations for key concepts such as burnout and depression shows that scientific discourse links these terms to psychological resources, symptomatology, and diagnostic constructs, whereas popular science discourse frames them through personal narratives, emotions, and everyday situations. These findings demonstrate a clear shift from precise professional terminology toward more generalized and experiential meanings in popular media discourse and confirm the effectiveness of distributional semantics methods for identifying semantic transformations of psychological concepts across different communicative contexts.
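分布语义学比较两类语料中同一概念的语义结构时,最基本的操作之一是对上下文共现向量计算余弦相似度。以下为示意,上下文维度与计数均为虚构:

```python
import math

def cosine(u, v):
    # 余弦相似度:点积除以两向量模长之积
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# 假设 "倦怠" 的上下文计数维度为 [资源, 症状, 诊断, 情绪, 叙事]
scientific = [8, 9, 7, 2, 1]   # 科学语料中的虚构共现计数
popular = [2, 3, 1, 9, 8]      # 科普语料中的虚构共现计数
```

两个向量的低相似度即对应文中所述的语义偏移:同一概念在两类话语中关联到不同的上下文。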

[NLP-82] Are they human? Detecting large language models by probing human memory constraints

【速读】: 该论文试图解决在线行为研究中参与者身份识别的问题,即如何有效区分人类与基于大语言模型(Large Language Models, LLMs)的自动化代理(agent),以保障研究结果的有效性。传统方法依赖于人类可解而机器难以应对的简单挑战,但当前通用型LLM已能轻松完成此类任务,从而削弱了检测可靠性。论文提出的关键解决方案是利用人类认知能力的一个稳定特征——有限的工作记忆容量(limited working memory capacity),设计一种基于标准序列回忆任务的认知建模方法。即使LLM被明确指令模拟人类工作记忆限制,其表现仍因过度优化和缺乏真实认知约束而偏离人类模式,从而可通过计算建模精准识别出非人类参与者。

链接: https://arxiv.org/abs/2604.00016
作者: Simon Schug,Brenden M. Lake
机构: Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code available at this https URL

点击查看摘要

Abstract:The validity of online behavioral research relies on study participants being human rather than machine. In the past, it was possible to detect machines by posing simple challenges that were easily solved by humans but not by machines. General-purpose agents based on large language models (LLMs) can now solve many of these challenges, threatening the validity of online behavioral research. Here we explore the idea of detecting humanness by using tasks that machines can solve too well to be human. Specifically, we probe for the existence of an established human cognitive constraint: limited working memory capacity. We show that cognitive modeling on a standard serial recall task can be used to distinguish online participants from LLMs even when the latter are specifically instructed to mimic human working memory constraints. Our results demonstrate that it is viable to use well-established cognitive phenomena to distinguish LLMs from humans.
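序列回忆任务的逐位置正确率是刻画人类式记忆曲线(首因/近因效应)的基础统计量。以下为一个极简的计分示意,试次数据为虚构:

```python
def position_accuracy(trials):
    # trials: (呈现序列, 回忆序列) 列表,序列等长;返回各位置的正确率。
    # 人类式曲线在中间位置下凹;逐位置近乎全对的"过好"曲线则提示机器作答。
    n = len(trials[0][0])
    correct = [0] * n
    for presented, recalled in trials:
        for i, (p, r) in enumerate(zip(presented, recalled)):
            correct[i] += int(p == r)
    return [c / len(trials) for c in correct]
```

论文在此统计量之上拟合认知模型(工作记忆容量约束),而非直接用阈值判别,这里仅示意数据形态。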

[NLP-83] ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation

【速读】: 该论文旨在解决阿拉伯语科学文本翻译资源匮乏的问题,特别是现有英文-阿拉伯语平行语料库多基于短句或单一领域,难以满足高质量科学翻译评估的需求。解决方案的关键在于构建了一个高质量的多领域科学平行语料库ASCAT,其通过系统化的多引擎翻译(包括生成式AI、基于Transformer的模型及商用机器翻译API)与专家人工验证相结合的流程,确保翻译在词汇、句法和语义层面的准确性;该语料库覆盖物理、数学、计算机科学、量子力学和人工智能五个领域,包含67,293个英文词元和60,026个阿拉伯语词元,具备丰富的形态学特征,可有效支撑科学翻译质量评估与专用模型训练。

链接: https://arxiv.org/abs/2604.00015
作者: Serry Sibaee,Khloud Al Jallad,Zineb Yousfi,Israa Elsayed Elhosiny,Yousra El-Ghawi,Batool Balah,Omer Nacar
机构: Prince Sultan University (王子苏丹大学); SySSR; NAMAA Community; Independent Linguist; Tuwaiq Academy (图瓦克学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present ASCAT (Arabic Scientific Corpus for Advanced Translation), a high-quality English-Arabic parallel benchmark corpus designed for scientific translation evaluation, constructed through a systematic multi-engine translation and human validation pipeline. Unlike existing Arabic-English corpora that rely on short sentences or single-domain text, ASCAT targets full scientific abstracts averaging 141.7 words (English) and 111.78 words (Arabic), drawn from five scientific domains: physics, mathematics, computer science, quantum mechanics, and artificial intelligence. Each abstract was translated using three complementary architectures: generative AI (Gemini), transformer-based models (Hugging Face quickmt-en-ar), and commercial MT APIs (Google Translate, DeepL), and subsequently validated by domain experts at the lexical, syntactic, and semantic levels. The resulting corpus contains 67,293 English tokens and 60,026 Arabic tokens, with an Arabic vocabulary of 17,604 unique words reflecting the morphological richness of the language. We benchmark three state-of-the-art LLMs on the corpus: GPT-4o-mini (BLEU: 37.07), Gemini-3.0-Flash-Preview (BLEU: 30.44), and Qwen3-235B-A22B (BLEU: 23.68), demonstrating its discriminative power as an evaluation benchmark. ASCAT addresses a critical gap in scientific MT resources for Arabic and is designed to support rigorous evaluation of scientific translation quality and training of domain-specific translation models.
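文中报告的 BLEU 分数可用如下极简实现理解其计算方式。注意这只是示意:仅按空格分词、单参考译文、n≤4,并非评测实际使用的标准工具(如 sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    # 修正 n-gram 精确率的几何平均,乘以长度惩罚(brevity penalty)
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c & r).values())  # 截断计数:逐 n-gram 取较小值
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / sum(c.values())))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

短句上 BLEU-4 会退化(高阶 n-gram 稀疏),这也是该语料选用完整摘要而非短句的动机之一。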

[NLP-84] MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

【速读】: 该论文旨在解决多模态情感分析(Multimodal Sentiment Analysis)中模型可解释性不足与强化学习(Reinforcement Learning, RL)训练效率低的问题。现有方法在使用链式思维(Chain-of-Thought, CoT)推理时面临高标注成本,而RL则因探索效率低和奖励稀疏性(尤其在困难样本上)难以有效优化。其解决方案的关键在于提出一种融合结构化“判别-校准”(Discrimination-Calibration, DC)推理机制与基于提示的强化学习(Hint-based Reinforcement Learning)的新训练框架——首先通过教师模型合成高质量CoT数据进行冷启动监督微调(SFT),使学生模型从初始阶段即具备DC推理范式;随后设计Hint-GRPO算法,利用DC中的判别阶段作为可验证锚点,在RL过程中为困难样本提供方向性提示,从而缓解奖励稀疏问题并提升策略优化效率。该方法不仅显著提升了细粒度情感回归任务的准确性与推理链条质量,还在跨域评估中展现出更强泛化能力,验证了显式推理步骤对模型鲁棒性的积极贡献。

链接: https://arxiv.org/abs/2604.00013
作者: Miaosen Luo,Zhenhao Yang,Jieshen Long,Jinghu Sun,Yichu Liu,Sijie Mai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end “black-box” nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.

[NLP-85] Finding and Reactivating Post-Trained LLM s Hidden Safety Mechanisms

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在经过后训练(post-training)以提升特定任务性能(如推理能力)时所引发的安全性下降问题。研究表明,后训练过程会掩盖基础模型原有的安全机制,并过度放大与训练目标相关的表征,从而导致模型产生更具危害性的行为。解决方案的关键在于:识别出后训练并未彻底移除原始安全机制,而是将其抑制;因此提出一种轻量级、低成本的方法 SafeReAct,通过在少量层上对 LoRA(Low-Rank Adaptation)适配器进行对齐,恢复被抑制的安全行为。实验表明,该方法在不损害推理性能的前提下显著提升了模型安全性,且适用于多种领域专用的大语言模型。

链接: https://arxiv.org/abs/2604.00012
作者: Mingjie Li,Wai Man Si,Michael Backes,Yang Zhang,Yisen Wang
机构: Cranberry-Lemon University (cranberry-lemon大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as the fine-tuned or post-trained models tend to exhibit more harmful behaviors compared with the regular LLMs before post-training or fine-tuning, potentially leading to harmful outcomes due to their enhanced capabilities. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation in this paper. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM, while over-amplifying representations related to their post-training ability. But luckily, we also find that LRMs’ safety mechanisms still exist instead of being removed during their post-training. Based on these findings, we propose a lightweight and cost-effective solution called SafeReAct that restores the suppressed safety behaviors by aligning with LoRA adapters on a few layers. Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. Besides LRMs, additional results on other domain-specific LLMs, like medical models, further confirm the generality and effectiveness of our approach.
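SafeReAct 依赖的 LoRA 适配器机制,可以用如下纯 Python 示意说明:权重增量为低秩乘积 B·A,推理时无需显式构造完整增量矩阵。矩阵形状与数值均为示例,与论文的具体层选择无关:

```python
def apply_lora(W, B, A, x, scale=1.0):
    # y = (W + scale * B @ A) @ x,但不显式构造 B @ A:
    # 先算低秩投影 A @ x(r 维向量),再用 B 升回输出维度
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    low = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    delta = [scale * sum(b * l for b, l in zip(row, low)) for row in B]
    return [y + d for y, d in zip(base, delta)]

# 示例:2x2 冻结权重 W,秩 r=1 的适配器 B (2x1)、A (1x2)
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B = [[1], [2]]
```

scale 设为 0 即完全关闭适配器、退回冻结权重,这正是"仅在少数层加轻量适配器对齐"这类方法低成本的来源。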

[NLP-86] Can LLMs Perceive Time? An Empirical Investigation ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在任务执行时间估计方面的严重缺陷,即模型无法准确预测自身推理过程所需的时间。研究通过四项实验覆盖68个任务和四种模型家族发现,预判性时间估计普遍高估实际耗时4–7倍(p < 0.001),且在任务排序判断中表现接近随机水平(如GPT-5在反直觉任务对上仅得18%,p = 0.033),表明模型依赖启发式而非真实推理时间。关键发现是:尽管模型具备来自训练数据的命题性时间知识(propositional knowledge about duration),却缺乏对其自身推理延迟的具身化经验(experiential grounding in their own inference time),导致其在多步代理设置中仍存在5–10倍误差。这一认知局限对智能体调度、规划及时间敏感场景具有重要实践影响。

链接: https://arxiv.org/abs/2604.00010
作者: Aniketh Garikaparthi
机构: TCS Research (塔塔咨询研究部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2026 I Can’t Believe It’s Not Better Workshop

点击查看摘要

Abstract:Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4–7× (p < 0.001), with models predicting human-scale minutes for tasks completing in seconds. Relative ordering fares no better: on task pairs designed to expose heuristic reliance, models score at or below chance (GPT-5: 18% on counter-intuitive pairs, p = 0.033), systematically failing when complexity labels mislead. Post-hoc recall is disconnected from reality – estimates diverge from actuals by an order of magnitude in either direction. These failures persist in multi-step agentic settings, with errors of 5–10×. The models possess propositional knowledge about duration from training but lack experiential grounding in their own inference time, with practical implications for agent scheduling, planning and time-critical scenarios.
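文中"高估 4–7 倍"这类倍数宜用几何平均汇总(乘性误差下算术平均会被极端值拉偏)。以下为计算示意,数值为虚构:

```python
import math

def overestimation_factor(predicted, actual):
    # 各任务 预测时长/实际时长 的几何平均:exp(对数比值的均值)
    ratios = [p / a for p, a in zip(predicted, actual)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# 虚构示例:三个任务分别高估 6、4、5 倍
factor = overestimation_factor([120, 300, 60], [20, 75, 12])
```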

[NLP-87] Eyla: Toward an Identity-Anchored LLM Architecture with Integrated Biological Priors – Vision, Implementation Attempt, and Lessons from AI-Assisted Development

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对对抗性压力时难以维持身份一致性(identity consistency)的问题,即模型无法稳定保持自我认知、承认不确定性并抵御操纵。解决方案的关键在于提出一种名为Eyla的身份锚定式LLM架构,其核心是将一系列受生物学启发的子系统——包括HiPPO初始化的状态空间模型、零初始化适配器、情景记忆检索机制以及校准不确定性的训练策略——整合进一个统一的代理操作系统中,并在消费级硬件上运行。该架构通过引入身份一致性评分(Identity Consistency Score, ICS)作为评估指标,试图从结构和训练层面增强模型的内在稳定性与可解释性。

链接: https://arxiv.org/abs/2604.00009
作者: Arif Aditto
机构: Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 tables, 25 references. Preprint under review for workshop submission

点击查看摘要

Abstract:We present the design rationale, implementation attempt, and failure analysis of Eyla, a proposed identity-anchored LLM architecture that integrates biologically-inspired subsystems – including HiPPO-initialized state-space models, zero-initialized adapters, episodic memory retrieval, and calibrated uncertainty training – into a unified agent operating system running on consumer hardware. Unlike existing approaches that optimize models for generic helpfulness, Eyla targets identity consistency: the ability to maintain a coherent self-model under adversarial pressure, admit uncertainty, and resist manipulation. We propose the Identity Consistency Score (ICS), a novel benchmark for evaluating this property across LLMs. We then present an honest account of attempting to implement this architecture using AI coding assistants (Claude Code, Cursor) as a non-programmer, documenting a 1,000+ failure that produced a 1.27B parameter model with 86 brain subsystems contributing less than 2% to output. Our analysis identifies five systematic failure modes of AI-assisted development for novel architectures and offers concrete recommendations. To our knowledge, this is the first paper to combine an architectural vision with a documented first-person failure analysis of AI-assisted LLM development, providing lessons for both the AI systems and AI-assisted software engineering communities.

[NLP-88] How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows

【速读】: 该论文旨在解决当前质性研究中广泛采用大语言模型(Large Language Model, LLM)进行解释性分析时,缺乏系统性评估其解释质量、且模型选择过程未经充分检验的问题。解决方案的关键在于通过对比LLM-as-judge自动化评分与训练有素的人类评审者对712个K-12数学教师访谈片段的解释质量评价(包括解释准确性、细微差异保留度和解释连贯性),验证自动化评估是否能有效反映人类判断趋势并指导模型筛选。结果表明,尽管LLM-as-judge在模型层面能捕捉到人类评价的整体趋势,但在具体评分幅度上存在显著偏差,其中连贯性(Coherence)指标与人类评价一致性最强,而忠实性(Faithfulness)和正确性(Correctness)在非字面意义和复杂语境下出现系统性偏离;安全性相关指标则与解释质量无关。因此,LLM-as-judge更适合用于淘汰表现不佳的模型,而非替代人工判断,为质性研究中LLM的系统比较与选型提供了实证依据。

链接: https://arxiv.org/abs/2604.00008
作者: Songhee Han,Jueun Shin,Jiyoon Han,Bung-Woo Jun,Hilal Ayan Karabatman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, this study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 conversational excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses using five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba). Automated evaluations were conducted using AWS Bedrock’s LLM-as-judge framework across five metrics, and a stratified subset of responses was independently rated by trained human evaluators on interpretive accuracy, nuance preservation, and interpretive coherence. Results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude. Among automated metrics, Coherence showed the strongest alignment with aggregated human ratings, whereas Faithfulness and Correctness revealed systematic misalignment at the excerpt level, particularly for non-literal and nuanced interpretations. Safety-related metrics were largely irrelevant to interpretive quality. These findings suggest that LLM-as-judge methods are better suited for screening or eliminating underperforming models than for replacing human judgment, offering practical guidance for systematic comparison and selection of LLMs in qualitative research workflows.

[NLP-89] Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

【速读】: 该论文旨在解决多模态统一建模中异构模态(如文本、图像、语音和视频)难以在单一架构下实现高效理解与生成的问题。现有方法要么采用自回归方式序列化处理不同模态,要么依赖外部专用解码器进行组合式建模,导致效率低下或系统复杂。其解决方案的关键在于提出 Dynin-Omni,一个基于掩码扩散(masked diffusion)的通用多模态基础模型,通过将所有模态映射到共享离散标记空间,并以迭代精炼的方式在双向上下文中进行建模,从而实现任意模态间的端到端联合学习与生成。这一设计突破了传统方法的局限,为实时多模态系统、跨模态检索与生成以及具身多模态智能体提供了灵活且强大的基础。

链接: https://arxiv.org/abs/2604.00007
作者: Jaeik Kim,Woojin Kim,Jihwan Hong,Yejoon Lee,Sieun Hyeon,Mintaek Lim,Yunseok Han,Dogeun Kim,Hoeun Lee,Hyunggeun Kim,Jaeyoung Do
机构: AIDAS Lab (AIDAS 实验室); Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.

[NLP-90] How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study

【速读】: 该论文旨在解决现有情感感知研究中将情绪仅视为表层风格因素或感知目标,而忽视其在任务处理中的机制性作用的问题。解决方案的关键在于提出E-STEER框架,该框架通过将情绪作为结构化且可控制的变量嵌入到大语言模型(Large Language Models, LLMs)和智能体的隐藏状态中,实现对模型行为的直接表示层面干预,并系统地揭示情绪对客观推理、主观生成、安全性及多步智能体行为的影响机制。

链接: https://arxiv.org/abs/2604.00005
作者: Moran Sun,Tianlin Li,Yuwei Zheng,Zhenhong Zhou,Aishan Liu,Xianglong Liu,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 11 figures

点击查看摘要

Abstract:Emotion plays an important role in human cognition and performance. Motivated by this, we investigate whether analogous emotional signals can shape the behavior of large language models (LLMs) and agents. Existing emotion-aware studies mainly treat emotion as a surface-level style factor or a perception target, overlooking its mechanistic role in task processing. To address this limitation, we propose E-STEER, an interpretable emotion steering framework that enables direct representation-level intervention in LLMs and agents. It embeds emotion as a structured, controllable variable in hidden states, and with it, we examine the impact of emotion on objective reasoning, subjective generation, safety, and multi-step agent behaviors. The results reveal non-monotonic emotion-behavior relations consistent with established psychological theories, and show that specific emotions not only enhance LLM capability but also improve safety, and systematically shape multi-step agent behaviors.

[NLP-91] LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在扩展上下文窗口(context window)时,因位置编码(Positional Encoding)尺度调整和轻量级持续预训练(Continual Pre-Training, CPT)导致的原始短文本任务性能下降问题。其核心挑战在于如何在不破坏模型原有能力的前提下,有效恢复长序列处理能力。解决方案的关键是提出LinearARD方法,这是一种基于自蒸馏(self-distillation)的学习策略,通过冻结原生旋转位置编码(Rotary Position Embeddings, RoPE)教师模型,引导RoPE缩放的学生模型在注意力结构上保持一致性;具体而言,LinearARD不直接匹配隐状态,而是对密集的Q/Q、K/K和V/V自关系矩阵的行分布进行对齐,从而显式监督注意力动态过程,并引入线性内存核以克服传统n×n关系图的二次内存瓶颈,实现高效且精确的KL散度计算与梯度传播。该方法仅需4.25M训练token即可显著恢复短文本性能(达98.3%基准水平),同时优于现有方法在长上下文任务上的表现。

链接: https://arxiv.org/abs/2604.00004
作者: Ning Yang,Hengyu Zhong,Wentao Wang,Baoliang Tian,Haijun Zhang,Jun Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense Q/Q, K/K, and V/V self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of n × n relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback-Leibler divergence and gradients. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks. Notably, our method achieves these results using only **4.25M** training tokens compared to the **256M** tokens required by LongReD and CPT. Our code is available at this https URL.
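
摘要中的线性内存核利用逐 token 的 log-sum-exp 统计量、分块重算 logits,从而在不一次性物化整行分布的情况下精确计算逐行 KL 散度。以下 NumPy 草图示意这一思路(简化示意,非论文中融合反向传播的 kernel 实现):

```python
import numpy as np

def row_kl_streaming(p_logits, q_logits, chunk=2):
    """Exact row-wise KL(softmax(p) || softmax(q)) accumulated over column
    chunks: only per-token log-sum-exp statistics plus one chunk of logits
    are live at a time, instead of full n x n probability rows."""
    def lse(x):
        m = x.max(axis=1, keepdims=True)
        return (m + np.log(np.exp(x - m).sum(axis=1, keepdims=True)))[:, 0]

    lse_p, lse_q = lse(p_logits), lse(q_logits)        # per-token statistics
    kl = np.zeros(p_logits.shape[0])
    for start in range(0, p_logits.shape[1], chunk):   # revisit logits chunkwise
        pc = p_logits[:, start:start + chunk] - lse_p[:, None]  # = log softmax(p)
        qc = q_logits[:, start:start + chunk] - lse_q[:, None]  # = log softmax(q)
        kl += (np.exp(pc) * (pc - qc)).sum(axis=1)
    return kl

rng = np.random.default_rng(0)
teacher, student = rng.normal(size=(3, 6)), rng.normal(size=(3, 6))
kl = row_kl_streaming(teacher, student)   # distill teacher rows into student rows
```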

[NLP-92] Benchmark for Assessing Olfactory Perception of Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在嗅觉感知推理能力方面的评估与提升问题,即如何衡量LLMs对气味信息的理解和预测能力,并探索其潜在的知识获取机制。解决方案的关键在于构建了一个名为Olfactory Perception (OP)的基准测试集,包含1,010个跨八类任务的问题,涵盖气味分类、主描述符识别、强度与愉悦度判断、多描述符预测、混合物相似性判断、嗅觉受体激活预测及真实气味源识别等;同时采用两种分子表示形式(化合物名称和同分异构SMILES)进行对比实验,发现当前LLMs主要依赖词汇关联而非结构化分子推理来处理嗅觉信息,且通过多语言预测集成可显著提升性能(AUROC = 0.86),表明LLMs应具备处理嗅觉信息的能力,而不仅限于视觉或听觉模态。

链接: https://arxiv.org/abs/2604.00002
作者: Eftychia Makri,Nikolaos Nakis,Laura Sisson,Gigi Minsky,Leandros Tassiulas,Vahid Satarifard,Nicholas A. Christakis
机构: Yale University (耶鲁大学); University of California (加州大学); Patina (帕蒂娜)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, odor primary descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representations. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean approx +7 points), suggesting current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4% overall accuracy, which highlights both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with AUROC = 0.86 for the best performing language ensemble model. LLMs should be able to handle olfactory and not just visual or aural information.
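
摘要提到跨语言聚合预测可提升嗅觉推理表现(最佳语言集成 AUROC = 0.86)。具体聚合方式未在摘要中给出,下面以最简单的多数投票为例做假设性示意:

```python
from collections import Counter

def ensemble_vote(per_language_predictions):
    """Aggregate one model's answers across language-translated prompts by
    majority vote -- one simple, illustrative way to ensemble predictions
    across languages (the paper's exact aggregation may differ)."""
    votes = Counter(per_language_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical answers to the same odor question asked in five languages:
answer = ensemble_vote(["musk", "musk", "citrus", "musk", "floral"])
```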

[NLP-93] Two-Stage Optimizer-Aware Online Data Selection for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在线微调中数据选择与重加权的问题,尤其针对数据序列到达、样本效用依赖于训练步骤以及自适应优化器影响有效更新几何结构的场景。现有方法多适用于离线设置,在线场景下表现不佳。其解决方案的关键在于提出一种优化器感知(optimizer-aware)的数据选择与重加权框架,将在线选择视为在当前优化器状态下塑造目标导向更新的过程,而非静态样本排序;通过构建一个优化器感知的更新匹配问题,将其与二阶目标效用关联,并强调子集级构造需考虑所选样本间的交互与冗余性。基于此,作者设计了两阶段“筛选-加权”算法(Filter-then-Weight),先几何筛选有用候选样本,再优化其权重系数,同时引入因子分解的外积梯度表示和高效矩阵运算以适配长上下文数据,从而显著提升收敛速度与下游任务性能。

链接: https://arxiv.org/abs/2604.00001
作者: Fangxin Wang,Peyman Baghershahi,Langzhou He,Henry Peng Zou,Sourav Medya,Philip S. Yu
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.
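
"先筛选后加权"(Filter-then-Weight)的两阶段思路可以草绘如下:先按与目标更新方向的余弦相似度筛选候选梯度,再用最小二乘拟合其组合权重,使加权更新逼近目标更新(示意性简化,略去了论文中的优化器状态与因子化梯度表示):

```python
import numpy as np

def filter_then_weight(grads, target, keep=3):
    """Two-stage sketch: (1) filter the `keep` candidate gradients most
    cosine-aligned with the target update; (2) least-squares fit their
    mixture weights so the weighted update matches the target."""
    sims = grads @ target / (np.linalg.norm(grads, axis=1) * np.linalg.norm(target))
    idx = np.argsort(sims)[-keep:]                                   # stage 1: filter
    weights, *_ = np.linalg.lstsq(grads[idx].T, target, rcond=None)  # stage 2: weight
    return idx, weights

grads = np.eye(6)                      # toy per-sample gradients
target = grads[2] + 0.5 * grads[5]     # desired target-oriented update
idx, weights = filter_then_weight(grads, target)
```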

[NLP-94] An Empirical Recipe for Universal Phone Recognition INTERSPEECH2026

【速读】: 该论文旨在解决多语言音素识别(Phone Recognition, PR)中模型泛化能力弱的问题,特别是现有以英语为主的高性能模型在跨语言场景下表现不佳,以及多语言模型未能有效利用预训练自监督学习(Self-Supervised Learning, SSL)表征的局限性。其解决方案的关键在于提出 PhoneticXEUS——一个基于大规模多语言数据训练的音素识别模型,通过系统性控制实验(ablation study)量化了SSL表征、数据规模和损失函数目标对多语言性能的影响,并在统一评估框架下验证了其优越性(多语言场景下PFER为17.7%,带口音英语场景下PFER为10.6%),同时分析了不同语系、口音及发音特征下的错误模式,为构建鲁棒多语言语音处理系统提供了可复现的训练范式。

链接: https://arxiv.org/abs/2603.29042
作者: Shikhar Bharadwaj,Chin-Jou Li,Kwanghee Choi,Eunjung Yeo,William Chen,Shinji Watanabe,David R. Mortensen
机构: Carnegie Mellon University (卡内基梅隆大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2026. Code: this https URL

点击查看摘要

Abstract:Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS – trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.
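
摘要中的 PFER 以音素特征距离加权计算错误率;作为参照,下面给出标准音素错误率(编辑距离除以参考序列长度)的最小实现(PFER 的特征加权部分此处从略):

```python
def phone_error_rate(ref, hyp):
    """Levenshtein edit distance between reference and hypothesis phone
    sequences, normalized by reference length (standard PER; PFER would
    additionally weight substitutions by phonetic feature distance)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n] / m

per = phone_error_rate(["k", "ae", "t"], ["k", "ah", "t"])  # one substitution
```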

信息检索

[IR-0] ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

【速读】:该论文旨在解决复杂深度研究任务中训练数据集构建的难题,特别是针对需要多步检索与推理的查询场景,传统方法依赖昂贵的人工标注或复杂的前置条件,难以规模化。其解决方案的关键在于提出一种轻量级(frugal)合成数据生成框架 ORBIT,该框架通过四个模块化阶段完成高质量训练样本的自动化生成:种子创建、问答对生成及双重验证(自验证与外部验证),无需付费API服务即可获得20K条需4–5步推理且答案简短可验证的高质量样本,覆盖15个领域。实验表明,基于ORBIT训练的Qwen3-4B模型在维基百科问答任务上表现优异,验证了合成数据的有效性与实用性。

链接: https://arxiv.org/abs/2604.01195
作者: Nandan Thakur,Zijian Chen,Xueguang Ma,Jimmy Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset with 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question–answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains and each training pair requires 4–5 reasoning steps, with external search verification required from the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experimental results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.
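
摘要描述的四阶段流程(种子创建 → 问答生成 → 自验证 → 外部验证)可抽象为如下流水线骨架(各组件签名均为示意性占位,实际分别由 LLM 与网页搜索工具实现):

```python
def orbit_pipeline(seeds, generate_qa, self_verify, external_verify):
    """Skeleton of a four-stage synthetic-data pipeline (stage names from
    the abstract; signatures illustrative): only QA pairs passing both
    self- and external verification are kept."""
    dataset = []
    for seed in seeds:
        for question, answer in generate_qa(seed):
            if self_verify(question, answer) and external_verify(question, answer):
                dataset.append({"seed": seed, "question": question, "answer": answer})
    return dataset

# Toy stand-ins for the LLM/search components:
gen = lambda s: [(f"What is {s}?", s.upper())]
selfv = lambda q, a: a.isupper()      # model re-checks its own answer
extv = lambda q, a: len(a) > 2        # e.g. confirmed via web search
data = orbit_pipeline(["orbit", "ai"], gen, selfv, extv)
```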

[IR-1] From Validity to Inter-Subjectivity: An Argument for Reliability Signals in Search Environments

【速读】:该论文旨在解决搜索环境中的认知论(epistemic)挑战,即传统以验证信息真伪为中心的应对策略在复杂信息传播场景中存在局限性。其解决方案的关键在于摒弃单纯依赖事实核查的路径,转而从信息检索系统如何塑造用户认知过程的角度出发,重新审视搜索引擎和信息平台在误导性内容扩散中的结构性作用。

链接: https://arxiv.org/abs/2604.01186
作者: Frans van der Sluis
机构: 未知
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: 4 pages. Extended abstract / conference paper for SEASON 2025 (September 24-25, 2025, Hamburg, Germany). Peer reviewed

点击查看摘要

Abstract:Search engines and information platforms are increasingly scrutinized for their role in spreading misinformation. Traditional responses often focus on detecting falsehoods or verifying the ultimate validity of claims. This paper argues that such a validity-centered framing is inadequate for the epistemic challenges of search environments.

[IR-2] Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics

【速读】:该论文旨在解决“作者是否在作品的信息论新颖性曲线中具有可识别的个体特征指纹”这一问题。解决方案的关键在于利用两个大规模语料库(Books3 和 PG-19)中的文本,通过多尺度分析方法提取并量化作者特有的新颖性动态模式:在书籍层面,基于标量动力学特征(如平均新颖度、速度、体量和迂回度)可显著高于随机水平地识别出43%的作者;在章节层面,则采用滑动窗口内的SAX(Symbolic Aggregate approXimation) motif模式识别技术,实现比随机水平高30倍的归属准确率,远超书籍层面的标量特征。这些不同尺度的信号互补而非冗余,表明作者语音可通过新颖性演化轨迹被有效捕捉,且该现象在一定程度上独立于体裁影响,尤其在约四分之一作者中保持跨体裁稳定性。

链接: https://arxiv.org/abs/2604.01073
作者: Fred Zimmerman,Hilmar AI
机构: Nimble Books LLC (Nimble Books LLC); Hilmar AI (Hilmar AI)
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: 12 pages, 6 figures, 4 tables

点击查看摘要

Abstract:We test whether authors have characteristic “fingerprints” in the information-theoretic novelty curves of their published works. Working with two corpora – Books3 (52,796 books, 759 qualifying authors) and PG-19 (28,439 books, 1,821 qualifying authors) – we find that authorial voice leaves measurable traces in how novelty unfolds across a text. The signal is multi-scale: at book level, scalar dynamics (mean novelty, speed, volume, circuitousness) identify 43% of authors significantly above chance; at chapter level, SAX motif patterns in sliding windows achieve 30x-above-chance attribution, far exceeding the scalar features that dominate at book level. These signals are complementary, not redundant. We show that the fingerprint is partly confounded with genre but persists within-genre for approximately one-quarter of authors. Classical authors (Twain, Austen, Kipling) show fingerprints comparable in strength to modern authors, suggesting the phenomenon is not an artifact of contemporary publishing conventions.
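
章节级指纹基于滑动窗口内新颖性曲线的 SAX 符号化。SAX 的核心步骤(z 标准化 → 分段聚合近似 PAA → 按标准正态分位点离散为符号)可实现如下(滑动窗口与 motif 挖掘部分从略):

```python
import numpy as np

def sax_word(series, n_segments=4, alphabet="abcd"):
    """Symbolic Aggregate approXimation sketch: z-normalize, average over
    equal-width segments (PAA), then bin segment means into symbols using
    the N(0,1) breakpoints for a 4-letter alphabet."""
    x = (series - series.mean()) / series.std()
    paa = x.reshape(n_segments, -1).mean(axis=1)
    breakpoints = np.array([-0.6745, 0.0, 0.6745])  # equiprobable N(0,1) bins
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

novelty = np.array([1, 1, 2, 2, 9, 9, 10, 10], dtype=float)
word = sax_word(novelty)   # low-novelty first half, high-novelty second half
```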

[IR-3] Aligning Recommendations with User Popularity Preferences

【速读】:该论文旨在解决推荐系统中存在的流行度偏差(popularity bias)问题,即推荐结果过度偏向热门物品,导致“富者愈富”的动态和内容同质化,同时使推荐与用户个性化偏好(如对热门或小众内容的倾向)产生偏离。其解决方案的关键在于引入“流行度分位数校准”(Popularity Quantile Calibration)这一测量框架,量化用户历史行为中对流行度的偏好与其推荐结果之间的偏差,并在此基础上提出SPREE方法——一种基于激活引导(activation steering)的推理阶段缓解策略。SPREE通过在表示空间中识别流行度方向,并根据每个用户的个人流行度偏差估计自适应地调整模型激活,实现对不同用户差异化、可变方向与强度的校准,从而提升用户-推荐器对齐度,而非采用全局统一的去偏方式。

链接: https://arxiv.org/abs/2604.01036
作者: Mona Schirmer,Anton Thielmann,Pola Schwöbel,Thomas Martynec,Giuseppe Di Benedetto,Ben London,Yannik Stein
机构: University of Amsterdam (阿姆斯特丹大学); Amazon Music (亚马逊音乐)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at FAccT 2026

点击查看摘要

Abstract:Popularity bias is a pervasive problem in recommender systems, where recommendations disproportionately favor popular items. This not only results in “rich-get-richer” dynamics and a homogenization of visible content, but can also lead to misalignment of recommendations with individual users’ preferences for popular or niche content. This work studies popularity bias through the lens of user-recommender alignment. To this end, we introduce Popularity Quantile Calibration, a measurement framework that quantifies misalignment between a user’s historical popularity preference and the popularity of their recommendations. Building on this notion of popularity alignment, we propose SPREE, an inference-time mitigation method for sequential recommenders based on activation steering. SPREE identifies a popularity direction in representation space and adaptively steers model activations based on an estimate of each user’s personal popularity bias, allowing both the direction and magnitude of steering to vary across users. Unlike global debiasing approaches, SPREE explicitly targets alignment rather than uniformly reducing popularity. Experiments across multiple datasets show that SPREE consistently improves user-level popularity alignment while preserving recommendation quality.
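
摘要中的流行度分位数校准思想,可以用"用户历史与推荐结果在目录流行度分布中的平均分位数之差"来示意(以下为简化版度量,并非论文的确切定义):

```python
import numpy as np

def popularity_quantile(item_counts, item):
    """Quantile of an item's interaction count within the catalog's
    popularity distribution (0 = most niche, 1 = most popular)."""
    counts = np.array(sorted(item_counts.values()))
    return np.searchsorted(counts, item_counts[item], side="right") / len(counts)

def quantile_miscalibration(item_counts, history, recommendations):
    """Gap between a user's historical popularity preference and the
    popularity of their recommendations (simplified illustration)."""
    hist = np.mean([popularity_quantile(item_counts, i) for i in history])
    recs = np.mean([popularity_quantile(item_counts, i) for i in recommendations])
    return recs - hist

counts = {"a": 1, "b": 5, "c": 20, "d": 100}   # toy catalog interaction counts
gap = quantile_miscalibration(counts, history=["a", "b"], recommendations=["c", "d"])
```

`gap > 0` 表示推荐比该用户的历史偏好更偏向热门内容,正是 SPREE 想要纠正的失配方向。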

[IR-4] Doctor-RAG: Failure-Aware Repair for Agentic Retrieval-Augmented Generation

【速读】:该论文旨在解决Agentic Retrieval-Augmented Generation (Agentic RAG) 在多跳问答和复杂知识推理任务中因推理路径变长而导致的失败频发问题。现有方法通常通过诊断分析后停止或重新运行整个检索-推理流水线来处理失败,这导致计算开销大且存在冗余推理。其解决方案的关键在于提出一种统一的“诊断与修复”框架——Doctor-RAG (DR-RAG),该框架将失败处理分解为两个阶段:(i) 轨迹级失败诊断与定位,基于覆盖门控分类法(coverage-gated taxonomy)识别推理轨迹中最早的失败点;(ii) 工具条件下的局部修复,仅在诊断出的失败点进行干预,并最大程度复用已验证的推理前缀和检索证据。通过显式分离错误归因与修正过程,DR-RAG实现了精准定位与最小代价干预,显著提升了答案准确性并降低了推理Token消耗。

链接: https://arxiv.org/abs/2604.00865
作者: Shuguang Jiao,Chengkai Huang,Shuhan Qi,Xuan Wang,Yifan Li,Lina Yao
机构: Harbin Institute of Technology Shenzhen (哈尔滨工业大学深圳); Macquarie University and UNSW (麦考瑞大学和新南威尔士大学); UNSW and CSIRO’s Data61 (新南威尔士大学和澳大利亚联邦科学与工业研究组织数据61实验室)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Agentic Retrieval-Augmented Generation (Agentic RAG) has become a widely adopted paradigm for multi-hop question answering and complex knowledge reasoning, where retrieval and reasoning are interleaved at inference time. As reasoning trajectories grow longer, failures become increasingly common. Existing approaches typically address such failures by either stopping at diagnostic analysis or rerunning the entire retrieval-reasoning pipeline, which leads to substantial computational overhead and redundant reasoning. In this paper, we propose Doctor-RAG (DR-RAG), a unified diagnose-and-repair framework that corrects failures in Agentic RAG through explicit error localization and prefix reuse, enabling minimal-cost intervention. DR-RAG decomposes failure handling into two consecutive stages: (i) trajectory-level failure diagnosis and localization, which attributes errors to a coverage-gated taxonomy and identifies the earliest failure point in the reasoning trajectory; and (ii) tool-conditioned local repair, which intervenes only at the diagnosed failure point while maximally reusing validated reasoning prefixes and retrieved evidence. By explicitly separating error attribution from correction, DR-RAG enables precise error localization, thereby avoiding expensive full-pipeline reruns and enabling targeted, efficient repair. We evaluate DR-RAG across three multi-hop question answering benchmarks, multiple agentic RAG baselines, and different backbone models. Experimental results demonstrate that DR-RAG substantially improves answer accuracy while significantly reducing reasoning token consumption compared to rerun-based repair strategies.
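
"定位最早失败点 + 复用已验证前缀 + 局部修复"的核心控制流可草绘如下(`is_valid`、`regenerate` 为示意性占位,实际分别对应诊断模块与工具条件修复):

```python
def repair_trajectory(steps, is_valid, regenerate):
    """Locate the earliest invalid step, keep the validated prefix, and
    regenerate only from the failure point onward -- avoiding a full
    pipeline rerun (signatures illustrative)."""
    for i, step in enumerate(steps):
        if not is_valid(step):
            return steps[:i] + regenerate(steps[:i])  # prefix reuse + local repair
    return steps                                       # nothing to repair

traj = ["retrieve A", "reason over A", "BAD retrieval", "answer"]
fixed = repair_trajectory(
    traj,
    is_valid=lambda s: not s.startswith("BAD"),
    regenerate=lambda prefix: ["retrieve B", "answer from B"],
)
```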

[IR-5] Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers

【速读】:该论文旨在解决多对象图像场景下的人类在环路目标检索(Human-in-the-Loop Object Retrieval)问题,即从大规模未标注图像集合中,仅依赖初始查询和用户相关性反馈(Relevance Feedback),快速识别出目标类别多样化的图像实例。其核心挑战在于:在复杂背景中定位小区域的目标对象,传统全局描述符难以有效捕捉局部细粒度特征。解决方案的关键在于利用预训练视觉Transformer(ViT)的表示能力,系统性地探讨四个关键设计问题——如何选择图像中的目标实例、标注形式的设计、主动学习(Active Learning)的选择策略以及表征方法的优化,从而在全局上下文与局部对象细节之间取得平衡,提升交互式检索性能。

链接: https://arxiv.org/abs/2604.00809
作者: Kawtar Zaher,Olivier Buisson,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user’s Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between relevant and non-relevant images to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object’s features. We compare several representation strategies across multi-object datasets highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.

[IR-6] A novel three-step approach to forecast firm-specific technology convergence opportunity via multi-dimensional feature fusion

【速读】:该论文旨在解决现有技术融合(Technology Convergence, TC)研究中普遍存在的两个问题:一是多数研究聚焦于行业层面的TC预测,缺乏针对企业特定技术机会发现(Firm-specific Technology Opportunity Discovery, TOD)的精准预测方法;二是尽管专利文档蕴含丰富的计量指标、网络结构和文本特征,但现有研究通常仅利用其中一到两个维度,未能有效融合多维特征以提升预测性能。解决方案的关键在于提出一种三步集成方法:首先,在国际专利分类(IPC)对层面融合专利的计量、网络结构与文本特征,并引入注意力机制增强特征表达;其次,基于两阶段集成学习模型结合多种不平衡处理策略识别IPC层级的TC机会;最后,通过检索增强生成(Retrieval-Augmented Generation, RAG)技术与大语言模型(Large Language Model, LLM)对主题级TC机会进行评估,从而获得可落地的企业级TC机会。该方法显著提升了TC预测在企业技术战略决策中的适用性与准确性。

链接: https://arxiv.org/abs/2604.00803
作者: Fu Gu,Ao Chen,Yingwen Wu
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:As a crucial innovation paradigm, technology convergence (TC) is gaining ever-increasing attention. Yet, existing studies primarily focus on predicting TC at the industry level, with little attention paid to TC forecast for firm-specific technology opportunity discovery (TOD). Moreover, although technological documents like patents contain a rich body of bibliometric, network structure, and textual features, such features are underexploited in the extant TC predictions; most of the relevant studies only used one or two dimensions of these features, and all the three dimensional features have rarely been fused. Here we propose a novel approach that fuses multi-dimensional features from patents to predict TC for firm-specific TOD. Our method comprises three steps, which are elaborated as follows. First, bibliometric, network structure, and textual features are extracted from patent documents, and then fused at the International Patent Classification (IPC)-pair level using attention mechanisms. Second, IPC-level TC opportunities are identified using a two-stage ensemble learning model that incorporates various imbalance-handling strategies. Third, to acquire feasible firm-specific TC opportunities, the performance metrics of topic-level TC opportunities, which are refined from IPC-level opportunities, are evaluated via retrieval-augmented generation (RAG) with a large language model (LLM). We prove the effectiveness of our proposed approach by predicting TC opportunities for a leading Chinese auto part manufacturer, Zhejiang Sanhua Intelligent Controls co., ltd, in the domains of thermal management for energy storage and robotics. In sum, this work advances the theory and applicability of forecasting firm-specific TC opportunity through fusing multi-dimensional features and leveraging LLM-as-a-judge for technology opportunity evaluation.

[IR-7] STCALIR: Semi-Synthetic Test Collection for Algerian Legal Information Retrieval

【速读】:该论文旨在解决在低资源法律领域(如阿尔及利亚法律文本)中构建高质量测试集合的难题,因人工标注成本高昂且相关语料与相关性判断稀缺。解决方案的关键在于提出STCALIR框架,该框架基于Cranfield范式,通过自动化多阶段检索与过滤流程生成半合成测试集合,在保持主题、语料库和相关性判断三大核心要素的同时,实现了99%的标注工作量减少。实验证明,该方法生成的相关性判断在检索效果上可媲美人工标注(Hit@10 ≈ 0.785),且系统级排序与人工评估具有高度一致性(Kendall’s τ = 0.89,Spearman’s ρ = 0.92),从而为低资源法律领域提供了可复现且高效可靠的测试集合构建方案。

链接: https://arxiv.org/abs/2604.00731
作者: M’hamed Amine Hatem,Sofiane Batata,Amine Mammasse,Faiçal Azouaou
机构: École supérieure en Sciences et Technologies de l’Informatique et du Numérique (高等信息与数字技术科学学院); École nationale Supérieure d’Informatique ESI (国家计算机高级学院)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Test collections are essential for evaluating retrieval and re-ranking models. However, constructing such collections is challenging due to the high cost of manual annotation, particularly in specialized domains like Algerian legal texts, where high-quality corpora and relevance judgments are scarce. To address this limitation, we propose STCALIR, a framework for generating semi-synthetic test collections directly from raw legal documents. The pipeline follows the Cranfield paradigm, maintaining its core components of topics, corpus, and relevance judgments, while significantly reducing manual effort through automated multi-stage retrieval and filtering, achieving a 99% reduction in annotation workload. We validate STCALIR using the Mr. TyDi benchmark, demonstrating that the resulting semi-synthetic relevance judgments yield retrieval effectiveness comparable to human-annotated evaluations (Hit@10 ≈ 0.785). Furthermore, system-level rankings derived from these labels exhibit strong concordance with human-based evaluations, as measured by Kendall’s τ (0.89) and Spearman’s ρ (0.92). Overall, STCALIR offers a reproducible and cost-efficient solution for constructing reliable test collections in low-resource legal domains.
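
摘要引用的 Kendall τ 用于衡量两种评估方式下系统排名的一致性,其(忽略并列项的)定义可实现如下:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two system rankings: (concordant - discordant)
    pairs divided by the total number of pairs (tie handling omitted for
    brevity)."""
    n, score = len(rank_a), 0
    for i, j in combinations(range(len(rank_a)), 2):
        prod = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        score += 1 if prod > 0 else (-1 if prod < 0 else 0)
    return score / (n * (n - 1) / 2)

# Two nearly identical rankings of four retrieval systems:
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
```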

[IR-8] Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

【速读】:该论文旨在从统计学角度解释TF-IDF(词频-逆文档频率,Term Frequency-Inverse Document Frequency)这一经典词权重计算方法的内在机制,并探索其性能表现背后的理论基础。传统TF-IDF被广泛用于文本挖掘和文档分类任务中,但其理论依据并不清晰。论文提出一个基于惩罚似然比检验(penalized likelihood-ratio test)的框架,将词频分布建模为具有伽马先验的贝塔-二项分布(beta-binomial distribution),从而捕捉词语的突发性(burstiness,即词频在不同文档间波动显著,也称过离散性)。该框架下,备择假设引入对精度参数的伽马惩罚项以刻画词频波动,而零假设则采用简单的二项分布模型(忽略突发性)。通过推导该检验统计量所对应的词权重公式,论文发现其与TF-IDF在文档分类任务中表现相当,从而为TF-IDF提供了新的统计解释,并揭示了假设检验框架在开发新型词权重方案中的潜力。

链接: https://arxiv.org/abs/2604.00672
作者: Zeyad Ahmed,Paul Sheridan,Michael McIsaac,Aitazaz A. Farooque
机构: University of Prince Edward Island (爱德华王子岛大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Statistics Theory (math.ST)
备注: 27 pages, 3 tables, 7 figures, accepted in Discover Computing 2026

点击查看摘要

Abstract:TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme given rise to by this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
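
作为对照,经典 TF-IDF 的一种常见变体可实现如下(词频乘以对数逆文档频率;论文讨论的正是此类变体如何从惩罚似然比检验统计量中自然导出):

```python
import math

def tf_idf(term, doc, corpus):
    """Classical TF-IDF: raw term frequency in `doc` times the log inverse
    document frequency over `corpus` (one common variant of the formula)."""
    tf = doc.count(term)
    df = sum(term in d for d in corpus)
    return tf * math.log(len(corpus) / df)

corpus = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "ran"]]
score = tf_idf("cat", corpus[1], corpus)   # tf=2, df=2 out of 3 docs
```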

[IR-9] UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems

【速读】:该论文旨在解决推荐系统中模型缩放效率低下以及主流缩放架构(基于注意力机制、TokenMixer和因子分解机)缺乏统一理论框架的问题。其核心解决方案是提出一种统一的缩放架构 UniMixer,通过将规则化的 TokenMixer 转换为可参数化的特征混合模块,使 token 混合模式能够在训练过程中被优化和学习;同时,该设计消除了 TokenMixer 中“头数必须等于 token 数”的约束,并构建了一个统一的模块设计框架,连接了三种主流缩放方法。为进一步提升缩放投资回报率(ROI),作者还设计了轻量级版本 UniMixing-Lite,在显著提升性能的同时压缩模型参数与计算开销。

链接: https://arxiv.org/abs/2604.00590
作者: Mingming Ha,Guanchen Wang,Linxun Chen,Xuan Rao,Yuexin Shi,Tianbao Ma,Zhaojie Liu,Yunqian Fan,Zilong Lu,Yanan Niu,Han Li,Kun Gai
机构: Kuaishou Technology(快手科技)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, the scaling laws of recommendation models, which govern the relationship between performance and the parameters/FLOPs of recommenders, have attracted increasing attention. Currently, there are three mainstream architectures for achieving scaling in recommendation models, namely attention-based, TokenMixer-based, and factorization-machine-based methods, which exhibit fundamental differences in both design philosophy and architectural structure. In this paper, we propose a unified scaling architecture for recommendation systems, namely **UniMixer**, to improve scaling efficiency and establish a unified theoretical framework that unifies the mainstream scaling blocks. By transforming the rule-based TokenMixer to an equivalent parameterized structure, we construct a generalized parameterized feature mixing module that allows the token mixing patterns to be optimized and learned during model training. Meanwhile, the generalized parameterized token mixing removes the constraint in TokenMixer that requires the number of heads to be equal to the number of tokens. Furthermore, we establish a unified scaling module design framework for recommender systems, which bridges the connections among attention-based, TokenMixer-based, and factorization-machine-based methods. To further boost scaling ROI, a lightweight UniMixing module, **UniMixing-Lite**, is designed, which further compresses the model parameters and computational cost while significantly improving model performance. The scaling curves are shown in the following figure. Extensive offline and online experiments are conducted to verify the superior scaling abilities of **UniMixer**.
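
摘要中"将规则化 TokenMixer 参数化,使 token 混合模式可学习、且头数不必等于 token 数"的思想,可用如下 NumPy 草图示意(示意性简化,并非 UniMixer 的实际结构):

```python
import numpy as np

def parameterized_token_mix(tokens, mix_weights):
    """Mix information across tokens with a learnable (heads, tokens, tokens)
    weight tensor instead of a fixed rule-based pattern; `heads` here is a
    free choice, not tied to the number of tokens (illustrative)."""
    # tokens: (num_tokens, dim); mix_weights: (heads, num_tokens, num_tokens)
    mixed = np.einsum("htn,nd->htd", mix_weights, tokens)  # each head mixes tokens
    return mixed.mean(axis=0)                              # fuse heads (simplified)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
weights = rng.normal(size=(2, 5, 5))   # 2 heads, decoupled from the 5 tokens
out = parameterized_token_mix(tokens, weights)
```

训练时 `weights` 作为可学习参数被优化,退化为单位矩阵时即还原"不混合"的恒等映射。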

[IR-10] MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在电商产品表征学习中难以显式建模细粒度属性的问题。现有方法通常将MLLMs作为特征提取器,隐式编码全局嵌入,导致对局部细节信息的捕捉能力受限。为应对这一挑战,论文提出MOON3.0,其核心解决方案包括:(1) 设计多头模态融合模块以自适应整合原始输入信号;(2) 引入联合对比学习与强化学习框架,自主探索更有效的推理策略;(3) 提出细粒度残差增强模块,在网络前向传播过程中逐步保留局部细节信息。该方案显著提升了模型在零样本场景下对细粒度产品属性的理解能力。

链接: https://arxiv.org/abs/2604.00513
作者: Junxian Wu,Chenghan Fu,Zhanheng Nie,Daoze Zhang,Bowen Wan,Wanxian Guan,Chuan Yu,Jian Xu,Bo Zheng
机构: Alibaba Group(阿里巴巴集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model’s attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.

[IR-11] Evidence Units: Ontology-Grounded Document Organization for Parser-Independent Retrieval

【速读】:该论文旨在解决结构化文档在检索索引过程中因元素级分割导致语义不完整的问题,即当前方法将表格、图表、公式与其上下文文本分开处理,造成语义连贯单元被分散到多个独立检索候选中。其解决方案的关键在于提出一种与解析器无关的流水线,构建“证据单元”(Evidence Units, EUs),通过四个核心贡献实现:(1) 基于本体的角色归一化扩展DoCO标准,统一不同解析器输出的语义结构;(2) 采用基于全相似度矩阵的全局分配算法,最优地将段落分配至EUs;(3) 在Neo4j图数据库中设计决策层,以规则形式定义EU构造逻辑并利用两个不变量验证完整性;(4) 跨解析器验证表明,EUs的空间范围在MinerU和Docling间收敛,且性能增益在解析器引起的边界框差异下保持稳定。实验证明,该方法显著提升检索指标,如LCS从0.50提升至0.81,Recall@1从0.15提升至0.51(3.4倍)。

链接: https://arxiv.org/abs/2604.00500
作者: Yeonjee Han
机构: KT (Korea Telecom)
类目: Information Retrieval (cs.IR)
备注: 16 pages, 4 figures

点击查看摘要

Abstract:Structured documents–tables paired with captions, figures with explanations, equations with the paragraphs that interpret them–are routinely fragmented when indexed for retrieval. Element-level indexing treats every parsed element as an independent chunk, scattering semantically cohesive units across separate retrieval candidates. This paper presents a parser-independent pipeline that constructs Evidence Units (EUs): semantically complete document chunks that group visual assets with their contextual text. We introduce four contributions: (1) ontology-grounded role normalization extending DoCO that maps heterogeneous parser outputs to a unified semantic schema; (2) a semantic global assignment algorithm that optimally assigns paragraphs to EUs via a full similarity matrix; (3) a graph-based decision layer in Neo4j that formalizes EU construction rules and validates completeness through two invariants; and (4) cross-parser validation showing EU spatial footprints converge across MinerU and Docling, with gains preserved under parser-induced bbox variance. Experiments on OmniDocBench v1.0 (1,340 pages; 1,551 QA pairs) show EU-based chunking improves retrieval LCS by +0.31 (0.50 to 0.81). Recall@1 increases from 0.15 to 0.51 (3.4x) and MinK decreases from 2.58 to 1.72. Cross-parser results confirm the gain (LCS +0.23 to +0.31) is preserved across parsers. Text queries show the most dramatic gain: Recall@1 rises from 0.08 to 0.47.
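
摘要中基于全相似度矩阵把段落分配给证据单元(EU)的全局分配,可用逐行取最大相似度的贪心版本示意(论文最优分配算法的细节此处从略):

```python
import numpy as np

def assign_paragraphs(similarity):
    """Assign each paragraph (row) to the evidence unit (column) with the
    highest similarity -- a greedy simplification of the global assignment
    described in the abstract."""
    return similarity.argmax(axis=1)

sim = np.array([
    [0.9, 0.1],   # paragraph 0 clearly belongs to EU 0 (e.g. a table caption)
    [0.2, 0.8],   # paragraph 1 explains the figure grouped in EU 1
    [0.6, 0.5],   # paragraph 2 is ambiguous but leans toward EU 0
])
assignment = assign_paragraphs(sim)
```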

[IR-12] FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval

【速读】:该论文旨在解决传统文档检索系统在提供细粒度证据线索(如特定相关片段)方面的不足,同时避免在检索后使用大语言模型(Large Language Model, LLM)带来的显著计算开销和部署限制。其解决方案的关键在于提出FGR-ColBERT,即对ColBERT检索模型的改进版本,通过将LLM蒸馏出的细粒度相关性信号直接整合进检索函数中,从而在保持高检索效果(99%相对Recall@50)的同时,实现更高效的推理(仅增加约1.12倍延迟),并在MS MARCO数据集上以110M参数量达到64.5的token级F1,超越了27B参数的Gemma 2模型(62.8)。

链接: https://arxiv.org/abs/2604.00242
作者: Antonín Jarolím,Martin Fajčík
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Document retrieval identifies relevant documents but does not provide fine-grained evidence cues, such as specific relevant spans. A possible solution is to apply an LLM after retrieval; however, this introduces significant computational overhead and limits practical deployment. We propose FGR-ColBERT, a modification of ColBERT retrieval model that integrates fine-grained relevance signals distilled from an LLM directly into the retrieval function. Experiments on MS MARCO show that FGR-ColBERT (110M) achieves a token-level F1 of 64.5, exceeding the 62.8 of Gemma 2 (27B), despite being approximately 245 times smaller. At the same time, it preserves retrieval effectiveness (99% relative Recall@50) and remains efficient, incurring only a ~1.12x latency overhead compared to the original ColBERT.
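
作为参考,下面是 ColBERT 式 late interaction(MaxSim)打分的一个极简示意,并用一个假设的文档 token 权重向量来模拟"细粒度相关性信号"对打分的影响;FGR-ColBERT 的具体整合方式以论文为准:

```python
def maxsim_score(query_vecs, doc_vecs, doc_token_weights=None):
    """ColBERT 式 late interaction:每个查询 token 贡献其与
    所有文档 token 点积的最大值。token 权重(此处为假设)
    用来示意蒸馏得到的细粒度相关性信号。"""
    if doc_token_weights is None:
        doc_token_weights = [1.0] * len(doc_vecs)
    score = 0.0
    for q in query_vecs:
        score += max(
            w * sum(qi * di for qi, di in zip(q, d))
            for d, w in zip(doc_vecs, doc_token_weights)
        )
    return score

q = [[1.0, 0.0], [0.0, 1.0]]
d = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim_score(q, d))               # 无权重:1.0 + 0.5 = 1.5
print(maxsim_score(q, d, [1.0, 2.0]))   # 提升第二个 token 的权重:1.0 + 1.0 = 2.0
```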

[IR-13] Scalable Identification and Prioritization of Requisition-Specific Personal Competencies Using Large Language Models

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 驱动的招聘工具在人员选拔中难以识别和优先排序与具体职位 requisition (req) 相关的个人能力(Personal Competencies, PCs)的问题,这些能力往往超越了通用岗位类别而决定候选人的成功表现。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的方法,其核心包括动态少样本提示(dynamic few-shot prompting)、基于反思的自我改进机制(reflection-based self-improvement)、基于相似度的过滤策略(similarity-based filtering)以及多阶段验证(multi-stage validation),从而实现对 req-specific PCs 的精准识别与排序,在实际应用中达到了平均准确率 0.76,接近人类专家间的一致性水平,并保持较低的偏离范围率(0.07)。

链接: https://arxiv.org/abs/2604.00006
作者: Wanxin Li,Denver McNeney,Nivedita Prabhu,Charlene Zhang,Renee Barr,Matthew Kitching,Khanh Dao Duc,Anthony S. Boyce
机构: University of British Columbia (不列颠哥伦比亚大学); Amazon
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI-powered recruitment tools are increasingly adopted in personnel selection, yet they struggle to capture the requisition (req)-specific personal competencies (PCs) that distinguish successful candidates beyond job categories. We propose a large language model (LLM)-based approach to identify and prioritize req-specific PCs from reqs. Our approach integrates dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation. Applied to a dataset of Program Manager reqs, our approach correctly identifies the highest-priority req-specific PCs with an average accuracy of 0.76, approaching human expert inter-rater reliability, and maintains a low out-of-scope rate of 0.07.
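
其中"基于相似度的过滤"一步可以用如下极简示意理解:对 LLM 生成的候选能力做贪心去重,与已保留项的嵌入相似度超过阈值的候选被过滤掉。阈值 0.9 以及示例中的能力名称、向量均为本文假设:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_similar(candidates, threshold=0.9):
    """贪心相似度过滤:仅当候选能力与所有已保留项都
    不近似重复时才保留。candidates 为 (名称, 嵌入) 对。"""
    kept = []
    for name, vec in candidates:
        if all(cosine(vec, kv) < threshold for _, kv in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]

cands = [
    ("stakeholder management", [1.0, 0.0]),
    ("managing stakeholders", [0.99, 0.05]),  # 近似重复,被过滤
    ("data analysis", [0.0, 1.0]),
]
print(filter_similar(cands))  # → ['stakeholder management', 'data analysis']
```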

[IR-14] A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction

【速读】:该论文旨在解决从基于文本的学术文档(如KRS文档)中高效且可靠地提取信息的问题,尤其是在计算资源受限环境下的应用挑战。其解决方案的关键在于融合确定性规则方法与大语言模型(LLM)的优势,提出三种策略:纯LLM方法、基于正则表达式与LLM结合的混合方法(Hybrid Deterministic - LLM),以及以Camelot为基础的管道流程并辅以LLM回退机制。实验表明,混合方法在处理结构化元数据时效率更高,而Camelot+LLM回退方案在准确率(EM和LS达0.99–1.00)和计算效率(多数情况下每份PDF处理时间低于1秒)之间取得了最佳平衡,验证了多范式集成方法在资源受限场景下对信息抽取任务的可靠性与实用性提升。

链接: https://arxiv.org/abs/2604.00003
作者: Muhammad Anis Al Hilmi,Neelansh Khare,Noel Framil Iglesias
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 9 pages, 6 figures, 3 tables

点击查看摘要

Abstract:This study evaluates the reliability of information extraction approaches from KRS documents using three strategies: LLM only, Hybrid Deterministic - LLM (regex + LLM), and a Camelot based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM based test and 860 documents for the Camelot based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12 - 14B LLM models (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama and a consumer grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM only, especially for deterministic metadata. The Camelot based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99 - 1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM methods is increasingly reliable and efficient for information extraction from text based academic documents in computationally constrained environments.
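
评估中使用的 Levenshtein 相似度(阈值 0.7)可按如下方式计算;这是通用的动态规划写法,非论文原始代码,示例字符串亦为假设:

```python
def levenshtein(a, b):
    # 经典动态规划编辑距离
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # 删除
                           cur[j - 1] + 1,              # 插入
                           prev[j - 1] + (ca != cb)))   # 替换
        prev = cur
    return prev[-1]

def levenshtein_similarity(a, b):
    """归一化到 [0, 1] 的相似度;该研究将相似度达到 0.7
    阈值的字段计为正确。"""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Informatika", "Informatics"))                     # → 2
print(levenshtein_similarity("Informatika", "Informatics") >= 0.7)   # → True
```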

人机交互

[HC-0] Assessing Affective Objectives for Communicative Visualizations

【速读】:该论文旨在解决在设计具有情感目标(affective objectives)的沟通型可视化时,缺乏有效评估工具的问题。现有方法对认知目标(cognitive objectives)已有大量验证良好的评估手段,但情感目标因其内在主观性和复杂性难以量化。解决方案的关键在于构建一套跨学科的评估标准框架,整合来自教育学、倡导、经济学、健康和心理学领域的评估工具,从而能够系统性地衡量可视化在激发态度变化、情感共鸣等非认知效果上的表现。该框架在结合个人叙事与可视化设计的复杂任务中得以验证,使不同设计方案能够在既定情感目标和竞争性心理理论背景下进行比较与优化。

链接: https://arxiv.org/abs/2604.01183
作者: Elsie Lee-Robbins,Eytan Adar
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Using learning objectives to define designer intents for communicative visualizations can be a powerful design tool. Cognitive and affective objectives are concrete and specific, which can be translated to assessments when creating, evaluating, or comparing visualization ideas. However, while there are many well-validated assessments for cognitive objectives, affective objectives are uniquely challenging. It is easy to see if a visualization helps someone remember the number of patients in a clinic, but harder to observe the change in their attitudes around donations to a crisis. In this work, we define a set of criteria for selecting assessments–from education, advocacy, economics, health, and psychology–that align with affective objectives. We illustrate the use of the framework in a complex affective design task that combines personal narratives and visualizations. Our chosen assessments allow us to evaluate different designs in the context of our objectives and competing psychological theories.

[HC-1] True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在识别、解释误导性可视化图表方面的能力问题,特别是其能否准确捕捉视觉误导的表征、背后成因及潜在意图。解决方案的关键在于构建一个融合可视化修辞(visualization rhetoric)理论与新型作者意图分类体系的分析框架,并基于包含2,336条新冠相关推文(半数含误导性可视化)的数据集开展实验验证;同时引入VisLies社区案例和专家用户研究以增强评估的现实性和可比性,从而系统评估16种前沿MLLMs(涵盖从12B到1000B参数规模的不同架构)的表现,揭示模型判断与人类专家的一致性与差异性,为提升生成式AI在可视化批判性认知中的可靠性提供实证依据。

链接: https://arxiv.org/abs/2604.01181
作者: Graziano Blasilli,Marco Angelini
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.

[HC-2] Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators

【速读】:该论文旨在解决生成式 AI(Generative AI)在教育场景中应用时,学生对AI助手的信任如何影响其在编程问题解决任务中的适当依赖行为这一核心问题。研究发现,学生对AI的信任与适当依赖之间存在非线性关系:信任水平过高反而导致对AI建议的批判性判断能力下降,即难以区分正确与错误的推荐;而这一关系受到学生AI素养(AI literacy)和认知需求(need for cognition)的显著调节。解决方案的关键在于设计能够促进学生对AI辅助信息进行反思性评估的教学与系统支持机制,从而引导其形成更合理的使用策略,避免盲目依赖。

链接: https://arxiv.org/abs/2604.01114
作者: Griffin Pitts,Neha Rani,Weedguet Mildort
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注: Full paper accepted to the 27th International Conference on AI in Education (AIED 2026). AIED Proceedings to be released Summer 2026

点击查看摘要

Abstract:As generative AI systems are integrated into educational settings, students often encounter AI-generated output while working through learning tasks, either by requesting help or through integrated tools. Trust in AI can influence how students interpret and use that output, including whether they evaluate it critically or exhibit overreliance. We investigate how students’ trust relates to their appropriate reliance on an AI assistant during programming problem-solving tasks, and whether this relationship differs by learner characteristics. With 432 undergraduate participants, students’ completed Python output-prediction problems while receiving recommendations and explanations from an AI chatbot, including accurate and intentionally misleading suggestions. We operationalize reliance behaviorally as the extent to which students’ responses reflected appropriate use of the AI assistant’s suggestions, accepting them when they were correct and rejecting them when they were incorrect. Pre- and post-task surveys assessed trust in the assistant, AI literacy, need for cognition, programming self-efficacy, and programming literacy. Results showed a non-linear relationship in which higher trust was associated with lower appropriate reliance, suggesting weaker discrimination between correct and incorrect recommendations. This relationship was significantly moderated by students’ AI literacy and need for cognition. These findings highlight the need for future work on instructional and system supports that encourage more reflective evaluation of AI assistance during problem-solving.
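
文中将"适当依赖"操作化为一种行为指标:接受正确建议、拒绝错误建议均计为适当。下面的计分方式是本文的示意性重构,并非论文的原始度量:

```python
def appropriate_reliance(events):
    """行为化的适当依赖得分:参与者在多少比例的试次中
    接受了正确的 AI 建议或拒绝了错误的建议。
    每个 event 为 (ai_correct, accepted) 二元组。"""
    appropriate = sum(1 for ai_correct, accepted in events
                      if accepted == ai_correct)
    return appropriate / len(events)

trials = [
    (True, True),    # 建议正确且被接受 -> 适当
    (True, False),   # 建议正确但被拒绝 -> 过度怀疑
    (False, True),   # 建议错误却被接受 -> 过度依赖
    (False, False),  # 建议错误且被拒绝 -> 适当
]
print(appropriate_reliance(trials))  # → 0.5
```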

[HC-3] FlexAI: A Multi-modal Solution for Delivering Personalized and Adaptive Fitness Interventions

【速读】:该论文旨在解决当前健身解决方案缺乏实时、自适应反馈的问题,这些问题通常仅依赖静态训练计划,无法根据用户的生理状态(如疼痛阈值、疲劳水平或动作姿势)进行动态调整。其核心解决方案是提出FlexAI系统,该系统融合计算机视觉、生理传感器(心率与语音)以及大语言模型(Large Language Models, LLMs)的推理能力,实现对用户运动强度、休息时间及动机水平的实时个性化干预。关键创新在于多模态感知与LLM驱动决策的协同,使系统能够持续监测用户状态并生成可靠、个性化的指导策略,从而显著提升用户体验和训练效果。

链接: https://arxiv.org/abs/2604.00968
作者: Shivangi Agarwal,Zoya Ghoshal,Bharat Jain,Siddharth Siddharth
机构: HTI Lab, Plaksha University (普拉克沙大学); Plaksha University (普拉克沙大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Personalization of exercise routines is a crucial factor in helping people achieve their fitness goals. Despite this, many contemporary solutions fail to offer real-time, adaptive feedback tailored to an individual’s physiological states. Contemporary fitness solutions often rely only on static plans and do not adjust to factors such as a user’s pain thresholds, fatigue levels, or form during a workout routine. This work introduces FlexAI, a multi-modal system that integrates computer vision, physiological sensors (heart rate and voice), and the reasoning capabilities of Large Language Models (LLMs) to deliver real-time, personalized workout guidance. FlexAI continuously monitors a user’s physical form and level of exertion, among other parameters, to provide dynamic interventions focused on exercise intensity, rest periods, and motivation. To validate our system, we performed a technical evaluation confirming our models’ accuracy and quantifying pipeline latency, alongside an expert review where certified trainers validated the correctness of the LLM’s interventions. Furthermore, in a controlled study with 25 participants, FlexAI demonstrated significant improvements over a static, non-adaptive control system. With FlexAI, users reported significantly greater enjoyment, a stronger sense of achievement, and significantly lower levels of boredom and frustration. These results indicate that by integrating multi-modal sensing with LLM-driven reasoning, adaptive systems like FlexAI can create a more engaging and effective workout experience. Our work provides a blueprint for integrating multi-modal sensing with LLM-driven reasoning, demonstrating that it is possible to create adaptive coaching systems that are not only more engaging but also demonstrably reliable.
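
FlexAI 的干预决策细节并未公开,下面给出一个纯规则版的示意:根据当前心率占最大心率的比例选择干预动作。阈值与动作名称均为本文假设,仅用于说明"生理信号驱动实时干预"这一思路:

```python
def adjust_workout(hr, hr_max, reps_remaining):
    """玩具规则层:由心率负荷比例选择干预动作。"""
    load = hr / hr_max
    if load > 0.90:
        return {"action": "rest", "seconds": 60}          # 负荷过高:强制休息
    if load > 0.75:
        return {"action": "slow_down",                    # 负荷偏高:降低强度
                "reps": max(1, reps_remaining - 2)}
    return {"action": "continue", "reps": reps_remaining} # 正常:继续

print(adjust_workout(hr=150, hr_max=190, reps_remaining=8))
# → {'action': 'slow_down', 'reps': 6}
```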

[HC-4] AuraDesk: Data Physicalization through Olfaction Metaphors for Representing and Mitigating Workplace Stress

【速读】:该论文旨在解决职场压力管理中传统视觉或听觉干预手段易与注意力竞争、引发感官过载的问题,探索嗅觉作为替代性环境媒介来呈现与压力相关的生理信号。其解决方案的关键在于提出AuraDesk系统,该系统通过可穿戴设备获取连续生理数据,并结合局部生理状态推断与受限的气味执行策略,实现时间上受控、空间上局域化的嗅觉表达,从而在日常办公环境中提供不干扰专注力的氛围式反馈。

链接: https://arxiv.org/abs/2604.00869
作者: Siying Hu,Zhenhao Zhang
机构: The University of Queensland (昆士兰大学); University of Hong Kong (香港大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Workplace stress is often addressed through visual or auditory interventions, yet these modalities can compete with attention and contribute to sensory overload. We explore olfaction as an alternative ambient medium for representing stress-related physiological signals in office settings. We present AuraDesk, an olfactory data physicalization system that translates wearable-derived physiological cues into situated scent expressions at the workstation. The system combines local physiological state inference with a constrained actuation strategy to produce temporally regulated and spatially localized scent output suitable for everyday work environments. To examine the feasibility and experiential qualities of this approach, we conducted a one-day in-situ field deployment with 25 knowledge workers at their actual workstations. Our findings show that participants often interpreted the scent output not as an explicit alert, but as a subtle atmospheric cue that supported momentary awareness, micro-break taking, and perceived environmental attunement. At the same time, participants raised important concerns regarding scent preference, habituation, and contextual appropriateness in shared offices. This work contributes (1) an olfactory interface for physiologically driven ambient feedback in the workplace, (2) a hybrid mapping approach for coupling continuous biosignal interpretation with constrained scent actuation, and (3) empirical insights into how workers perceive, negotiate, and appropriate ambient olfactory feedback in real office contexts. Rather than claiming therapeutic efficacy, we position AuraDesk as a probe into the design space of olfactory data physicalization for workplace wellbeing and attention-sensitive interaction.

[HC-5] Evaluating the Feasibility of Augmented Reality to Support Communication Access for Deaf Students in Experiential Higher Education Contexts

【速读】:该论文旨在解决聋人及听力障碍(Deaf and Hard of Hearing, DHH)学生在高等教育实验性学习环境(如实验室)中因传统无障碍服务(如口译和字幕)导致的注意力分散、安全风险增加及认知负荷过高的问题。现有服务要求DHH学生在关键任务、安全提示、教学材料与辅助人员之间分配注意力,从而影响学习效果和安全性。解决方案的关键在于引入基于增强现实(Augmented Reality, AR)智能眼镜的“教育增强现实实时无障碍访问系统”(Augmented Reality Real-Time Access for Education, ARRAE),通过将口译员或字幕直接投射至学生视野中,实现无障碍信息与操作任务的无缝融合,从而不牺牲安全性和理解力的前提下提升沟通效率。实证研究显示,AR媒介可显著改善视觉注意力分配并降低感知认知负荷,但其成功应用依赖于对具体情境的适应性设计以及对显示位置、视觉疲劳和助听设备兼容性等关键工程因素的优化。

链接: https://arxiv.org/abs/2604.00856
作者: Roshan Mathew,Roshan L. Peiris
机构: Rochester Institute of Technology (罗切斯特理工学院); Accessible and Immersive Realities (AIR) Lab (可访问与沉浸式现实实验室)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Deaf and hard of hearing (DHH) students often experience communication barriers in higher education, which are particularly acute in experiential learning environments such as laboratories. Traditional accessibility services, such as interpreting and captioning, often require DHH students to divide their attention between critical tasks, potential safety hazards, instructional materials, and access providers, creating trade-offs between safety and equitable communication. These demands can disrupt task engagement and increase cognitive load in settings that require sustained visual focus, highlighting the limitations of current approaches. To address these challenges, this study investigates Augmented Reality Real-Time Access for Education (ARRAE), an ecosystem based on augmented reality (AR) smart glasses, as a potential intervention for laboratory-based environments. By overlaying interpreters or captions directly into a student’s field of view, AR enables the integration of accessibility into hands-on learning without compromising safety or comprehension. Through an empirical study with 12 DHH participants, we evaluate how AR-mediated access influences visual attention patterns and perceived cognitive load during hands-on tasks. The findings suggest that AR-mediated communication shows strong potential to improve attention management and communication accessibility in experiential learning environments, though participants emphasized that accessibility preferences are highly context-dependent. Participants also identified several design and ergonomic challenges, including display positioning, visual fatigue, and compatibility with hearing devices. Together, these results highlight both the promise of AR for supporting accessible participation in visually demanding environments and key design considerations for future systems.

[HC-6] Steering through Time: Blending Longitudinal Data with Simulation to Rethink Human-Autonomous Vehicle Interaction

【速读】:该论文旨在解决半自动化车辆(Semi-Automated Vehicles, SAVs)中控制权交接时人车交互有效性不足的安全挑战,尤其关注驾驶员在接管事件前的认知与生理状态缺乏时间维度信息的问题。其解决方案的关键在于构建一个融合纵向可穿戴传感与高保真驾驶模拟的混合框架,通过7天可穿戴生理数据采集、每日情绪与睡眠问卷以及实验室模拟中的多模态感知(包括眼动追踪、功能性近红外光谱fNIRS和生理指标),实现对驾驶员准备状态的动态评估,从而为个性化、情境感知的驾驶员监控系统提供数据支持与方法基础。

链接: https://arxiv.org/abs/2604.00832
作者: Yasaman Hakiminejad,Shiva Azimi,Luis Gomero,Elizabeth Pantesco,Irene P. Kan,Meltem Izzetoglu,Arash Tavakoli
机构: Villanova University (维拉诺瓦大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As semi-automated vehicles (SAVs) become more common, ensuring effective human-vehicle interaction during control handovers remains a critical safety challenge. Existing studies often rely on single-session simulator experiments or naturalistic driving datasets, which often lack temporal context on drivers’ cognitive and physiological states before takeover events. This study introduces a hybrid framework combining longitudinal mobile sensing with high-fidelity driving simulation to examine driver readiness in semi-automated contexts. In a pilot study with 38 participants, we collected 7 days of wearable physiological data and daily surveys on stress, arousal, valence, and sleep quality, followed by an in-lab simulation with scripted takeover events under varying secondary task conditions. Multimodal sensing, including eye tracking, fNIRS, and physiological measures, captured real-time responses. Preliminary analysis shows the framework’s feasibility and individual variability in baseline and in-task measures; for example, fixation duration and takeover control time differed by task type, and RMSSD showed high inter-individual stability. This proof-of-concept supports the development of personalized, context-aware driver monitoring by linking temporally layered data with real-time performance.
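
文中提到的 RMSSD(相邻 RR 间期差值的均方根)是标准的时域心率变异性指标,其计算方式如下;示例中的 RR 间期数值为假设:

```python
import math

def rmssd(rr_intervals_ms):
    """相邻 RR 间期(毫秒)逐次差值的均方根,
    是常用的时域心率变异性(HRV)指标。"""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

rr = [800, 810, 790, 805]  # RR 间期,单位毫秒
print(round(rmssd(rr), 2))  # → 15.55
```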

[HC-7] A Dual-Action Fabric-Based Soft Robotic Glove for Ergonomic Hand Rehabilitation

【速读】:该论文旨在解决神经损伤后手部功能障碍导致日常生活活动(Activities of Daily Living, ADL)能力受限的问题,尤其针对现有软体机器人手套在个性化适配、人体工学贴合度及屈伸协同驱动方面存在的局限性。解决方案的关键在于设计了一种基于织物的双动作软体机器人手套,集成与个体手指关节对齐的定制化气动执行器,每个手指配备独立控制的屈曲和伸展驱动单元,并额外设置拇指外展专用执行器;通过计算机数控热封技术制造具有凹面膨胀特性的对称腔室执行器,从而最大化手指接触面积并提升佩戴舒适性,同时实验验证其可提供满足ADL任务所需的关节力矩和指尖抓握力,初步研究表明该系统能显著降低前臂肌肉活动水平,促进更自然的抓握模式,为个性化手部康复与辅助机器人提供了可行路径。

链接: https://arxiv.org/abs/2604.00768
作者: Rui Chen,Firman Isma Serdana,Domenico Chiaradia,Xianlong Mai,Elena Losanno,Gabriele Righi,Claudia De Santis,Federica Serra,Vincent Mendez,Cristian Camardella,Daniele Leonardis,Giulio Del Popolo,Silvestro Micera,Antonio Frisoli
机构: Scuola Superiore Sant’Anna (圣安娜高等学院); University of Science and Technology of China (中国科学技术大学); Università Vita-Salute San Raffaele Sant’Anna School of Advanced Studies (圣拉斐尔生命健康大学圣安娜高级研究学院); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Hand impairment following neurological disorders substantially limits independence in activities of daily living, motivating the development of effective assistive and rehabilitation strategies. Soft robotic gloves have attracted growing interest in this context, yet persistent challenges in customization, ergonomic fit, and flexion-extension actuation constrain their clinical utility. Here, we present a dual-action fabric-based soft robotic glove incorporating customized actuators aligned with individual finger joints. The glove comprises five independently controlled dual-action actuators supporting finger flexion and extension, together with a dedicated thumb abduction actuator. Leveraging computer numerical control heat sealing technology, we fabricated symmetrical-chamber actuators that adopt a concave outer surface upon inflation, thereby maximizing finger contact area and improving comfort. Systematic characterization confirmed that the actuators generate sufficient joint moment and fingertip force for ADL-relevant tasks, and that the complete glove system produces adequate grasping force for common household objects. A preliminary study with ten healthy subjects demonstrated that active glove assistance significantly reduces forearm muscle activity during object manipulation. A pilot feasibility study with three individuals with cervical spinal cord injury across seven functional tasks indicated that glove assistance promotes more natural grasp patterns and reduces reliance on tenodesis grasp, although at the cost of increased task completion time attributable to the current actuation interface. This customizable, ergonomic design represents a practical step toward personalized hand rehabilitation and assistive robotics.

[HC-8] A wearable haptic device for edge and surface simulation

【速读】:该论文旨在解决虚拟现实(VR)环境中物体操作时触觉反馈不足的问题,特别是传统指尖触觉设备难以呈现边缘检测等关键触觉特征的局限性。其解决方案的关键在于设计了一种轻量化(24.3 g)、紧凑型的指尖触觉装置,采用新颖的双电机驱动机制,能够通过不同的刺激模式区分表面接触与边缘接触反馈;实验通过6×6柔性传感器阵列验证了两种刺激模式下压力分布的显著差异,并在初步用户研究中实现了93%的平均分类准确率和约2.79秒的平均响应时间,表明该设备能有效传递边缘与表面触觉线索,从而提升VR中物体操作的真实感与精确度。

链接: https://arxiv.org/abs/2604.00752
作者: Rui Chen,Xianlong Mai,Alireza Sanaei,Domenico Chiaradia,Antonio Frisoli,Daniele Leonardis
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Object manipulation is fundamental to virtual reality (VR) applications, yet conventional fingertip haptic devices fail to render certain tactile features relevant for immersive and precise interactions, as i.e. detection of edges. This paper presents a compact, lightweight fingertip haptic device (24.3 g) that delivers distinguishable surface and edge contact feedback through a novel dual-motor mechanism. Pressure distribution characterization using a 6 x 6 flexible sensor array demonstrates distinct contact patterns between the two stimulation modes. A preliminary user study with five participants achieved 93% average classification accuracy across four conditions (edge/surface contact with light/heavy pressure), with mean response times of 2.79 seconds. The results indicate that the proposed device can effectively convey edge and surface tactile cues, potentially enhancing object manipulation fidelity in VR environments.

[HC-9] In the Middle Not on Top: AI-Mediated Communication for Patient-Provider Care Relationships ALT

【速读】:该论文旨在解决人工智能(AI)在临床环境中引入后如何维持以关系为中心的照护(relationship-centered care)价值的问题,特别是如何在不削弱医患信任与意义连接的前提下,合理定位AI的角色。其解决方案的关键在于提出“居中而非主导”(middle, not top)的设计范式,即AI作为沟通中介而非决策主体,通过异步消息系统CLEAR的实证研究验证了该模式在缓解时间压力和健康素养差异等现实约束中的有效性;同时发现中介特性(如可用性、中立性)能够重新分配解释性工作并降低关系摩擦,从而将AI定位为一种关系基础设施(relational infrastructure),并强调设计中需权衡框架权力(framing power)与隐私保护之间的张力。

链接: https://arxiv.org/abs/2604.00643
作者: Ut Gong,Yibo Meng,Qihan Zhang,Xin Chen,Yan Guan
机构: Columbia University (哥伦比亚大学); Cornell University (康奈尔大学); The Sixth Primary School of Qianxi County (迁西县第六小学); Universidad Politécnica de Madrid (马德里理工大学); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 1 figure, Toward Relationship-Centered Care with AI: Designing for Human Connections in Healthcare workshop at CHI 2026

点击查看摘要

Abstract:Relationship-centered care relies on trust and meaningful connection. As AI enters clinical settings, we must ask not just what it can do, but how it should be positioned to support these values. We examine a “middle, not top” approach where AI mediates communication without usurping human judgment. Through studies of CLEAR, an asynchronous messaging system, we show how this configuration addresses real-world constraints like time pressure and uneven health literacy. We find that mediator affordances (e.g., availability, neutrality) redistribute interpretive work and reduce relational friction. Ultimately, we frame AI mediation as relational infrastructure, highlighting critical design tensions around framing power and privacy.

[HC-10] StretchBot: A Neuro-Symbolic Framework for Adaptive Guidance with Assistive Robots

【速读】:该论文旨在解决现有助行机器人在家庭和医疗环境中因依赖预设脚本而难以适应用户状态、环境情境及交互动态的问题,从而限制了其个性化与情境感知能力。解决方案的关键在于提出StretchBot——一种混合神经符号(neuro-symbolic)机器人教练系统,通过融合多模态感知与基于知识图谱的大语言模型推理,实现短时拉伸训练中的上下文自适应调整,同时保持结构化训练流程的稳定性。该方法利用结构化的可操作知识来约束语言模型的适应行为,使生成式AI(Generative AI)驱动的交互更具语义合理性与任务导向性。

链接: https://arxiv.org/abs/2604.00628
作者: Luca Vogelgesang,Ahmed Mehdi Soltani,Mohammadhossein Khojasteh,Xinrui Zu,Stefano De Giorgis,Madalina Croitoru,Filip Ilievski
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Assistive robots have growing potential to support physical wellbeing in home and healthcare settings, for example, by guiding users through stretching or rehabilitation routines. However, existing systems remain largely scripted, which limits their ability to adapt to user state, environmental context, and interaction dynamics. In this work, we present StretchBot, a hybrid neuro-symbolic robotic coach for adaptive assistive guidance. The system combines multimodal perception with knowledge-graph-grounded large language model reasoning to support context-aware adjustments during short stretching sessions while maintaining a structured routine. To complement the system description, we report an exploratory pilot comparison between scripted and adaptive guidance with three participants. The pilot findings suggest that the adaptive condition improved perceived adaptability and contextual relevance, while scripted guidance remained competitive in smoothness and predictability. These results provide preliminary evidence that structured actionable knowledge can help ground language-model-based adaptation in embodied assistive interaction, while also highlighting the need for larger, longitudinal studies to evaluate robustness, generalizability, and long-term user experience.

[HC-11] HarassGuard: Detecting Harassment Behaviors in Social Virtual Reality with Vision-Language Models

【速读】:该论文旨在解决社交虚拟现实(Social Virtual Reality, SVR)平台中用户面临在线骚扰行为的安全问题,尤其是现有防护措施多为被动响应、依赖敏感生物特征数据导致隐私风险较高的局限性。其解决方案的关键在于提出HarassGuard系统,该系统基于视觉语言模型(Vision-Language Model, VLM),仅使用视觉输入即可在社交VR场景中识别物理骚扰行为,通过构建经IRB批准的骚扰视觉数据集、应用提示工程(prompt engineering)并微调VLM以结合上下文信息进行检测,在保证高精度(二分类准确率达88.09%,多分类达68.85%)的同时显著减少所需微调样本量(200 vs. 1,115),从而实现更高效且隐私友好的主动式骚扰检测。

链接: https://arxiv.org/abs/2604.00592
作者: Junhee Lee,Minseok Kim,Hwanjo Heo,Seungwon Woo,Jinwoo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: To appear in the 2026 TVCG Special Issue on the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)

点击查看摘要

Abstract:Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.

[HC-12] Not My Truce: Personality Differences in AI-Mediated Workplace Negotiation

【速读】:该论文试图解决的问题是:当前基于人工智能(AI)的对话式辅导在职场谈判支持中被广泛采用,但现有研究普遍假设其对所有用户均具有相同效果,忽视了个体差异对干预成效的影响。解决方案的关键在于引入人格特质作为调节变量,通过实证实验识别不同人格类型(基于大五人格与ARC类型学划分的三类群体:韧性型、过度控制型和不足控制型)对AI辅导工具的差异化响应模式。研究发现,仅针对特定人格类型的用户设计匹配强度的干预策略——如韧性型受益于传统手册,过度控制型从理论驱动型AI中获益显著,而不足控制型则对任何形式的干预反应有限——才能提升辅导的有效性。这表明,个体心理准备状态比阶段适配更重要,为自适应AI辅导系统的设计提供了以“个体 readiness”为导向的新范式。

链接: https://arxiv.org/abs/2604.00464
作者: Veda Duddu,Jash Rajesh Parekh,Andy Mao,Hanyi Min,Ziang Xiao,Vedant Das Swain,Koustuv Saha
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); Johns Hopkins University(约翰霍普金斯大学); New York University(纽约大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:AI-driven conversational coaching is increasingly used to support workplace negotiation, yet prior work assumes uniform effectiveness across users. We challenge this assumption by examining how individual differences, particularly personality traits, moderate coaching outcomes. We conducted a between-subjects experiment (N=267) comparing theory-driven AI (Trucey), general-purpose AI (Control-AI), and a traditional negotiation handbook (Control-NoAI). Participants were clustered into three profiles – resilient, overcontrolled, and undercontrolled – based on the Big-Five personality traits and ARC typology. Resilient workers achieved broad psychological gains primarily from the handbook, overcontrolled workers showed outcome-specific improvements with theory-driven AI, and undercontrolled workers exhibited minimal effects despite engaging with the frameworks. These patterns suggest personality as a predictor of readiness beyond stage-based tailoring: vulnerable users benefit from targeted rather than comprehensive interventions. The study advances understanding of personality-determined intervention prerequisites and highlights design implications for adaptive AI coaching systems that align support intensity with individual readiness, rather than assuming universal effectiveness.
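
文中基于大五人格与 ARC 类型学将参与者划分为三类画像,其思路可用"最近原型"分类做一个极简示意;原型取值纯属本文假设(论文实际使用聚类方法),特质顺序为开放性、尽责性、外向性、宜人性、神经质的 z 分数:

```python
# 假设的 ARC 原型画像(z 分数化的大五特质)
PROTOTYPES = {
    "resilient":       [0.5, 0.8, 0.6, 0.6, -0.8],
    "overcontrolled":  [-0.2, 0.3, -0.8, 0.2, 0.9],
    "undercontrolled": [0.1, -0.9, 0.3, -0.6, 0.5],
}

def classify_profile(traits):
    """按欧氏距离平方把个体分配给最近的原型画像。"""
    def dist(p):
        return sum((t - v) ** 2 for t, v in zip(traits, p))
    return min(PROTOTYPES, key=lambda name: dist(PROTOTYPES[name]))

print(classify_profile([0.4, 0.7, 0.5, 0.5, -0.6]))   # → resilient
print(classify_profile([-0.1, 0.2, -0.7, 0.1, 0.8]))  # → overcontrolled
```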

[HC-13] Sona: Real-Time Multi-Target Sound Attenuation for Noise Sensitivity

【速读】:该论文旨在解决噪声敏感人群在日常声景中因无法有效区分和过滤干扰声音而导致的不适问题。现有工具如主动降噪技术虽能降低整体环境噪声,但会牺牲对周围人声及重要事件的感知能力,影响社交参与和安全意识。解决方案的关键在于提出Sona系统,这是一个基于目标条件神经管道(target-conditioned neural pipeline)的实时移动音频中介系统,能够同时对多个重叠声源进行选择性衰减,而保留期望音频内容;其核心创新包括:支持多目标、低延迟的实时处理能力,以及通过现场音频示例实现用户自定义声类扩展且无需重新训练模型,从而在减少烦扰声音的同时维持对环境的感知能力。

链接: https://arxiv.org/abs/2604.00447
作者: Jeremy Zhengqi Huang,Emani Hicks,Sidharth,Gillian R. Hayes,Dhruv Jain
机构: University of Michigan (密歇根大学); University of California, Irvine (加州大学欧文分校)
类目: Sound (cs.SD); Human-Computer Interaction (cs.HC)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an interactive mobile system for real-time soundscape mediation that selectively attenuates bothersome sounds while preserving desired audio. Sona is built on a target-conditioned neural pipeline that supports simultaneous attenuation of multiple overlapping sound sources, overcoming the single-target limitation of prior systems. It runs in real time on-device and supports user-extensible sound classes through in-situ audio examples, without retraining. Sona is informed by a formative study with 68 noise-sensitive individuals. Through technical benchmarking and an in-situ study with 10 participants, we show that Sona achieves low-latency, multi-target attenuation suitable for live listening, and enables meaningful reductions in bothersome sounds while maintaining awareness of surroundings. These results point toward a new class of personal AI systems that support comfort and social participation by mediating real-world acoustic environments.
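
上文所述“多目标选择性衰减”的核心思想,可以理解为对已分离的各声类分别施加独立增益后再混合输出。以下为一个极简的纯 Python 示意(声类名称、增益取值与函数均为假设,并非 Sona 的实际实现):

```python
# 按声类施加独立增益的多目标衰减示意(假设性示例,非 Sona 官方实现)
def attenuate_mix(sources, gains):
    """sources: {声类: 采样序列}; gains: {声类: 0~1 的保留比例}。
    对每个声类分别施加增益后逐采样求和;未指定增益的声类默认完整保留。"""
    length = max(len(s) for s in sources.values())
    mixed = [0.0] * length
    for name, samples in sources.items():
        g = gains.get(name, 1.0)
        for i, v in enumerate(samples):
            mixed[i] += g * v
    return mixed

# 示例:将"键盘敲击声"衰减到 10%,完整保留人声
sources = {"speech": [0.5, 0.4, 0.3], "keyboard": [0.2, 0.2, 0.2]}
out = attenuate_mix(sources, {"keyboard": 0.1})
```

真实系统中“分离”一步由目标条件神经网络完成,此处仅示意其后的多目标增益混合结构。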

[HC-14] Programming by Chat: A Large-Scale Behavioral Analysis of 11,579 Real-World AI-Assisted IDE Sessions

【速读】:该论文旨在解决当前对集成开发环境(IDE)中生成式 AI 编程助手(AI coding assistants)在真实开发场景下如何影响编程工作组织方式的实证研究不足的问题。现有研究多依赖小规模受控实验或通用聊天机器人,缺乏对代码库感知型 IDE 工作流的深入分析。其解决方案的关键在于开展迄今最大规模的真实世界对话式编程行为分析,基于 1,300 个仓库中 899 名开发者使用 Cursor 和 GitHub Copilot 产生的 11,579 个聊天会话(共 74,998 条开发者消息)进行系统性挖掘,揭示了三大工作模式转变:渐进式规格化(progressive specification)、认知任务向 AI 的再分配(cognitive work redistribution),以及开发者对协作过程的主动管理(collaboration management)。这些发现为理解 AI 辅助编程的本质提供了基础性实证依据,并对下一代编程环境的设计具有重要启示。

链接: https://arxiv.org/abs/2604.00436
作者: Ningzhi Tang,Chaoran Chen,Zihan Fang,Gelei Xu,Maria Dhakal,Yiyu Shi,Collin McMillan,Yu Huang,Toby Jia-Jun Li
机构: University of Notre Dame (圣母大学); Vanderbilt University (范德堡大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:IDE-integrated AI coding assistants, which operate conversationally within developers’ working codebases with access to project context and multi-file editing, are rapidly reshaping software development. However, empirical investigation of this shift remains limited: existing studies largely rely on small-scale, controlled settings or analyze general-purpose chatbots rather than codebase-aware IDE workflows. We present, to the best of our knowledge, the first large-scale study of real-world conversational programming in IDE-native settings, analyzing 74,998 developer messages from 11,579 chat sessions across 1,300 repositories and 899 developers using Cursor and GitHub Copilot. These chats were committed to public repositories as part of routine development, capturing in-the-wild behavior. Our findings reveal three shifts in how programming work is organized: conversational programming operates as progressive specification, with developers iteratively refining outputs rather than specifying complete tasks upfront; developers redistribute cognitive work to AI, delegating diagnosis, comprehension, and validation rather than engaging with code and outputs directly; and developers actively manage the collaboration, externalizing plans into persistent artifacts, and negotiating AI autonomy through context injection and behavioral constraints. These results provide foundational empirical insights into AI-assisted development and offer implications for the design of future programming environments.

[HC-15] Physically-intuitive Privacy and Security: A Design Paradigm for Building User Trust in Smart Sensing Environments

【速读】:该论文旨在解决用户对基于传感器的交互系统(如智能音箱、网络摄像头和RFID标签)中存在的隐私与安全信任缺失问题。尽管存在隐私和安全控制机制,用户仍因担心设备制造商、应用开发者及第三方恶意收集并商业化其个人数据而产生不信任感。解决方案的关键在于提出一种新的设计范式——物理直观的隐私与安全(Physically-Intuitive Privacy and Security, PIPS),其核心是通过提供简单、基于物理世界的概念模型来增强用户对隐私与安全控制的理解和信任。PIPS包含三个原则:(1) 直接物理操作传感器状态;(2) 可感知的传感器状态保证;(3) 与用户意图一致的传感器启停机制。这三个原则在三个案例研究中得到验证,均显著提升了用户信任度。

链接: https://arxiv.org/abs/2604.00312
作者: Youngwook Do,Yuxi Wu,Gregory D. Abowd,Sauvik Das
机构: Georgia Institute of Technology (佐治亚理工学院); Northeastern University (东北大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Sensor-based interactive systems – e.g., “smart” speakers, webcams, and RFID tags – allow us to embed computational functionality into physical environments. They also expose users to real and perceived privacy risks: users know that device manufacturers, app developers, and malicious third parties want to collect and monetize their personal data, which fuels their mistrust of these systems even in the presence of privacy and security controls. We propose a new design paradigm, physically-intuitive privacy and security (PIPS), which aims to improve user trust by designing privacy and security controls that provide users with simple, physics-based conceptual models of their operation. PIPS consists of three principles: (1) direct physical manipulation of sensor state; (2) perceptible assurance of sensor state; and, (3) intent-aligned sensor (de)activation. We illustrate these principles through three case studies – Smart Webcam Cover, Powering for Privacy, and On-demand RFID – each of which has been shown to improve trust relative to existing sensor-based systems.

[HC-16] Play-Testing REMind: Evaluating an Educational Robot-Mediated Role-Play Game

【速读】:该论文旨在解决儿童在校园欺凌情境中缺乏有效旁观者干预能力的问题,尤其是如何提升其自我效能感、共情理解力及防御策略的实践能力。解决方案的关键在于提出一种名为REMind的教育机器人媒介角色扮演游戏(Robot-Mediated Role-Play Game),通过社会机器人演绎欺凌场景,引导儿童反思角色立场,并借助操控机器人化身进行干预策略的模拟演练,从而实现社会情感学习(Social-Emotional Learning, SEL)目标。

链接: https://arxiv.org/abs/2604.00300
作者: Elaheh Sanoubari,Neil Fernandes,Keith Rebello,Alicia Pan,Andrew Houston,Kerstin Dautenhahn
机构: University of Waterloo (滑铁卢大学)
类目: Robotics (cs.RO); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:This paper presents REMind, an innovative educational robot-mediated role-play game designed to support anti-bullying bystander intervention among children. REMind invites players to observe a bullying scenario enacted by social robots, reflect on the perspectives of the characters, and rehearse defending strategies by puppeteering a robotic avatar. We evaluated REMind through a mixed-methods play-testing study with 18 children aged 9–10. The findings suggest that the experience supported key learning goals related to self-efficacy, perspective-taking, understanding outcomes of defending, and intervention strategies. These results highlight the promise of Robot-Mediated Applied Drama (RMAD) as a novel pedagogical framework to support Social-Emotional Learning.

[HC-17] NeuroVase: A Tangible Mobile Augmented Reality Learning System for Neurovascular Anatomy and Stroke Education

【速读】:该论文旨在解决当前卒中相关神经血管解剖学教学中因依赖静态二维图示和印刷材料而导致的学习效率低下问题,这些问题限制了学习者对复杂脑血管系统及其临床意义的深入理解。解决方案的关键在于提出并实现了一个基于平板电脑的增强现实(Augmented Reality, AR)平台NeuroVase,其核心创新包括:一是采用双模式设计,利用实体提示卡作为独立学习工具的同时作为AR内容触发标记;二是构建以脑血管解剖与卒中为核心的教学课程,支持在标注的三维解剖模型中交互式探索血管分布区、卒中综合征及动脉闭塞等关键概念。实证研究表明,该平台显著提升了学习效果和用户友好性,为卒中教育提供了直观、沉浸且可访问的新型教学手段。

链接: https://arxiv.org/abs/2604.00296
作者: Bahar Jahani,Matsanga Leyila Kaseka,Marta Kersten-Oertel,Yiming Xiao
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Stroke remains a leading cause of mortality and disability worldwide, requiring rapid and informed clinical decision-making. A solid spatial understanding of cerebrovascular anatomy and vascular territories in relation to stroke symptoms and severity is critical for timely clinical decision and patient care. However, this knowledge is typically conveyed through static 2D diagrams and printed materials, which can hinder mastery of the complex neurovascular system and their clinical implications. Mobile augmented reality (AR) offers an accessible medium for delivering intuitive 3D anatomical education, yet applications focused on the neurovascular system and stroke remain limited despite the demand. To address this, we propose NeuroVase, a tablet-based mobile AR platform within a structured pedagogical framework that enhances stroke-related neuroanatomy learning by providing an interactive, engaging, and accessible alternative to traditional methods. NeuroVase features a dual-mode setup, using tangible cue cards as standalone study aids while also serving as interactive markers for AR content delivery. A custom learning curriculum focused on cerebrovascular anatomy and stroke supports exploration of vascular territories, stroke syndromes, and arterial occlusions, in the context of annotated 3D anatomical models in NeuroVase. A controlled user study with 40 participants revealed that NeuroVase is an effective and user-friendly AR platform to facilitate complex anatomical and physiological education, compared with traditional learning.

[HC-18] Not Just Duolingo: Supporting Immigrant Language Preservation Through Family-Based Play

【速读】:该论文旨在解决移民群体在异国环境中难以维持母语传承的问题,尤其关注尼泊尔移民在美国面临的社会政治环境对语言保存的挑战。研究发现,尽管移民家庭有强烈动机保留母语,但受限于制度支持不足、时间与资源匮乏以及英语主导的环境,导致亲子间语言能力差距扩大。为此,作者提出一种以“可理解输入理论”(comprehensible input theory)为基础的音频优先、点按交互的语言学习游戏方案,核心设计在于通过亲子共玩模式促进家庭内部的语言互动,从而强化语言传承关系。初步评估显示该游戏具有良好的玩法潜力,但需简化符号密集的用户界面(UI),以提升可用性。

链接: https://arxiv.org/abs/2604.00282
作者: Alejandro Ciuba,Zheng YY Li,Aakash Gautam
机构: University of Pittsburgh(匹兹堡大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI 2026

点击查看摘要

Abstract:For immigrants, language preservation is crucial to maintain their identity, but the process of immigration can put a strain on a community’s ability to do so. We interviewed eight Nepali immigrants to understand barriers to language preservation across sociopolitical contexts in Nepal and immigrant life in the United States. Participants described strong motivation but limited institutional support, time and resource constraints, and English-dominant environments that widen parent-child language gaps. They envisioned technology that supports interactive, family centered learning. In response, we are developing an audio-first, point-and-click language learning game based on the theory of comprehensible input, designed for parent-child co-playing. An early evaluation with four design experts reveals promising gameplay, and the need to simplify symbol-heavy UI. We conclude with implications for designing language technologies that support preservation through relations while acknowledging the limits of design.

[HC-19] Explainable AI for Blind and Low-Vision Users: Navigating Trust Modality and Interpretability in the Agentic Era

【速读】:该论文旨在解决盲人及低视力(Blind and Low-Vision, BLV)用户在使用生成式 AI(Generative AI)驱动的辅助技术时,因缺乏可访问的可解释人工智能(Explainable Artificial Intelligence, XAI)而面临的根本性障碍问题。随着AI系统从单次查询工具演变为具备多步骤行动能力的自主代理(agentic systems),其决策过程复杂度显著提升,BLV用户难以获取及时、有效的解释以识别和纠正错误,进而导致信任缺失与使用受限。论文提出的关键解决方案包括:构建面向BLV用户的多模态接口(multimodal interfaces)、设计具备“责备意识”(blame-aware explanation design)的解释机制,以及推动参与式开发(participatory development)模式,从而确保XAI在复杂任务场景下对BLV用户真正可用、可信且公平。

链接: https://arxiv.org/abs/2604.00187
作者: Abu Noman Md Sakib,Protik Dey,Zijie Zhang,Taslima Akter
机构: University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Human-centered Explainable AI Workshop (HCXAI) @ CHI 2026, Barcelona, Spain, 2026

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) is critical for ensuring trust and accountability, yet its development remains predominantly visual. For blind and low-vision (BLV) users, the lack of accessible explanations creates a fundamental barrier to the independent use of AI-driven assistive technologies. This problem intensifies as AI systems shift from single-query tools into autonomous agents that take multi-step actions and make consequential decisions across extended task horizons, where a single undetected error can propagate irreversibly before any feedback is available. This paper investigates the unique XAI requirements of the BLV community through a comprehensive analysis of user interviews and contemporary research. By examining usage patterns across environmental perception and decision support, we identify a significant modality gap. Empirical evidence suggests that while BLV users highly value conversational explanations, they frequently experience “self-blame” for AI failures. The paper concludes with a research agenda for accessible Explainable AI in agentic systems, advocating for multimodal interfaces, blame-aware explanation design, and participatory development.

[HC-20] Practice Less, Explain More: LLM-Supported Self-Explanation Improves Explanation Quality on Transfer Problems in Calculus

【速读】:该论文旨在解决如何在有限的学习时间内提升学生在微积分学习中对迁移问题(transfer problems)的解释质量,尤其是在“信息不足”(Not Enough Information, NEI)类型的问题上。其核心问题是传统自解释策略(self-explanation)在时间受限条件下是否能有效促进深层理解与迁移能力,以及大语言模型(Large Language Model, LLM)生成的反馈能否增强这一过程。解决方案的关键在于引入LLM支持的开放式自解释(open-ended self-explanation with LLM-generated feedback),相较于无自解释和菜单式自解释条件,该方法虽使学习者完成的练习题量减少,却显著提升了NEI类迁移问题的解释质量,表明LLM能够通过即时、个性化的反馈优化自解释过程,从而更高效地促进高阶认知加工。

链接: https://arxiv.org/abs/2604.00142
作者: Eason Chen,Xinyi Tang,Yvonne Zhao,Meiyi Chen,Meryam Elmir,Elizabeth McLaughlin,Mingyu Yuan,Yumo Wang,Shyam Agarwal,Jared Cochrane,Jionghao Lin,Tongshuang Wu,Ken Koedinger
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 9 pages, 2 figures. Accepted at AIED 2026. Camera-ready version with updated references

点击查看摘要

Abstract:We conducted a between-subjects experiment (N=92) comparing three conditions in a calculus learning environment: no self-explanation (control), menu-based self-explanation, and open-ended self-explanation with LLM-generated feedback. All conditions showed positive learning gains within a fixed 60-minute practice session, with no significant between-condition differences in post-test performance. On transfer questions, the open-ended condition produced significantly higher-quality explanations than control on “Not Enough Information” (NEI) problems (β = +11.9 percentage points, p = .030), though the corresponding NEI multiple-choice accuracy advantage was not significant (p = .183). Moreover, across all post-test open-ended explanations, the open-ended condition showed a marginally significant advantage (β = +7.3%, p = .057). These findings suggest that LLM-supported open-ended self-explanation can improve explanation quality on NEI transfer problems, with weaker evidence across broader transfer explanation measures. Notably, these effects emerged even though learners in the open-ended condition completed substantially fewer practice problems within the same practice time.

[HC-21] Disentangling Prompt Element Level Risk Factors for Hallucinations and Omissions in Mental Health LLM Responses

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在心理健康问答场景中对高压力、叙事性求助信息的评估不足问题,尤其是忽视了临床安全关键内容的遗漏与幻觉现象。其解决方案的关键在于提出UTCO(User, Topic, Context, Tone)提示构建框架,将用户提问系统性地拆解为四个可控制变量,从而生成2,075个结构化测试用例,实现对LLM响应中幻觉(hallucinations)和遗漏(omissions)的量化分析。研究发现,遗漏问题在危机和自杀意念类提示中尤为集中,且上下文与语气因素是导致失败的核心变量,支持将遗漏作为首要安全指标,并推动从静态基准集向动态、可控的系统性压力测试演进。

链接: https://arxiv.org/abs/2604.00014
作者: Congning Ni,Sarvech Qadir,Bryan Steitz,Mihir Sachin Vaidya,Qingyuan Song,Lantian Xia,Shelagh Mulvaney,Siru Liu,Hyeyoung Ryu,Leah Hecht,Amy Bucher,Christopher Symons,Laurie Novak,Susannah L. Rose,Murat Kantarcioglu,Bradley Malin,Zhijun Yin
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Submitted to AMIA 2026 Annual Symposium (under review)

点击查看摘要

Abstract:Mental health concerns are often expressed outside clinical settings, including in high-distress help seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.
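
UTCO 框架将一条提问拆解为 User、Topic、Context、Tone 四个可控制要素,通过组合枚举即可系统性地生成测试提示。以下为该组合式构造思路的极简示意(各要素的具体取值均为假设示例,并非论文原始清单;真实提示也远比此模板复杂):

```python
import itertools

# UTCO 四要素(User/Topic/Context/Tone)组合生成测试提示的示意
users = ["new employee", "mid-career manager"]
topics = ["sleep problems", "work stress"]
contexts = ["after a recent job change", "during a busy season"]
tones = ["calm", "distressed"]

# 笛卡尔积枚举全部组合,每个组合实例化为一条提示
prompts = [
    f"As a {u}, I am asking about {t} {c}. I feel {tone}."
    for u, t, c, tone in itertools.product(users, topics, contexts, tones)
]
```

按此思路,若每个要素取值更多,即可像论文那样规模化地生成上千条结构受控的压力测试提示。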

计算机视觉

[CV-0] HippoCamp: Benchmarking Contextual Agents on Personal Computers

【速读】:该论文旨在解决当前智能代理(Agent)在真实用户场景下进行多模态文件管理能力评估的缺失问题,尤其是针对个人文件系统中复杂上下文感知推理、跨模态信息整合与长期检索等挑战。其解决方案的关键在于构建了一个名为HippoCamp的新基准测试平台,该平台基于真实用户的设备级文件系统(涵盖42.4 GB数据和2000+个真实文件),并设计了581组问答对用于评估搜索、证据感知及多步推理能力;同时提供46.1K条结构化轨迹以支持细粒度失败诊断,从而揭示当前多模态大语言模型(MLLMs)和代理方法在用户画像建模与跨模态理解方面的瓶颈,推动下一代个性化AI助手的发展。

链接: https://arxiv.org/abs/2604.01221
作者: Zhe Yang,Shulin Tian,Kairui Hu,Shuai Liu,Hoang-Nhat Nguyen,Yichi Zhang,Zujin Guo,Mengying Yu,Zinan Zhang,Jingkang Yang,Chen Change Loy,Ziwei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We present HippoCamp, a new benchmark designed to evaluate agents’ capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents’ capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

[CV-1] LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)

【速读】:该论文旨在解决从稀疏时空观测中重建或预测完整时空动力学的问题,这在复杂系统研究中具有重要意义,尤其当测量在空间上不完整且时间窗口狭窄时。其解决方案的关键在于提出一种模块化架构LAPIS-SHRED(LAtent Phase Inference from Short time sequence using SHallow REcurrent Decoders),该架构包含三个阶段:首先,使用仿真数据预训练SHRED模型将传感器时间序列映射到结构化的潜在空间;其次,训练一个时间序列模型学习在潜在空间中向前或向后传播状态以填补未观测的时间区域;最后,在部署阶段仅需提供真实系统中的短时间窗口超稀疏传感器数据,结合冻结的SHRED模型与时间模型即可重构或预测完整的时空轨迹。此方法支持双向推理,并具备数据同化和多尺度重建能力,适用于观测受限的运行场景。

链接: https://arxiv.org/abs/2604.01216
作者: Yuxuan Bao,Xingyue Zhang,J. Nathan Kutz
机构: University of Washington (华盛顿大学); Autodesk Research (Autodesk 研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and can be also limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequence using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatiotemporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.
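
LAPIS-SHRED 的三阶段部署流程(编码短观测窗 → 在潜在空间中传播 → 解码回全场)可用如下结构性草图表示。注意:各映射此处用简单占位函数代替真实的 SHRED 网络与时间序列模型,数据与衰减率均为假设,仅示意数据流向:

```python
# LAPIS-SHRED 三阶段部署流程的结构示意(占位函数,非论文真实模型)
def encode(window):
    # 阶段一:SHRED 将短时间窗的传感器序列映射为潜在状态(此处简化为均值)
    return sum(window) / len(window)

def propagate(z, steps, rate=0.9):
    # 阶段二:时间序列模型在潜在空间中向前传播,填补未观测的时间区域
    traj = [z]
    for _ in range(steps):
        traj.append(traj[-1] * rate)
    return traj

def decode(z, n_points=4):
    # 阶段三:由潜在状态重建完整空间场(此处简化为常值场)
    return [z] * n_points

window = [1.0, 1.2, 0.8]                 # 真实系统中的短观测窗(假设数据)
latent_traj = propagate(encode(window), steps=2)
fields = [decode(z) for z in latent_traj]
```

论文中 encode 由仿真数据预训练、部署时冻结,propagate 还支持反向(向过去)推理;此草图仅保留前向数据流的骨架。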

[CV-2] TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking

【速读】:该论文旨在解决现有3D场景编辑方法在实现细粒度、局部化操作(如部件级姿态调整或组件替换)时,难以保持中心主体结构完整性的难题。当前主流方法多依赖隐式几何信息,缺乏对三维空间的显式约束,导致编辑结果易出现形变或不一致。解决方案的关键在于提出TRACE框架,其核心创新包括:(1)基于多视角一致性数据集MV-TRACE训练的多视图3D锚点合成模块,生成空间一致的3D引导信号;(2)通过两阶段注册实现插入网格与3D高斯表示(3DGS)场景之间的精确空间对齐(Tangible Geometry Anchoring, TGA);(3)结合上下文视频掩码机制(Contextual Video Masking, CVM),将3D投影融入自回归视频生成流程,从而实现时序稳定且物理合理的渲染。该方案显著提升了编辑的灵活性与结构保真度。

链接: https://arxiv.org/abs/2604.01207
作者: Jiyuan Hu,Zechuan Zhang,Zongxin Yang,Yi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE uniquely enables fine-grained, part-level manipulation – such as local pose shifting or component replacement – while preserving the structural integrity of the central subject, a capability largely absent in existing editing methods. Our approach comprises three key stages: (1) Multi-view 3D-Anchor Synthesis, which leverages a sparse-view editor trained on our MV-TRACE dataset – the first multi-view consistent dataset dedicated to scene-coherent object addition and modification – to generate spatially consistent 3D-anchors; (2) Tangible Geometry Anchoring (TGA), which ensures precise spatial synchronization between inserted meshes and the 3DGS scene via two-phase registration; and (3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline to achieve temporally stable, physically-grounded rendering. Extensive experiments demonstrate that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.

[CV-3] Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

【速读】:该论文旨在解决基于基础图形(primitive-based)方法(如3D高斯溅射)在建模高频细节时表达能力受限的问题,同时兼顾计算效率与重建质量。其解决方案的关键在于提出神经谐波纹理(Neural Harmonic Textures),即在每个基础图形周围设置一个虚拟支架(virtual scaffold),并锚定潜在特征向量;在光线与图形相交点处对这些特征进行插值,并通过周期性激活函数处理,将传统的alpha混合转化为谐波分量的加权求和;最终通过一个小型神经网络在单次延迟传递中解码该信号,显著降低计算开销,从而实现实时新视角合成中的最先进性能,并弥合了基于基础图形与神经场(neural field)方法之间的差距。

链接: https://arxiv.org/abs/2604.01204
作者: Jorge Condor,Nicolas Moenne-Loccoz,Merlin Nimier-David,Piotr Didyk,Zan Gojcic,Qi Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Primitive-based methods such as 3D Gaussian Splatting have recently become the state-of-the-art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.
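
文中“alpha 混合变为谐波分量加权求和”的核心可以写成 s(x) = Σᵢ wᵢ·sin(fᵢx + φᵢ)。以下用一个数值示意说明该求和形式(权重、频率与相位均为假设取值,真实方法中它们由插值得到的潜在特征经周期激活产生,并由小型网络进一步解码):

```python
import math

# 谐波分量加权求和的数值示意:s(x) = sum_i w_i * sin(f_i * x + phi_i)
def harmonic_blend(x, weights, freqs, phases):
    return sum(w * math.sin(f * x + p) for w, f, p in zip(weights, freqs, phases))

# 两个谐波分量:在 x=0 处,第一项 sin(pi/2)=1 贡献 0.5,第二项 sin(0)=0 无贡献
val = harmonic_blend(0.0, [0.5, 0.3], [1.0, 2.0], [math.pi / 2, 0.0])
```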

[CV-4] A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

【速读】:该论文旨在解决基础视觉语言模型(Foundation Vision-Language Models)在机器人软件栈中实际部署的可复现性问题,即如何将高性能但复杂的模型有效集成到机器人操作系统(ROS 2)中以支持多样化感知任务。解决方案的关键在于开发了一个针对Florence-2模型的ROS 2封装器(wrapper),通过三种互补的交互模式——持续话题驱动处理、同步服务调用和异步动作——实现灵活且高效的集成;同时支持本地执行与Docker容器化部署,并提供通用JSON输出与标准ROS 2消息绑定,从而在消费级硬件上实现了高吞吐量的实时推理能力。

链接: https://arxiv.org/abs/2604.01179
作者: J. E. Domínguez-Vidal
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer grade hardware. The repository is publicly available here: this https URL
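
该封装器的三种交互模式(持续话题驱动、同步服务、异步动作)可以用如下调度结构示意。注意:此处为不依赖 rclpy 的纯 Python 占位草图,类名、方法名与推理函数均为假设,仅示意三种模式的调用形态,并非该仓库的真实接口:

```python
import threading

def florence_infer(task, image):
    # 占位推理函数:真实封装器中此处调用本地部署的 Florence-2 模型(假设)
    return f"{task}:{image}"

class FlorenceWrapper:
    """三种交互模式的结构示意:话题回调 / 同步服务 / 异步动作。"""
    def __init__(self):
        self.latest = None                      # 话题模式下持续更新的最新结果

    def on_image_topic(self, image):            # 模式一:话题驱动的持续处理
        self.latest = florence_infer("caption", image)

    def call_service(self, task, image):        # 模式二:同步服务调用,阻塞返回结果
        return florence_infer(task, image)

    def start_action(self, task, image):        # 模式三:异步动作,后台线程执行
        result = {}
        t = threading.Thread(target=lambda: result.update(out=florence_infer(task, image)))
        t.start()
        return t, result

w = FlorenceWrapper()
w.on_image_topic("frame_001")                   # 话题回调
svc = w.call_service("ocr", "frame_002")        # 同步服务
t, res = w.start_action("detect", "frame_003")  # 异步动作
t.join()
```

真实 ROS 2 中三种模式分别对应 topic 订阅回调、service 请求-响应和 action goal/feedback/result 机制。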

[CV-5] Open-Set Supervised 3D Anomaly Detection: An Industrial Dataset and a Generalisable Framework for Unknown Defects

【速读】:该论文旨在解决工业场景中3D异常检测的开放集监督问题,即模型仅用正常样本和少量已知异常样本进行训练,却需在测试时识别未知异常。其关键解决方案是提出Open3D-AD方法,该方法通过融合正常样本、模拟异常与部分观测的真实异常,构建正常与异常数据的概率密度分布模型,并引入简单的对应分布子采样(Correspondence Distributions Subsampling)策略以降低正常与非正常分布间的重叠,从而增强双分布建模能力,显著提升对未知异常的识别性能。

链接: https://arxiv.org/abs/2604.01171
作者: Hanzhe Liang,Luocheng Zhang,Junyang Xia,HanLiang Zhou,Bingyang Guo,Yingxi Xie,Can Gao,Ruiyun Yu,Jinbao Wang,Pan Li
机构: University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学); Zhejiang University (浙江大学); Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学); Chinese Academy of Sciences (中国科学院); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Resources: this https URL

点击查看摘要

Abstract:Although self-supervised 3D anomaly detection assumes that acquiring high-precision point clouds is computationally expensive, in real manufacturing scenarios it is often feasible to collect a limited number of anomalous samples. Therefore, we study open-set supervised 3D anomaly detection, where the model is trained with only normal samples and a small number of known anomalous samples, aiming to identify unknown anomalies at test time. We present Open-Industry, a high-quality industrial dataset containing 15 categories, each with five real anomaly types collected from production lines. We first adapt general open-set anomaly detection methods to accommodate 3D point cloud inputs better. Building upon this, we propose Open3D-AD, a point-cloud-oriented approach that leverages normal samples, simulated anomalies, and partially observed real anomalies to model the probability density distributions of normal and anomalous data. Then, we introduce a simple Correspondence Distributions Subsampling to reduce the overlap between normal and non-normal distributions, enabling stronger dual distributions modeling. Based on these contributions, we establish a comprehensive benchmark and evaluate the proposed method extensively on Open-Industry as well as established datasets including Real3D-AD and Anomaly-ShapeNet. Benchmark results and ablation studies demonstrate the effectiveness of Open3D-AD and further reveal the potential of open-set supervised 3D anomaly detection.
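
文中“对应分布子采样”(Correspondence Distributions Subsampling)的目标是降低正常与非正常分布的重叠。下面用一维数据给出该思想的极简示意:剔除与异常样本距离过近的正常样本(阈值与数据均为假设,并非论文原始算法):

```python
# "对应分布子采样"思想的一维示意:剔除与异常样本过近的正常样本以降低分布重叠
# (阈值 min_dist 与数据均为假设,非论文原始算法)
def subsample_normals(normals, anomalies, min_dist=0.5):
    kept = []
    for n in normals:
        # 仅保留与所有异常样本的距离都不小于阈值的正常样本
        if all(abs(n - a) >= min_dist for a in anomalies):
            kept.append(n)
    return kept

normals = [0.0, 0.4, 1.0, 2.0]
anomalies = [0.5]
kept = subsample_normals(normals, anomalies)   # 0.4 距异常样本仅 0.1,被剔除
```

论文中的样本是点云特征而非标量,但“压缩两分布重叠区”的几何直觉与此一致。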

[CV-6] Looking into a Pixel by Nonlinear Unmixing – A Generative Approach

【速读】:该论文旨在解决高光谱非线性混叠(Hyperspectral Nonlinear Unmixing, HNU)问题,即在缺乏显式混合模型先验知识的情况下实现高精度的端元提取与丰度估计。传统方法依赖于特定的混合模型假设,限制了其性能和泛化能力。解决方案的关键在于提出一种基于双向生成对抗网络(Bi-directional GAN)的可逆混合-解混流程,通过引入循环一致性(cycle consistency)和线性与非线性混合之间的关联约束,无需显式建模混合机制即可实现稳定且高效的解混。该方法被称为线性约束循环GAN解混网络(Linearly-constrained CycleGAN Unmixing net, LCGU net),实验表明其在多个数据集上均优于现有基于模型的HNU方法。

链接: https://arxiv.org/abs/2604.01141
作者: Maofeng Tang,Hairong Qi
机构: University of Tennessee, Knoxville (田纳西大学诺克斯维尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Due to the large footprint of pixels in remote sensing imagery, hyperspectral unmixing (HU) has become an important and necessary procedure in hyperspectral image analysis. Traditional HU methods rely on a prior spectral mixing model, especially for nonlinear mixtures, which has largely limited the performance and generalization capacity of the unmixing approach. In this paper, we address the challenging problem of hyperspectral nonlinear unmixing (HNU) without explicit knowledge of the mixing model. Inspired by the principle of generative models, where images of the same distribution can be generated as that of the training images without knowing the exact probability distribution function of the image, we develop an invertible mixing-unmixing process via a bi-directional GAN framework, constrained by both the cycle consistency and the linkage between linear and nonlinear mixtures. The combination of cycle consistency and linear linkage provides powerful constraints without requiring an explicit mixing model. We refer to the proposed approach as the linearly-constrained CycleGAN unmixing net, or LCGU net. Experimental results indicate that the proposed LCGU net exhibits stable and competitive performance across different datasets compared with other state-of-the-art model-based HNU methods.
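
上文的循环一致性(cycle consistency)约束要求“混合-解混”两个方向互为近似逆映射,即 G(F(x)) ≈ x。以下用一对互逆的线性占位函数代替真实的双向 GAN,给出该损失项的数值示意(并非 LCGU net 的实际网络):

```python
# 循环一致性损失的数值示意:L_cyc = mean |G(F(x)) - x|
# F、G 用互逆的线性占位函数代替真实的"解混/混合"网络(假设)
def F(x):
    # 模拟"解混"方向
    return 2.0 * x + 1.0

def G(y):
    # 模拟"混合"方向,恰为 F 的逆映射
    return (y - 1.0) / 2.0

def cycle_loss(xs):
    return sum(abs(G(F(x)) - x) for x in xs) / len(xs)

# F、G 互逆时循环损失趋近于零;训练中正是通过最小化该项迫使两方向互逆
loss = cycle_loss([0.1, 0.5, 0.9])
```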

[CV-7] Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling

【速读】: This paper tackles the limitations of traditional darts coaching, which relies on experience and visual observation and struggles to optimize high-precision, goal-oriented movements; existing quantitative methods are mostly confined to local variables or static template matching, offer limited support for personalized training, and overlook movement variability. The key to the solution is a closed-loop, data-driven training assistance system spanning motion capture, feature modeling, and personalized feedback: data are collected markerlessly with a Kinect 2.0 depth sensor and an optical camera, and 18 biomechanical features are extracted covering three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two core modules are developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum-jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic, shifting evaluation from deviation against a uniform standard to deviation from an individual's optimal control range and thereby improving personalization and interpretability.

链接: https://arxiv.org/abs/2604.01130
作者: Zhantao Chen, Dongyi He, Jin Fang, Xi Chen, Yisuo Liu, Xiaozhen Zhong, Xuejun Hu
机构: Qinghai University; Zhongshan Xiaolan Senior High School; Chongqing University of Technology; The Hong Kong Polytechnic University; University of Macau
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual’s optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
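
The z-score-based deviation diagnosis mentioned above can be sketched in a few lines. This is not the paper's implementation; the feature names, baseline statistics, and the |z| > 2 threshold are hypothetical choices for illustration.

```python
import numpy as np

def diagnose_throw(features, baseline_mean, baseline_std, z_thresh=2.0):
    """Flag kinematic features that deviate from the athlete's own
    historical high-quality throws by more than z_thresh std devs."""
    z = (features - baseline_mean) / baseline_std
    return {name: round(float(zi), 2)
            for name, zi in zip(feature_names, z) if abs(zi) > z_thresh}

feature_names = ["release_velocity", "elbow_angle", "trunk_sway"]
baseline_mean = np.array([5.8, 95.0, 1.2])  # per-athlete baseline (toy numbers)
baseline_std = np.array([0.3, 4.0, 0.4])

throw = np.array([5.9, 110.0, 1.3])  # elbow angle is clearly off
flags = diagnose_throw(throw, baseline_mean, baseline_std)
print(flags)  # only elbow_angle exceeds the threshold
```

Because the baseline is computed from the athlete's own best throws, the same raw angle can be "normal" for one player and "deviant" for another, which is the personalization the paper argues for.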

[CV-8] ReinDriveGen: Reinforcement Post-Training for Out-of-Distribution Driving Scene Generation

【速读】: This paper addresses controllability and realism in trajectory editing for dynamic driving scenes, particularly for simulating safety-critical scenarios such as front-vehicle collisions, drifting cars, and jaywalking pedestrians. Existing methods struggle to support free editing without breaking scene consistency, and generation quality degrades under out-of-distribution conditions. The key to the solution is the ReinDriveGen framework: a dynamic 3D point cloud scene is built from multi-frame LiDAR data, with a vehicle completion module reconstructing full 360° geometry; the edited scene is rendered into 2D condition images that guide a video diffusion model to synthesize realistic driving videos; and an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism substantially improves generation quality on out-of-distribution scenarios without ground-truth supervision, yielding highly controllable and robust driving-scene editing.

链接: https://arxiv.org/abs/2604.01129
作者: Hao Zhang, Lue Fan, Weikang Bian, Zehuan Wu, Lewei Lu, Zhaoxiang Zhang, Hongsheng Li
机构: 1: Alibaba Group; 2: Tsinghua University; 3: Shanghai Jiao Tong University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present ReinDriveGen, a framework that enables full controllability over dynamic driving scenes, allowing users to freely edit actor trajectories to simulate safety-critical corner cases such as front-vehicle collisions, drifting cars, vehicles spinning out of control, pedestrians jaywalking, and cyclists cutting across lanes. Our approach constructs a dynamic 3D point cloud scene from multi-frame LiDAR data, introduces a vehicle completion module to reconstruct full 360° geometry from partial observations, and renders the edited scene into 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Since such edited scenarios inevitably fall outside the training distribution, we further propose an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism, enabling robust quality improvement under out-of-distribution conditions without ground-truth supervision. Extensive experiments demonstrate that ReinDriveGen outperforms existing approaches on edited driving scenarios and achieves state-of-the-art results on novel ego viewpoint synthesis.

[CV-9] Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

【速读】: This paper addresses the problem that adapting vision-language models (VLMs) such as CLIP to monocular depth estimation usually requires extensive fine-tuning and still lacks geometric precision. The key to the solution is a parameter-efficient framework, MoA-DepthCLIP, whose core is a lightweight Mixture-of-Adapters (MoA) module inserted into the pretrained Vision Transformer (ViT-B/32) backbone, combined with selective fine-tuning of the final layers, enabling spatially-aware adaptation guided by a global semantic context vector; a hybrid prediction architecture couples depth-bin classification with direct regression, and a composite loss enforces geometric constraints, yielding markedly better depth accuracy with very few trainable parameters.

链接: https://arxiv.org/abs/2604.01118
作者: Reyhaneh Ahani Manghotay (Simon Fraser University, Burnaby, Canada), Jie Liang (Eastern Institute of Technology, Ningbo, China)
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 2 figures

点击查看摘要

Abstract:Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the \delta_1 accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
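
The hybrid head that combines depth-bin classification with direct regression can be sketched as an expectation over bin centers plus a regressed residual. This is not the paper's code; the bin centers, logits, and residual below are toy values chosen for illustration.

```python
import numpy as np

def hybrid_depth(bin_logits, bin_centers, residual):
    """Expected depth from a softmax over depth bins, plus a
    directly regressed residual correction."""
    p = np.exp(bin_logits - bin_logits.max())  # stable softmax
    p /= p.sum()
    return float(p @ bin_centers + residual)

bin_centers = np.array([1.0, 2.0, 4.0, 8.0])  # metres (toy bins)
logits = np.array([0.0, 5.0, 0.0, 0.0])       # confident in the 2 m bin
depth = hybrid_depth(logits, bin_centers, residual=0.1)
print(depth)  # close to 2 m, nudged by the residual
```

The classification branch gives a coarse but well-calibrated estimate, and the regression residual recovers sub-bin precision, which is the rationale such hybrid heads usually cite.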

[CV-10] ProTPS: Prototype-Guided Text Prompt Selection for Continual Learning

【速读】: This paper targets catastrophic forgetting in continual learning caused by semantic overlap between newly arrived and previously trained classes, focusing on the difficulty text-prompt-based methods have in learning unique prompts that encode class-specific semantics. The key to the solution is Prototype-guided Text Prompt Selection (ProTPS), which learns a vision prototype per class to guide the selection and learning of text prompts, increasing prompt distinctiveness and training flexibility; this avoids semantic-feature confusion and improves performance under class-incremental (CI), cross-datasets continual (CDC), and class-and-domain-incremental (CDI) settings.

链接: https://arxiv.org/abs/2604.01116
作者: Jie Mei, Li-Leng Peng, Keith Fuller, Jenq-Neng Hwang
机构: University of Washington; Alaska Pacific University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:For continual learning, text-prompt-based methods leverage text encoders and learnable prompts to encode semantic features for sequentially arrived classes over time. A common challenge encountered by existing works is how to learn unique text prompts, which implicitly carry semantic information of new classes, so that the semantic features of newly arrived classes do not overlap with those of trained classes, thereby mitigating the catastrophic forgetting problem. To address this challenge, we propose a novel approach, "Prototype-guided Text Prompt Selection (ProTPS)", to intentionally increase the training flexibility, thus encouraging the learning of unique text prompts. Specifically, our ProTPS learns class-specific vision prototypes and text prompts. Vision prototypes guide the selection and learning of text prompts for each class. We first evaluate our ProTPS in both class incremental (CI) setting and cross-datasets continual (CDC) learning setting. Because our ProTPS achieves performance close to the upper bounds, we further collect a real-world dataset with 112 marine species collected over a span of six years, named Marine112, to bring new challenges to the community. Marine112 is authentically suited for the class and domain incremental (CDI) learning setting and is under natural long-tail distribution. The results under three settings show that our ProTPS performs favorably against the recent state-of-the-art methods. The implementation code and Marine112 dataset will be released upon the acceptance of our paper.
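
Prototype-guided selection of a text prompt can be reduced to a cosine-similarity argmax between the class's vision prototype and candidate prompt embeddings. This is an illustrative sketch, not the paper's method; the 3-dimensional embeddings below are toy values.

```python
import numpy as np

def select_prompt(vision_prototype, text_prompt_embs):
    """Pick the text prompt whose embedding is most cosine-similar
    to the class's learned vision prototype."""
    v = vision_prototype / np.linalg.norm(vision_prototype)
    t = text_prompt_embs / np.linalg.norm(text_prompt_embs, axis=1, keepdims=True)
    return int(np.argmax(t @ v))

proto = np.array([1.0, 0.0, 0.0])       # vision prototype for one class
prompts = np.array([[0.1, 0.9, 0.0],    # candidate text prompt embeddings
                    [0.8, 0.1, 0.1],
                    [0.0, 0.0, 1.0]])
print(select_prompt(proto, prompts))  # index 1 aligns best with the prototype
```

In the full method the chosen prompt is also trained, so selection and learning reinforce each other; the sketch only covers the selection step.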

[CV-11] TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

【速读】: This paper addresses the detection of partial audio deepfakes, where synthesized segments are spliced into genuine recordings so that most of the audio remains authentic and evades existing detectors. Conventional detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained for new generative models, limiting generalization. The key to the solution is TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework built on the hypothesis that genuine speech forms smooth, slowly varying embedding trajectories in a frozen speech foundation model, while splice boundaries introduce abrupt frame-level transitions; detection therefore reduces to analyzing the first-order dynamics of these embeddings, requiring no training data or architectural modification and generalizing strongly.

链接: https://arxiv.org/abs/2604.01083
作者: Awais Khan, Muhammad Umar Farooq, Kutub Uddin, Khalid Malik
机构: University of Michigan-Flint
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. On PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
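
The core "first-order embedding dynamics" signal is simple to illustrate: score each frame transition by the norm of the difference between consecutive embeddings, and a splice shows up as a spike. This toy sketch uses a synthetic random-walk trajectory instead of real speech-foundation-model features.

```python
import numpy as np

def splice_scores(embeddings):
    """Score each frame transition by the norm of the first-order
    embedding difference; a splice appears as an abrupt spike."""
    deltas = np.diff(embeddings, axis=0)
    return np.linalg.norm(deltas, axis=1)

rng = np.random.default_rng(0)
# Smooth "genuine speech" trajectory: small random-walk steps in 8-D.
genuine = np.cumsum(0.01 * rng.standard_normal((50, 8)), axis=0)
spliced = genuine.copy()
spliced[30:] += 5.0  # abrupt offset = synthetic segment spliced in at frame 30

scores = splice_scores(spliced)
print(int(np.argmax(scores)))  # the 29 -> 30 transition has the largest jump
```

Because the scoring uses only frozen representations and a difference operator, no labels or retraining are needed, which is the "training-free" property the abstract emphasizes.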

[CV-12] ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data CVPR2026

【速读】: This paper targets real-time interaction-to-reaction generation: producing the ego agent's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. The core challenges are that (i) interaction data are sparse and fragmented across heterogeneous single-person, human-human, and human-scene domains, and (ii) continuous online interaction demands a balance between low latency and high-fidelity motion. The key to the solution is ReMoGen (Reaction Motion Generation), a modular learning framework that adapts a universal motion prior, pretrained on large-scale single-person motion data, to target interaction domains via independently trained Meta-Interaction modules, achieving robust generalization under data scarcity and heterogeneous supervision; a lightweight frame-wise segment refinement module incorporates newly observed frame-level cues on top of segment-level generation, improving responsiveness and temporal coherence without expensive full-sequence inference.

链接: https://arxiv.org/abs/2604.01082
作者: Yaoqin Ye, Yiteng Xu, Qin Sun, Xinge Zhu, Yujing Sun, Yuexin Ma
机构: ShanghaiTech University; Guangzhou Institute of Energy Conversion, CAS; University of Science and Technology of China; The Chinese University of Hong Kong; Nanyang Technological University
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: accepted by CVPR 2026, project page: this https URL

点击查看摘要

Abstract:Human behaviors in real-world environments are inherently interactive, with an individual’s motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego’s future motion from dynamic multi-source cues, including others’ actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.

[CV-13] ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction CVPR2026

【速读】: This paper addresses long-tailed class bias and overconfidence on out-of-distribution (OOD) inputs in 3D semantic occupancy prediction, where models often misassign anomalies to rare classes. The key to the solution is ProOOD, a lightweight plug-and-play method with three components: (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features; (ii) prototype-guided tail mining that strengthens rare-class representations to keep OOD samples from being absorbed into them; and (iii) an EchoOOD module that fuses local logit coherence with local and global prototype matching for reliable voxel-level OOD scoring. Across multiple datasets it improves both in-distribution occupancy prediction and OOD detection, with large mIoU gains on SemanticKITTI and AuPRCr gains on VAA-KITTI, yielding better-calibrated, more robust predictions for safety-critical autonomous driving.

链接: https://arxiv.org/abs/2604.01081
作者: Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Di Wen, Danda Pani Paudel, Luc Van Gool, Kailun Yang
机构: Hunan University; Karlsruhe Institute of Technology; INSAIT, Sofia University "St. Kliment Ohridski"
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Accepted to CVPR 2026. The source code is publicly available at this https URL

点击查看摘要

Abstract:3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at this https URL.
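
One ingredient of the OOD scoring above, matching a voxel feature against class prototypes, can be sketched as distance-to-nearest-prototype. This is a simplification for illustration (EchoOOD additionally fuses logit coherence and global matching); the 2-D prototypes below are toy values.

```python
import numpy as np

def ood_score(feature, prototypes):
    """OOD score as distance to the nearest class prototype:
    in-distribution voxels sit close to some prototype, anomalies do not."""
    dists = np.linalg.norm(prototypes - feature, axis=1)
    return float(dists.min())

prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy class prototypes
in_dist = np.array([0.9, 0.1])
anomaly = np.array([5.0, 5.0])
print(ood_score(in_dist, prototypes), ood_score(anomaly, prototypes))
```

Thresholding this score separates anomalies from known classes without ever training on OOD examples, which is why prototype matching is attractive as a training-free scoring component.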

[CV-14] PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement

【速读】: This paper addresses inconsistent enhancement patterns and incorrect details in virtual contrast enhancement (VCE) caused by anatomical heterogeneity and spatial misalignment, aiming for more accurate and reliable synthesis of contrast-enhanced CT (CECT) from non-contrast CT (NCCT). The key to the solution is the PHASOR framework, a volumetric diffusion model that treats CT volumes as coherent sequences to improve structural coherence and volumetric accuracy, together with two complementary modules: an Anatomy-Routed Mixture-of-Experts (AR-MoE) that anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory for salient details, and an Intensity-Phase Aware Representation Alignment (IP-REPA) module that highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment.

链接: https://arxiv.org/abs/2604.01053
作者: Zilong Li, Dongyang Li, Chenglong Ma, Zhan Feng, Dakai Jin, Junping Zhang, Hao Luo, Fan Wang, Hongming Shan
机构: Fudan University; Shanghai Center for Brain Science and Brain-inspired Technology; Zhejiang University; Alibaba DAMO Academy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative to synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment, leading to inconsistent enhancement patterns and incorrect details. This paper introduces PHASOR, a volumetric diffusion framework for high-fidelity CT VCE. By treating CT volumes as coherent sequences, we leverage a video diffusion model to enhance structural coherence and volumetric accuracy. To ensure anatomy-phase consistent synthesis, we introduce two complementary modules. First, anatomy-routed mixture-of-experts (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory to capture salient details. Second, intensity-phase aware representation alignment (IP-REPA) highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment. Extensive experiments across three datasets demonstrate that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.

[CV-15] A global dataset of continuous urban dashcam driving

【速读】: This paper addresses the limited cross-domain robustness and interaction-analysis support of current urban road perception models, which stems largely from training data dominated by crashes, edited content, or anomalous events. The key to the solution is CROWD, a large, carefully curated naturalistic video dataset of routine driving: (1) crashes and edited content are strictly excluded in favor of ordinary, everyday driving; (2) coverage spans 7,103 named inhabited places across all six inhabited continents, ensuring geographic diversity and representativeness; and (3) segment-level manual labels (time of day and vehicle type) plus machine-generated detections and multi-object tracks from YOLOv11x and BoT-SORT lower the barrier for benchmarking and support reproducible research.

链接: https://arxiv.org/abs/2604.01044
作者: Md Shadab Alam, Olena Bazilinska, Pavlo Bazilinskyy
机构: Eindhoven University of Technology; National University of Kyiv-Mohyla Academy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute scale, temporally contiguous, unedited, front facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT); e.g. person, bicycle, motorcycle, car, bus, truck, traffic light, stop sign, etc. CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.

[CV-16] ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

【速读】: This paper tackles fine-grained, independent editing of subjects and scenes in human-centric video generation with video foundation models (VFMs). Existing methods introduce environment control via rigid 3D geometric compositions, forcing a trade-off between precise control and generative flexibility, and their heavy 3D pre-processing limits practical scalability. The key to the solution is the ONE-SHOT framework, which factorizes generation into disentangled signals: a canonical-space injection mechanism decouples human dynamics from environmental cues via cross-attention, and Dynamic-Grounded-RoPE establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignment; a Hybrid Context Integration mechanism further maintains subject and scene consistency across minute-level long video generation.

链接: https://arxiv.org/abs/2604.01043
作者: Fengyuan Yang, Luying Huang, Jiazhi Guan, Quanwei Yang, Dongwei Pan, Jianglin Fu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Angela Yao
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 7 figures

点击查看摘要

Abstract:Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project has been available on: this https URL.

[CV-17] Foundation Model-guided Iteratively Prompting and Pseudo-Labeling for Partially Labeled Medical Image Segmentation

【速读】: This paper addresses the partially labeled problem in medical image segmentation, where high annotation cost and differing clinical priorities leave scans with only a subset of organs labeled, degrading model performance. The key to the solution is IPnP, an iteratively prompting and pseudo-labeling framework in which a trainable segmentation network (specialist) collaborates with a frozen foundation model (generalist) to generate and refine pseudo-labels for unlabeled organs, progressively recovering full-organ supervision and improving segmentation performance.

链接: https://arxiv.org/abs/2604.01038
作者: Qiaochu Zhao, Wei Wei, David Horowitz, Richard Bakst, Yading Yuan
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 5 figures. Accepted for presentation at IEEE International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Automated medical image segmentation has achieved remarkable progress with fully labeled data. However, site-specific clinical priorities and the high cost of manual annotation often yield scans with only a subset of organs labeled, leading to the partially labeled problem that degrades performance. To address this issue, we propose IPnP, an Iteratively Prompting and Pseudo-labeling framework, for partially labeled medical image segmentation. IPnP iteratively generates and refines pseudo-labels for unlabeled organs through collaboration between a trainable segmentation network (specialist) and a frozen foundation model (generalist), progressively recovering full-organ supervision. On the public dataset AMOS with the simulated partial-label setting, IPnP consistently improves segmentation performance over prior methods and approaches the performance of the fully labeled reference. We further evaluate on a private, partially labeled dataset of 210 head-and-neck cancer patients and demonstrate our effectiveness in real-world clinical settings.

[CV-18] Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using Open-Source Photogrammetry

【速读】: This paper addresses the generation of high-resolution digital elevation models (DEMs) of the lunar surface to support landing-site characterization, surface mobility planning, and planetary science. The key to the solution is a fully open-source pipeline for sub-metre DEM production from multi-view imagery of the Orbiter High Resolution Camera (OHRC) on Chandrayaan-2: candidate stereo pairs are screened from image metadata via baseline-to-height (B/H) ratio computation and convergence-angle estimation; dense stereo matching and ray triangulation produce point clouds; and absolute elevation consistency is established through Iterative Closest Point (ICP) alignment and constant-bias offset correction. The resulting DEMs over five geographically distinct sites reach 24-54 cm effective spatial resolution, with 5.85 m vertical RMSE against LRO NAC reference terrain and horizontal accuracy better than 30 cm.

链接: https://arxiv.org/abs/2604.01032
作者: Aaranay Aadi, Jai Singla, Nitant Dube, Oleg Alexandrov
机构: Manipal University Jaipur; Space Applications Centre (SAC); NASA Ames Research Center
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:High-resolution digital elevation models (DEMs) of the lunar surface are essential for surface mobility planning, landing site characterization, and planetary science. The Orbiter High Resolution Camera (OHRC) on board Chandrayaan-2 has the best ground sampling capabilities of any lunar orbital imaging currently in use by acquiring panchromatic imagery at a resolution of roughly 20-30 cm per pixel. This work presents, for the first time, the generation of sub-metre DEMs from OHRC multi-view imagery using an exclusively open-source pipeline. Candidate stereo pairs are identified from non-paired OHRC archives through geometric analysis of image metadata, employing baseline-to-height (B/H) ratio computation and convergence angle estimation. Dense stereo correspondence and ray triangulation are then applied to generate point clouds, which are gridded into DEMs at effective spatial resolutions between approximately 24 and 54 cm across five geographically distributed lunar sites. Absolute elevation consistency is established through Iterative Closest Point (ICP) alignment against Lunar Reconnaissance Orbiter Narrow Angle Camera (NAC) Digital Terrain Models, followed by constant-bias offset correction. Validation against NAC reference terrain yields a vertical RMSE of 5.85 m (at native OHRC resolution), and a horizontal accuracy of less than 30 cm assessed by planimetric feature matching.
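
The stereo-pair screening step, computing the baseline-to-height (B/H) ratio and convergence angle from image metadata, reduces to simple geometry. This is an illustrative sketch with toy camera positions, not the pipeline's actual metadata handling.

```python
import numpy as np

def stereo_geometry(cam_a, cam_b, ground_pt):
    """Baseline-to-height ratio and convergence angle (degrees) for a
    candidate stereo pair, from camera positions and a common ground point."""
    baseline = np.linalg.norm(cam_a - cam_b)
    height = 0.5 * ((cam_a[2] - ground_pt[2]) + (cam_b[2] - ground_pt[2]))
    ray_a = ground_pt - cam_a
    ray_b = ground_pt - cam_b
    cos_c = ray_a @ ray_b / (np.linalg.norm(ray_a) * np.linalg.norm(ray_b))
    angle = float(np.degrees(np.arccos(np.clip(cos_c, -1.0, 1.0))))
    return baseline / height, angle

cam_a = np.array([0.0, 0.0, 100.0])    # toy positions, arbitrary units
cam_b = np.array([30.0, 0.0, 100.0])
ground = np.array([15.0, 0.0, 0.0])
bh, angle = stereo_geometry(cam_a, cam_b, ground)
print(bh, angle)
```

A pair is typically kept only if B/H and the convergence angle fall within ranges that give usable parallax; the specific thresholds used in the paper are not shown here.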

[CV-19] Diff3R: Feed-forward 3D Gaussian Splatting with Uncertainty-aware Differentiable Optimization WWW ATC

【速读】: This paper addresses the split between two lines of work in 3D Gaussian Splatting (3DGS): feed-forward models infer quickly but with limited reconstruction quality, while per-scene optimization yields high quality at heavy computational cost. To combine the strengths of both, the key to the solution is Diff3R, which embeds a differentiable 3DGS optimization layer into the training loop so that the network learns to predict an optimal initialization for test-time optimization rather than a zero-shot output. Gradients are computed via the Implicit Function Theorem, avoiding explicit backpropagation through the optimization steps, and a scalable, matrix-free preconditioned conjugate gradient (PCG) solver tailored to 3DGS optimization keeps this efficient; a data-driven uncertainty model adaptively controls how much parameters may change during optimization, mitigating overfitting in under-constrained regions and improving robustness to input outliers. The optimization layer is model-agnostic and integrates seamlessly into existing feed-forward 3DGS architectures, for both pose-given and pose-free settings.

链接: https://arxiv.org/abs/2604.01030
作者: Yueh-Cheng Liu, Jozef Hladký, Matthias Nießner, Angela Dai
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL , Video: this https URL

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) present two main directions: feed-forward models offer fast inference in sparse-view settings, while per-scene optimization yields high-quality renderings but is computationally expensive. To combine the benefits of both, we introduce Diff3R, a novel framework that explicitly bridges feed-forward prediction and test-time optimization. By incorporating a differentiable 3DGS optimization layer directly into the training loop, our network learns to predict an optimal initialization for test-time optimization rather than a conventional zero-shot result. To overcome the computational cost of backpropagating through the optimization steps, we propose computing gradients via the Implicit Function Theorem and a scalable, matrix-free PCG solver tailored for 3DGS optimization. Additionally, we incorporate a data-driven uncertainty model into the optimization process by adaptively controlling how much the parameters are allowed to change during optimization. This approach effectively mitigates overfitting in under-constrained regions and increases robustness against input outliers. Since our proposed optimization layer is model-agnostic, we show that it can be seamlessly integrated into existing feed-forward 3DGS architectures for both pose-given and pose-free methods, providing improvements for test-time optimization.
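
The "matrix-free" property of the solver mentioned above means only a matrix-vector product callable is needed, never an assembled (or inverted) system matrix. Below is a minimal unpreconditioned conjugate gradient sketch on a tiny SPD system; the paper's PCG additionally uses a preconditioner tailored to 3DGS, which is omitted here.

```python
import numpy as np

def cg_matrix_free(matvec, b, tol=1e-10, max_iter=100):
    """Minimal matrix-free conjugate gradient for SPD systems:
    the operator is only ever accessed through matvec(v)."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # small SPD system for illustration
b = np.array([1.0, 2.0])
x = cg_matrix_free(lambda v: A @ v, b)
print(np.round(x, 4))
```

In an implicit-function-theorem setting, the same solver is what makes the backward pass tractable: the linear system involving the Hessian is solved through Hessian-vector products alone.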

[CV-20] Forecasting Motion in the Wild

【速读】: This paper addresses the lack of a general visual representation for motion and behavior, which makes it hard for vision systems to forecast the future behavior of complex non-rigid agents such as animals in the wild. The key to the solution is to use dense point trajectories as visual tokens for behavior: a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents. Built on this abstraction, a diffusion transformer models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns.

链接: https://arxiv.org/abs/2604.01015
作者: Neerja Thakkar, Shiry Ginosar, Jacob Walker, Jitendra Malik, Joao Carreira, Carl Doersch
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in-the-wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.

[CV-21] AutoMIA: Improved Baselines for Membership Inference Attack via Agentic Self-Exploration

【速读】: This paper addresses the inflexibility of existing membership inference attacks (MIAs), which rely on static handcrafted feature design and therefore transfer poorly across different large models. The key to the solution is the AutoMIA framework, which reformulates membership inference as an automated process of self-exploration and strategy evolution: executable logits-level attack strategies are generated and progressively refined via closed-loop evaluation feedback, and decoupling abstract strategy reasoning from low-level execution logic enables a systematic, model-agnostic traversal of the attack space, markedly improving the adaptability and generalization of attack strategies.

链接: https://arxiv.org/abs/2604.01014
作者: Ruhao Liu, Weiqi Huang, Qi Li, Xinchao Wang
机构: Unknown
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Membership Inference Attacks (MIAs) serve as a fundamental auditing tool for evaluating training data leakage in machine learning models. However, existing methodologies predominantly rely on static, handcrafted heuristics that lack adaptability, often leading to suboptimal performance when transferred across different large models. In this work, we propose AutoMIA, an agentic framework that reformulates membership inference as an automated process of self-exploration and strategy evolution. Given high-level scenario specifications, AutoMIA self-explores the attack space by generating executable logits-level strategies and progressively refining them through closed-loop evaluation feedback. By decoupling abstract strategy reasoning from low-level execution, our framework enables a systematic, model-agnostic traversal of the attack search space. Extensive experiments demonstrate that AutoMIA consistently matches or outperforms state-of-the-art baselines while eliminating the need for manual feature engineering.
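
A classic example of the "logits-level strategies" AutoMIA searches over is the loss-based membership signal: training members tend to incur lower cross-entropy loss. The sketch below illustrates that baseline signal only, not anything generated by AutoMIA itself; the logits are toy values.

```python
import numpy as np

def membership_score(logits, label):
    """Loss-based membership signal: score = log-probability of the true
    label, so members (low loss) tend to score higher than non-members."""
    p = np.exp(logits - logits.max())  # stable softmax
    p /= p.sum()
    return float(np.log(p[label]))

member_logits = np.array([8.0, 0.0, 0.0])     # confident, low-loss (likely member)
nonmember_logits = np.array([1.0, 0.5, 0.4])  # uncertain (likely non-member)
print(membership_score(member_logits, 0), membership_score(nonmember_logits, 0))
```

An attack then thresholds this score (or calibrates it against reference models); AutoMIA's contribution is to have an agent generate and refine such scoring functions automatically rather than fixing one by hand.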

[CV-22] PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

【速读】: This paper addresses the vulnerability of vision-language models (VLMs) to adversarial image perturbations; existing defenses based on adversarial training are computationally expensive and generalize poorly to unseen attack types. The key to the solution is Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that improves robustness at inference time through prompt paraphrasing, question decomposition, and consistency aggregation, without modifying the underlying model; to balance robustness and efficiency, the authors further instantiate PDA as invariants that retain most of the robustness gains while substantially reducing inference cost, yielding a generic, strong, and practical defense for VLMs.

链接: https://arxiv.org/abs/2604.01010
作者: Jingning Xu, Haochen Luo, Chen Liu
机构: City University of Hong Kong
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing works based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, thus requiring no modification on the underlying models. To balance robustness and efficiency, we instantiate PDA as invariants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong and practical defense framework for VLMs during inference.
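The consistency-aggregation step can be sketched as a simple majority vote over the answers produced for the paraphrased and decomposed queries. This is a minimal stand-in; the paper's actual aggregation rule is not specified here, and the example answers are hypothetical:

```python
from collections import Counter

def aggregate_answers(answers):
    """Consistency aggregation: majority vote over the answers obtained for
    paraphrased / decomposed versions of a query; ties go to the answer
    that appeared first."""
    counts = Counter(answers)
    best = max(counts.values())
    for a in answers:          # preserve query order on ties
        if counts[a] == best:
            return a

# hypothetical VLM outputs for one adversarial image under three paraphrases
print(aggregate_answers(["cat", "dog", "cat"]))
```

The intuition is that an adversarial perturbation tuned to one phrasing of the question is less likely to fool all paraphrases at once.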

[CV-23] Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

[Quick Read]: This paper addresses the inefficiency of keyframe sampling for multimodal large language models (MLLMs) on long-form video question answering, caused by limited context length and high computational cost. Existing methods rely on semantic relevance or reinforcement learning; the former fails to capture the evidential clues needed to answer the question, while the latter suffers from inefficient combinatorial optimization. The key to the solution is an evidence-driven keyframe sampling framework grounded in information bottleneck theory, which casts keyframe selection as maximizing the conditional mutual information between the selected frames and the query, giving a principled objective that reflects each frame's contribution to answering the question. A structural decomposition reduces the complex subset-selection problem to independent frame-level scoring, and a query-conditioned evidence scoring network trained with a contrastive objective estimates evidential importance efficiently and accurately.

Link: https://arxiv.org/abs/2604.01002
Authors: Yiheng Wang,Lichen Zhu,Yueqian Lin,Yudong Liu,Jingyang Zhang,Hai “Helen” Li,Yiran Chen
Affiliations: Duke University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame’s contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
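The decomposed frame-level selection can be illustrated with a toy version in which the learned query-conditioned scorer is replaced by cosine similarity and the top-`budget` frames are kept in chronological order. All data here is synthetic; this sketches only the selection mechanics, not the trained network:

```python
import numpy as np

def select_keyframes(frame_feats, query_feat, budget):
    """Score every frame independently against the query (cosine similarity
    standing in for the learned evidence scorer) and keep the top-`budget`
    frames in chronological order."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = f @ q
    return np.sort(np.argsort(-scores)[:budget]), scores

rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 16))               # synthetic frame features
query = frames[42] + 0.05 * rng.normal(size=16)   # frame 42 carries the evidence
idx, scores = select_keyframes(frames, query, budget=8)
print(idx)
```

Because each frame is scored independently, the subset search that makes reinforcement-learning approaches expensive disappears; that is the practical payoff of the decomposition.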

[CV-24] EgoSim: Egocentric World Simulator for Embodied Interaction Generation

[Quick Read]: This paper addresses two limitations of existing egocentric world simulators regarding spatial consistency and dynamic scene-state updating: methods without explicit 3D grounding suffer structural drift under viewpoint changes, and most models treat the scene as static, failing to update world states across multi-stage interactions. The key to the solution is EgoSim, a closed-loop egocentric world simulator that models 3D scenes as updatable world states and combines a Geometry-action-aware Observation Simulation model with an Interaction-aware State Updating module, generating spatially consistent interaction videos while persistently evolving the scene state. In addition, the authors design a scalable data-collection pipeline and the low-cost EgoCap capture system to alleviate the scarcity of densely aligned scene-interaction data, markedly improving visual quality, spatial consistency, and generalization to complex scenes and real-world dexterous manipulation.

Link: https://arxiv.org/abs/2604.01001
Authors: Jinkun Hao,Mingda Jia,Ruiyan Wang,Xihui Liu,Ran Yi,Lizhuang Ma,Jiangmiao Pang,Xudong Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this http URL

Click to view the abstract

Abstract:We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Codes and datasets will be open soon. The project page is at this http URL.

[CV-25] Customizing Large Vision Model-Guided Low-Rank Approximation for Ground-Roll Denoise

[Quick Read]: This paper addresses ground-roll, a strong coherent noise in land seismic and vertical seismic profiling (VSP) data that severely masks reflection events; conventional approaches such as transform-domain filtering, sparse representation, and deep learning suffer from limited adaptability, signal leakage, or dependence on labeled training data when signal and noise overlap strongly. The key to the solution is a training-free framework that reformulates ground-roll attenuation as a semantic-guided signal separation problem: a promptable large vision model converts seismic gathers into visual representations and localizes ground-roll-dominant regions via text or image prompts, producing a continuous soft mask that is embedded into a mask-conditioned low-rank inverse formulation for spatially adaptive suppression and reflection-preserving reconstruction. An efficient solver based on the alternating direction method of multipliers (ADMM) then recovers physically consistent signals stably, without task-specific training or manual annotation.

Link: https://arxiv.org/abs/2604.00998
Authors: Jiacheng Liao,Feng Qian,Ziyin Fan,Yongjian Guo
Affiliations: University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Ground-roll is a dominant source of coherent noise in land and vertical seismic profiling (VSP) data, severely masking reflection events and degrading subsequent imaging and interpretation. Conventional attenuation methods, including transform-domain filtering, sparse representation, and deep learning, often suffer from limited adaptability, signal leakage, or dependence on labeled training data, especially under strong signal-noise overlap. To address these challenges, we propose a training-free framework that reformulates ground-roll attenuation as a semantic-guided signal separation problem. Specifically, a promptable large vision model is employed to extract high-level semantic priors by converting seismic gathers into visual representations and localizing ground-roll-dominant regions via text or image prompts. The resulting semantic response is transformed into a continuous soft mask, which is embedded into a mask-conditioned low-rank inverse formulation to enable spatially adaptive suppression and reflection-preserving reconstruction. An efficient alternating direction method of multipliers (ADMM)-based solver is further developed to solve the proposed inverse problem, enabling stable and physically consistent signal recovery without requiring task-specific training or manual annotation. Extensive experiments on both synthetic and field VSP datasets demonstrate that the proposed method achieves superior ground-roll attenuation while preserving reflection continuity and waveform fidelity, consistently outperforming representative transform-domain filtering and implicit neural representation methods.

[CV-26] Maximizing T2-Only Prostate Cancer Localization from Expected Diffusion Weighted Imaging

[Quick Read]: This paper addresses accurate prostate cancer localization using only T2-weighted MRI (T2w) at inference, while diffusion-weighted imaging (DWI) is available during training as a latent modality. The core challenge is to match or surpass multimodal methods without relying on DWI input at test time. The key to the solution is a new learning paradigm based on an expectation-maximization (EM) framework: in the E-step, a flow matching-based generative model approximates the posterior distribution of DWI images; in the M-step, the cancer localizer and the generative model are jointly optimized to maximize the expected likelihood of cancer presence. This distills information from a privileged modality into uni-modal inference, clearly outperforming methods that lack training-time DWI or rely on conventional privileged-learning frameworks, with gains in both patient-level F1 score and zone-level QWK.

Link: https://arxiv.org/abs/2604.00985
Authors: Weixi Yi,Yipei Wang,Wen Yan,Hanyuan Zhang,Natasha Thorley,Alexander Ng,Shonit Punwani,Fernando Bianco,Mark Emberton,Veeru Kasivisvanathan,Dean C. Barratt,Shaheer U. Saeed,Yipeng Hu
Affiliations: University College London; Stanford University; University of Cambridge; OHSU Knight Cancer Institute; University of Manchester; University of Bristol; Queen Mary University of London; Urological Research Network
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Multiparametric MRI is increasingly recommended as a first-line noninvasive approach to detect and localize prostate cancer, requiring at minimum diffusion-weighted (DWI) and T2-weighted (T2w) MR sequences. Early machine learning attempts using only T2w images have shown promising diagnostic performance in segmenting radiologist-annotated lesions. Such uni-modal T2-only approaches deliver substantial clinical benefits by reducing costs and expertise required to acquire other sequences. This work investigates an arguably more challenging application using only T2w at inference, but to localize individual cancers based on independent histopathology labels. We formulate DWI images as a latent modality (readily available during training) to classify cancer presence at local Barzell zones, given only T2w images as input. In the resulting expectation-maximization algorithm, a latent modality generator (implemented using a flow matching-based generative model) approximates the latent DWI image posterior distribution in the E-steps, while in M-steps a cancer localizer is simultaneously optimized with the generative model to maximize the expected likelihood of cancer presence. The proposed approach provides a novel theoretical framework for learning from a privileged DWI modality, yielding superior cancer localization performance compared to approaches that lack training DWI images or existing frameworks for privileged learning and incomplete modalities. The proposed T2-only methods perform competitively or better than baseline methods using multiple input sequences (e.g., improving the patient-level F1 score by 14.4% and zone-level QWK by 5.3% over the T2w+DWI baseline). We present quantitative evaluations using internal and external datasets from 4,133 prostate cancer patients with histopathology-verified labels.

[CV-27] ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration

[Quick Read]: This paper addresses severe hallucination in large vision-language models (LVLMs), where static context handling during generation fails to cope with dynamic context changes, causing information loss and erroneous reasoning. The key to the solution is Adaptive Context inTegration (ACT), a training-free inference intervention built on two mechanisms: visual context exploration, which uses spatio-temporal profiling to adaptively amplify the attention heads responsible for visual exploration; and semantic context aggregation, which marginalizes potential semantic queries to aggregate visual evidence, mitigating the information loss caused by the discrete nature of token prediction. The method substantially reduces hallucination and achieves strong results on both discriminative and generative benchmarks without harming the model's fundamental generation capabilities.

Link: https://arxiv.org/abs/2604.00983
Authors: Bei Yan,Yuecong Min,Jie Zhang,Shiguang Shan,Xilin Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggles to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.

[CV-28] DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving CVPR2026

[Quick Read]: This paper addresses efficient and generalizable 3D scene modeling with multi-task learning for vision-based autonomous driving. Existing dense BEV (Bird's Eye View) and sparse query models are limited in representational efficiency and task adaptability, while Gaussian-centric representations based on 3D semantic Gaussians, though sparse yet comprehensive, lack a unified pre-training paradigm for downstream tasks. The key to the solution is DLWM (Dual Latent World Models), a two-stage pre-training scheme: the first stage predicts 3D Gaussians from queries by self-supervised reconstruction of multi-view semantic and depth images; the second stage trains two separate latent world models for temporal feature learning, with Gaussian-flow-guided latent prediction serving occupancy perception and 4D occupancy forecasting, and ego-planning-guided latent prediction serving motion planning. This design enables holistic, end-to-end multi-task pre-training under a Gaussian-centric representation and yields significant gains across key autonomous-driving tasks.

Link: https://arxiv.org/abs/2604.00969
Authors: Yiyao Zhu,Ying Xue,Haiming Zhang,Guangfeng Jiang,Wending Zhou,Xu Yan,Jiantao Gao,Yingjie Cai,Bingbing Liu,Zhen Li,Shaojie Shen
Affiliations: HKUST; CUHK-SZ; USTC; Huawei Foundation Model Department
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Click to view the abstract

Abstract:Vision-based autonomous driving has gained much attention due to its low costs and excellent performance. Compared with dense BEV (Bird’s Eye View) or sparse query models, Gaussian-centric method is a comprehensive yet sparse representation by describing scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by self-supervised reconstructing multi-view semantic and depth images. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments in SurroundOcc and nuScenes benchmarks demonstrate that DLWM shows significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting and motion planning tasks.

[CV-29] Enhancing Gradient Inversion Attacks in Federated Learning via Hierarchical Feature Optimization

[Quick Read]: This paper addresses privacy leakage from gradient sharing in federated learning (FL), in particular the limited expressiveness and generalizability of existing gradient inversion attacks that optimize only in the latent space of generative adversarial networks (GANs). The key to the solution is Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and optimizes layer by layer over intermediate feature spaces instead of only the initial latent code; an l_1-ball regularizer constrains the search range to suppress unrealistic image generation. The method is further extended to the out-of-distribution (OOD) setting, where a label mapping technique effectively handles label inconsistency, achieving pixel-level data reconstruction that clearly outperforms existing baselines.

Link: https://arxiv.org/abs/2604.00955
Authors: Hao Fang,Wenbo Yu,Bin Chen,Xuan Wang,Shu-Tao Xia,Qing Liao,Ke Xu
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; Department of Computer Science and Technology, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Federated Learning (FL) has emerged as a compelling paradigm for privacy-preserving distributed machine learning, allowing multiple clients to collaboratively train a global model by transmitting locally computed gradients to a central server without exposing their private data. Nonetheless, recent studies find that the gradients exchanged in the FL system are also vulnerable to privacy leakage, e.g., an attacker can invert shared gradients to reconstruct sensitive data by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge. However, existing attacks simply perform gradient inversion in the latent space of the GAN model, which limits their expression ability and generalizability. To tackle these challenges, we propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the hierarchical features of the intermediate layers. Instead of optimizing only over the initial latent code, we progressively change the optimized layer, from the initial latent space to intermediate layers closer to the output images. In addition, we design a regularizer to avoid unreal image generation by adding a small l_1 ball constraint to the searching range. We also extend GIFD to the out-of-distribution (OOD) setting, which weakens the assumption that the training sets of GANs and FL tasks obey the same data distribution. Furthermore, we consider the challenging OOD scenario of label inconsistency and propose a label mapping technique as an effective solution. Extensive experiments demonstrate that our method can achieve pixel-level reconstruction and outperform competitive baselines across a variety of FL scenarios.
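The basic leakage signal that gradient inversion exploits can be seen even without any GAN prior: for a toy linear least-squares client, the shared gradient reveals the private logit in closed form. This is an illustrative stand-in for the leakage mechanism, not GIFD's feature-domain optimization, and the label y is assumed known to the attacker for simplicity:

```python
import numpy as np

# Toy linear client with loss (w.x - y)^2: the gradient it shares,
# g = 2(w.x - y) w, exposes the private logit w.x exactly.
rng = np.random.default_rng(0)
w = rng.normal(size=5)                   # model weights, known to the server
x_private, y = rng.normal(size=5), 1.0   # client's private data (y assumed known)
g = 2 * (w @ x_private - y) * w          # gradient uploaded in FL

# attacker-side recovery from (w, g) alone
residual = (w @ g) / (2 * w @ w)         # equals w.x_private - y
logit_recovered = y + residual
print(np.isclose(logit_recovered, w @ x_private))
```

For deep networks this inversion has no closed form, which is why attacks like GIFD instead search a generative model's latent and intermediate feature spaces for an input whose gradient matches the shared one.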

[CV-30] YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction

[Quick Read]: This paper addresses the scarcity and limitations of crop yield prediction datasets caused by high acquisition costs, heterogeneous data quality, and privacy restrictions, which hinder scalable data-driven modeling. The key to the solution is constructing and releasing YieldSAT, a large, high-quality, multimodal crop yield prediction dataset spanning multiple countries and climate zones and covering major crops (corn, rapeseed, soybeans, and wheat) across 2,173 expert-curated fields, with over 12.2 million yield samples at 10 m spatial resolution, paired multispectral satellite imagery (113,555 labeled images), and auxiliary environmental data. The study further benchmarks deep learning models and data fusion architectures to demonstrate the potential of high-resolution pixel regression for yield prediction, and explores a domain-informed Deep Ensemble approach to mitigate the performance degradation caused by distribution shifts in real-world ground-truth data.

Link: https://arxiv.org/abs/2604.00940
Authors: Miro Miranda,Deepak Pathak,Patrick Helber,Benjamin Bischke,Hiba Najjar,Francisco Mena,Cristhian Sanchez,Akshay Pai,Diego Arenas,Matias Valdenegro-Toro,Marcela Charfuelan,Marlon Nuske,Andreas Dengel
Affiliations: RPTU Kaiserslautern-Landau; DFKI GmbH; Vision Impulse GmbH; University of Groningen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and includes major crop types, including corn, rapeseed, soybeans, and wheat, across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate this, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at this https URL.

[CV-31] EmoScene: A Dual-space Dataset for Controllable Affective Image Generation

[Quick Read]: This paper addresses the limited control that text-to-image diffusion models offer over scene semantics and fine-grained affective tone. Current models rarely model affective dimensions (such as valence, arousal, and dominance) and perceptual attributes (such as color harmony, luminance contrast, and texture variation) within a unified representation, making it hard for generated scenes to convey coherent and nuanced emotional intent. The key to the solution is EmoScene, a large-scale dual-space emotion dataset that jointly encodes affective dimensions and perceptual attributes, with contextual semantics as supporting annotations; the authors also provide a lightweight baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, enabling controllable affective generation under dual supervision.

Link: https://arxiv.org/abs/2604.00933
Authors: Li He,Longtai Zhang,Wenqiang Zhang,Yan Wang,Lizhe Qi
Affiliations: Fudan University; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Text-to-image diffusion models have achieved high visual fidelity, yet precise control over scene semantics and fine-grained affective tone remains challenging. Human visual affect arises from the rapid integration of contextual meaning, including valence, arousal, and dominance, with perceptual cues such as color harmony, luminance contrast, texture variation, curvature, and spatial layout. However, current text-to-image models rarely represent affective and perceptual factors within a unified representation, which limits their ability to synthesize scenes with coherent and nuanced emotional intent. To address this gap, we construct EmoScene, a large-scale dual-space emotion dataset that jointly encodes affective dimensions and perceptual attributes, with contextual semantics provided as supporting annotations. EmoScene contains 1.2M images across more than three hundred real-world scene categories, each annotated with discrete emotion labels, continuous VAD values, perceptual descriptors and textual captions. Multi-space analyses reveal how discrete emotions occupy the VAD space and how affect systematically correlates with scene-level perceptual factors. To benchmark EmoScene, we provide a lightweight reference baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, serving as a reproducible probe of affect controllability enabled by dual-space supervision.

[CV-32] Autoregressive Appearance Prediction for 3D Gaussian Avatars

[Quick Read]: This paper addresses the challenge of generating photorealistic and immersive human avatars that capture person-specific details (cloth and hair dynamics, subtle facial expressions, and characteristic motion patterns) while avoiding the ambiguities and spurious correlations that arise when similar poses correspond to different appearances; models that fit such details during training tend to overfit and produce unstable or abrupt appearance changes under novel poses. The key to the solution is a 3D Gaussian Splatting avatar model with a spatial MLP backbone, conditioned on both pose and an appearance latent learned by an encoder, which compresses and disambiguates pose-driven renderings and markedly improves reconstruction quality and stability; at driving time, the latent is predicted autoregressively, yielding temporally smooth appearance evolution and ensuring robust, practical avatar driving under dynamic conditions.

Link: https://arxiv.org/abs/2604.00928
Authors: Michael Steiner,Zhang Chen,Alexander Richard,Vasu Agrawal,Markus Steinberger,Michael Zollhöfer
Affiliations: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Click to view the abstract

Abstract:A photorealistic and immersive human avatar experience demands capturing fine, person-specific details such as cloth and hair dynamics, subtle facial expressions, and characteristic motion patterns. Achieving this requires large, high-quality datasets, which often introduce ambiguities and spurious correlations when very similar poses correspond to different appearances. Models that fit these details during training can overfit and produce unstable, abrupt appearance changes for novel poses. We propose a 3D Gaussian Splatting avatar model with a spatial MLP backbone that is conditioned on both pose and an appearance latent. The latent is learned during training by an encoder, yielding a compact representation that improves reconstruction quality and helps disambiguate pose-driven renderings. At driving time, our predictor autoregressively infers the latent, producing temporally smooth appearance evolution and improved stability. Overall, our method delivers a robust and practical path to high-fidelity, stable avatar driving.

[CV-33] Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting

[Quick Read]: This paper addresses motion-based dance retrieval, i.e., identifying semantically similar choreographies directly from raw video, through an end-to-end framework called DANCEMATCH that realizes DANCE FINGERPRINTING. Traditional approaches represent pose sequences with continuous embeddings that are hard to index, interpret, or scale to large collections. The key to the solution is constructing compact, discrete motion signatures: Skeleton Motion Quantisation (SMQ) combined with Spatio-Temporal Transformers (STT) encodes human poses extracted by Apple CoMotion into a structured motion vocabulary that preserves the spatio-temporal structure of dance and supports efficient large-scale retrieval. The accompanying DANCE RETRIEVAL ENGINE (DRE) performs sub-linear retrieval with a histogram-based index followed by re-ranking for refined matching, markedly improving generalization across dance styles and to unseen choreographies.

Link: https://arxiv.org/abs/2604.00927
Authors: Arina Kharlamova,Bowei He,Chen Ma,Xue Liu
Affiliations: MBZUAI, Abu Dhabi, UAE; City University of Hong Kong, Hong Kong, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:We present DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, defined as DANCE FINGERPRINTING. While existing motion analysis and retrieval methods can compare pose sequences, they rely on continuous embeddings that are difficult to index, interpret, or scale. In contrast, DANCEMATCH constructs compact, discrete motion signatures that capture the spatio-temporal structure of dance while enabling efficient large-scale retrieval. Our system integrates Skeleton Motion Quantisation (SMQ) with Spatio-Temporal Transformers (STT) to encode human poses, extracted via Apple CoMotion, into a structured motion vocabulary. We further design DANCE RETRIEVAL ENGINE (DRE), which performs sub-linear retrieval using a histogram-based index followed by re-ranking for refined matching. To facilitate reproducible research, we release DANCETYPESBENCHMARK, a pose-aligned dataset annotated with quantised motion tokens. Experiments demonstrate robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, establishing a foundation for scalable motion fingerprinting and quantitative choreographic analysis.
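The histogram-based coarse index can be sketched as follows, ranking clips by similarity between L1-normalised token histograms. The token streams are synthetic, `token_histogram` and `retrieve` are hypothetical names, and the re-ranking stage is omitted:

```python
import numpy as np

def token_histogram(tokens, vocab_size):
    """Quantised motion signature: an L1-normalised histogram of the
    discrete motion tokens of one dance clip."""
    h = np.bincount(tokens, minlength=vocab_size).astype(float)
    return h / max(h.sum(), 1.0)

def retrieve(query_tokens, index, vocab_size, top_k=3):
    """Coarse stage: rank clips by histogram similarity (1 minus half the
    L1 distance); a finer re-ranking stage would follow in practice."""
    q = token_histogram(query_tokens, vocab_size)
    sims = {vid: 1.0 - 0.5 * float(np.abs(q - h).sum()) for vid, h in index.items()}
    return sorted(sims, key=sims.get, reverse=True)[:top_k]

V = 32                                   # motion vocabulary size
rng = np.random.default_rng(2)
index = {f"dance_{i}": token_histogram(rng.integers(0, V, 200), V) for i in range(5)}
tokens = rng.integers(0, V, 200)
index["target"] = token_histogram(tokens, V)
noisy = tokens.copy()
noisy[:20] = rng.integers(0, V, 20)      # query: same dance, 10% of tokens perturbed
print(retrieve(noisy, index, V, top_k=1))
```

Because histograms discard token order, this coarse stage is cheap and order-invariant; the temporal structure captured by the transformer tokens is what the re-ranking stage would then exploit.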

[CV-34] Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis

[Quick Read]: This paper addresses the redundancy and model-specificity of representations produced by pretrained image encoders, which limit their efficiency when reused across tasks and models. The key to the solution is a training-free, post-hoc method based on canonical correlation analysis (CCA): by exploiting the shared structure between the representations of two pretrained image encoders, it finds linear projections that automatically identify and retain semantically consistent common dimensions while discarding redundancy, yielding more efficient and generalizable compressed representations. The method can reduce representation dimensionality by more than 75% while maintaining high accuracy, or improve performance at fixed dimensionality by transferring representations from larger or fine-tuned models, with clear gains over baselines and PCA projection on multiple benchmarks.

Link: https://arxiv.org/abs/2604.00921
Authors: Dylan B. Lewis,Jens Gregor,Hector Santos-Villalobos
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 6 tables

Abstract:Modern vision pipelines increasingly rely on pretrained image encoders whose representations are reused across tasks and models, yet these representations are often overcomplete and model-specific. We propose a simple, training-free method to improve the efficiency of image representations via a post-hoc canonical correlation analysis (CCA) operator. By leveraging the shared structure between representations produced by two pre-trained image encoders, our method finds linear projections that serve as a principled form of representation selection and dimensionality reduction, retaining shared semantic content while discarding redundant dimensions. Unlike standard dimensionality reduction techniques such as PCA, which operate on a single embedding space, our approach leverages cross-model agreement to guide representation distillation and refinement. The technique allows representations to be reduced by more than 75% in dimensionality with improved downstream performance, or enhanced at fixed dimensionality via post-hoc representation transfer from larger or fine-tuned models. Empirical results on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over both baseline and PCA-projected representations, with accuracy gains of up to 12.6%.
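A minimal numpy sketch of such a post-hoc CCA operator, assuming two embedding matrices from different encoders over the same images. The whitening-plus-SVD route used here is one standard way to compute linear CCA; the synthetic data and names are illustrative, not the paper's pipeline:

```python
import numpy as np

def cca_projection(X, Y, k, eps=1e-6):
    """Fit linear CCA between embeddings X (n, dx) and Y (n, dy) of the same
    images from two encoders; return a projection for X onto the top-k
    cross-model-shared directions, plus the canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):  # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.clip(w, eps, None))) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, _ = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :k], s[:k]

# synthetic encoders: both views are linear maps of an 8-dim shared latent
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 8))
X = Z @ rng.normal(size=(8, 64)) + 0.1 * rng.normal(size=(500, 64))
Y = Z @ rng.normal(size=(8, 32)) + 0.1 * rng.normal(size=(500, 32))
A, corrs = cca_projection(X, Y, k=8)
X_reduced = (X - X.mean(axis=0)) @ A   # 64-dim -> 8-dim shared representation
print(X_reduced.shape, float(corrs[0]))
```

Dimensions with low canonical correlation carry little cross-model agreement; those are the ones a selection rule of this kind discards.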

[CV-35] ProCap: Projection-Aware Captioning for Spatial Augmented Reality

[Quick Read]: This paper addresses the semantic confusion between virtual content and the physical scene in spatial augmented reality (SAR): standard vision language models (VLMs) struggle to distinguish projected content from the real physical environment, limiting intelligent interaction such as scene reasoning or answering user queries. The key to the solution is the ProCap framework, which explicitly decouples the virtual and physical layers via a two-stage pipeline: automated segmentation first visually separates projected content from the physical scene, and region-aware retrieval then avoids the semantic ambiguity caused by projection distortion. This provides a reliable semantic foundation for SAR, supporting more accurate intelligent-interaction tasks.

Link: https://arxiv.org/abs/2604.00912
Authors: Zimo Cao,Yuchen Deng,Haibin Ling,Bingyao Huang
Affiliations: Southwest University; Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 16 pages, 7 figures

Click to view the abstract

Abstract:Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first it visually isolates virtual and physical layers via automated segmentation; then it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: this https URL.

[CV-36] JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

[Quick Read]: This paper addresses reliability issues in current Japanese visual question answering (VQA) benchmarks, such as ambiguous questions, incorrect answers, and instances solvable without visual understanding, which can lead to misleading model comparisons. The key to the solution is JAMMEval, a systematically refined collection of Japanese benchmarks built by revising seven existing Japanese VQA datasets through two rounds of human annotation, substantially improving data quality and evaluation reliability. Experiments show that this refinement yields scores that better reflect model capability, reduces run-to-run variance, and improves the ability to distinguish models of different capability levels.

Link: https://arxiv.org/abs/2604.00909
Authors: Issa Sugiura,Koki Maeda,Shuhei Kurita,Yusuke Oda,Daisuke Kawahara,Naoaki Okazaki
Affiliations: Kyoto University; NII LLMC; NII; Waseda University; Institute of Science Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 11 figures

Click to view the abstract

Abstract:Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese VQA. We further demonstrate the effectiveness of our refinement by showing that the resulting benchmarks yield evaluation scores that better reflect model capability, exhibit lower run-to-run variance, and improve the ability to distinguish between models of different capability levels. We release our dataset and code to advance reliable evaluation of VLMs.

[CV-37] IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off

[Quick Read]: This paper addresses identity leakage when outputs of personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) are shared publicly on social platforms: even when personalization itself is authorized, the generated images can still be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects reference photos by disrupting model fine-tuning, and thus cannot handle leakage under authorized personalization. The key to the solution is a new defense paradigm, model-side output immunization, realized as Identity-Decoupled personalized Diffusion Models (IDDM): identity decoupling is integrated into the personalization pipeline by alternating short personalization updates with identity-decoupled data optimization under a two-stage schedule, markedly reducing identity linkability while preserving high-quality generation and offering a tunable privacy-utility trade-off.

Link: https://arxiv.org/abs/2604.00903
Authors: Linyan Dai,Xinwei Zhang,Haoyang Li,Qingqing Ye,Haibo Hu
Affiliations: The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility. Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.

[CV-38] Super-Resolving Coarse-Resolution Weather Forecasts With Flow Matching

【速读】:该论文旨在解决高分辨率机器学习天气预报模型在训练与运行时计算成本过高的问题。其核心解决方案是提出一种模块化框架,将预测过程与空间分辨率解耦:先以低分辨率生成预报轨迹,再通过学习的生成式超分辨率(Generative Super-Resolution)作为后处理步骤恢复高分辨率细节。关键创新在于将超分辨率建模为一个随机逆问题,并采用残差形式(residual formulation)来保留大尺度结构的同时重建未解析的小尺度变率,且仅需在再分析数据上使用流匹配(flow matching)进行训练,无需端到端高分辨率建模即可实现接近操作型集合预报系统的概率预报技能(0.25°分辨率下),同时显著降低额外训练开销。

链接: https://arxiv.org/abs/2604.00897
作者: Aymeric Delefosse,Anastase Charantonis,Dominique Béréziat
机构: Inria, ARCHES, Paris, France; Sorbonne Université, CNRS, LIP6, Paris, France
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Climate Informatics 2026

点击查看摘要

Abstract:Machine learning-based weather forecasting models now surpass state-of-the-art numerical weather prediction systems, but training and operating these models at high spatial resolution remains computationally expensive. We present a modular framework that decouples forecasting from spatial resolution by applying learned generative super-resolution as a post-processing step to coarse-resolution forecast trajectories. We formulate super-resolution as a stochastic inverse problem, using a residual formulation to preserve large-scale structure while reconstructing unresolved variability. The model is trained with flow matching exclusively on reanalysis data and is applied to global medium-range forecasts. We evaluate (i) design consistency by re-coarsening super-resolved forecasts and comparing them to the original coarse trajectories, and (ii) high-resolution forecast quality using standard ensemble verification metrics and spectral diagnostics. Results show that super-resolution preserves large-scale structure and variance after re-coarsening, introduces physically consistent small-scale variability, and achieves competitive probabilistic forecast skill at 0.25° resolution relative to an operational ensemble baseline, while requiring only a modest additional training cost compared with end-to-end high-resolution forecasting.

[CV-39] Adversarial Attenuation Patch Attack for SAR Object Detection

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)目标检测系统在面对对抗攻击时的脆弱性问题,尤其是现有攻击方法存在扰动明显、难以物理实现等局限。其解决方案的关键在于提出一种新型的对抗衰减补丁(Adversarial Attenuation Patch, AAP)方法,该方法通过能量约束优化策略与基于衰减机制的部署框架,在攻击有效性与隐蔽性之间实现平衡,并且能够契合信号级电子干扰原理,从而具备良好的物理可实现潜力。实验表明,AAP不仅显著降低检测性能,还保持高不可察觉性并具有跨模型迁移能力,为SAR目标检测系统的物理层面对抗攻击提供了新视角和可行路径。

链接: https://arxiv.org/abs/2604.00887
作者: Yiming Zhang,Weibo Qin,Feng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 5 pages, 4 figures. Source code is available at this https URL

点击查看摘要

Abstract:Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting physical implementation constraints for attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at this https URL.

[CV-40] A 4D Representation for Training-Free Agent ic Reasoning from Monocular Laparoscopic Video

【速读】:该论文旨在解决软组织手术中人工智能(AI)系统在时空推理能力上的不足,尤其是在理解手术视频时缺乏对时间与三维空间的显式建模问题。其解决方案的关键在于提出一种基于显式4D表示的框架,将点跟踪、深度估计和分割模型融合为一个时空一致的4D模型,从而为工具和组织提供时空语义信息;在此基础上,无需微调即可利用多模态大语言模型(MLLM)作为代理,基于该4D表示中的轨迹等工具进行自然语言推理,实现对时空事件的精准定位与理解。

链接: https://arxiv.org/abs/2604.00867
作者: Maximilian Fehrentz,Nicolas Stellwag,Robert Wiebe,Nicole Thorisch,Fabian Grob,Patrick Remerscheid,Ken-Joel Simmoteit,Benjamin D. Killeen,Christian Heiliger,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be “assembled” from 2D MLLMs and 3D computer vision models without additional training. Code, data, and examples are available at this https URL

[CV-41] Shape Representation using Gaussian Process mixture models

【速读】:该论文旨在解决传统显式三维表示(如点云和网格)在存储空间需求高、表面查找索引复杂等问题,从而提出一种高效、紧凑且连续的函数型形状表示方法。其解决方案的关键在于利用高斯过程(Gaussian Process, GP)混合模型来建模表面几何结构,通过从稀疏采样的点云中学习连续的方向距离场,实现对复杂拓扑的精确表达;同时,通过在关键参考点上锚定局部GP先验,结合任意结构分解方法(如骨骼化或基于距离的聚类)灵活提取参考点,使模型兼具轻量化与高表达能力。

链接: https://arxiv.org/abs/2604.00862
作者: Panagiotis Sapoutzoglou,George Terzakis,Georgios Floros,Maria Pateraki
机构: National Technical University of Athens (雅典国立技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in ISPRS 2026

点击查看摘要

Abstract:Traditional explicit 3D representations, such as point clouds and meshes, demand significant storage to capture fine geometric details and require complex indexing systems for surface lookups, making functional representations an efficient, compact, and continuous alternative. In this work, we propose a novel, object-specific functional shape representation that models surface geometry with Gaussian Process (GP) mixture models. Rather than relying on computationally heavy neural architectures, our method is lightweight, leveraging GPs to learn continuous directional distance fields from sparsely sampled point clouds. We capture complex topologies by anchoring local GP priors at strategic reference points, which can be flexibly extracted using any structural decomposition method (e.g. skeletonization, distance-based clustering). Extensive evaluations on the ShapeNetCore and IndustryShapes datasets demonstrate that our method can efficiently and accurately represent complex geometries.

[CV-42] Sparkle: A Robust and Versatile Representation for Point Cloud based Human Motion Capture ICLR2026

【速读】:该论文旨在解决基于点云的人体动作捕捉中如何构建一个既能保持高表达性又具备强鲁棒性的表征问题,以克服现有方法在点基(几何细节丰富但噪声大)与骨架基(鲁棒但过于简化)表示之间难以平衡的困境。其解决方案的关键在于提出一种名为Sparkle的结构化表征,通过显式地将内部运动学结构(kinematic structure)与外部表面几何(surface geometry)进行因子分解,从而实现对人类运动的精细化建模;在此基础上,SparkleMotion框架利用分层模块嵌入几何连续性和运动学约束,显著提升了模型在严重域偏移、噪声和遮挡等复杂场景下的准确率、鲁棒性及泛化能力。

链接: https://arxiv.org/abs/2604.00857
作者: Yiming Ren,Yujing Sun,Aoru Xue,Kwok-Yan Lam,Yuexin Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Point cloud-based motion capture leverages rich spatial geometry and privacy-preserving sensing, but learning robust representations from noisy, unstructured point clouds remains challenging. Existing approaches face a difficult trade-off between point-based methods (geometrically detailed but noisy) and skeleton-based ones (robust but oversimplified). We address the fundamental challenge: how to construct an effective representation for human motion capture that can balance expressiveness and robustness. In this paper, we propose Sparkle, a structured representation unifying skeletal joints and surface anchors with explicit kinematic-geometric factorization. Our framework, SparkleMotion, learns this representation through hierarchical modules embedding geometric continuity and kinematic constraints. By explicitly disentangling internal kinematic structure from external surface geometry, SparkleMotion achieves state-of-the-art performance not only in accuracy but crucially in robustness and generalization under severe domain shifts, noise, and occlusion. Extensive experiments demonstrate our superiority across diverse sensor types and challenging real-world scenarios.

[CV-43] Perturb-and-Restore: Simulation-driven Structural Augmentation Framework for Imbalance Chromosomal Anomaly Detection ALT

【速读】:该论文旨在解决遗传病诊断中结构染色体异常检测面临的样本严重不平衡与稀缺问题,尤其在临床实践中难以获取足够多样化的异常染色体数据,导致深度学习模型性能显著下降。其解决方案的关键在于提出一种名为“扰动与恢复”(Perturb-and-Restore, PR)的模拟驱动型结构增强框架,该框架包含两个核心组件:一是通过扰动正常染色体带纹模式并利用恢复扩散网络重建连续染色体内容和边界,生成合成异常染色体以减少对稀有真实异常样本的依赖;二是基于能量评分的自适应采样策略,动态筛选高质量合成样本,从而提升模型训练效率与泛化能力。

链接: https://arxiv.org/abs/2604.00854
作者: Yilan Zhang,Hanbiao Chen,Changchun Yang,Yuetan Chu,Siyuan Chen,Jing Wu,Jingdong Hu,Na Li,Junkai Su,Yuxuan Chen,Ao Xu,Xin Gao,Aihua Yin
机构: King Abdullah University of Science and Technology (KAUST); Guangdong Provincial Maternal and Child Health Hospital; Smiltec
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This preprint version of the manuscript has been submitted to the IEEE Journal of Biomedical and Health Informatics (JBHI) for review

点击查看摘要

Abstract:Detecting structural chromosomal abnormalities is crucial for accurate diagnosis and management of genetic disorders. However, collecting sufficient structural abnormality data is extremely challenging and costly in clinical practice, and not all abnormal types can be readily collected. As a result, deep learning approaches face significant performance degradation due to the severe imbalance and scarcity of abnormal chromosome data. To address this challenge, we propose Perturb-and-Restore (PR), a simulation-driven structural augmentation framework that effectively alleviates data imbalance in chromosome anomaly detection. The PR framework comprises two key components: (1) Structure Perturbation and Restoration Simulation, which generates synthetic abnormal chromosomes by perturbing chromosomal banding patterns of normal chromosomes followed by a restoration diffusion network that reconstructs continuous chromosome content and edges, thus eliminating reliance on rare abnormal samples; and (2) Energy-guided Adaptive Sampling, an energy score-based online selection strategy that dynamically prioritizes high-quality synthetic samples by referencing the energy distribution of real samples. To evaluate our method, we construct a comprehensive structural anomaly dataset consisting of over 260,000 chromosome images, including 4,242 abnormal samples spanning 24 categories. Experimental results demonstrate that the PR framework achieves state-of-the-art (SOTA) performance, surpassing existing methods with an average improvement of 8.92% in sensitivity, 8.89% in precision, and 13.79% in F1-score across all categories.

[CV-44] MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer

【速读】:该论文旨在解决基于扩散 Transformer(Diffusion Transformer, DiT)的视频生成方法在多对象场景中缺乏细粒度控制的问题,现有方法仅适用于单对象视频,难以实现对复杂现实场景中多个物体运动的精确迁移。解决方案的关键在于提出 MotionGrounder 框架,其核心创新包括:(1)引入基于流的运动信号(Flow-based Motion Signal, FMS),为目标视频生成提供稳定且可分解的运动先验;(2)设计物体-文本对齐损失(Object-Caption Alignment Loss, OCAL),将物体描述词与生成视频中的空间区域进行精准锚定;(3)提出新的物体接地评分(Object Grounding Score, OGS),综合评估生成对象与源视频对象的空间对应关系及语义一致性。这些机制共同实现了多对象可控的运动迁移,显著提升了生成视频的质量和可控性。

链接: https://arxiv.org/abs/2604.00853
作者: Samuel Teodoro,Yun Chen,Agus Gunawan,Soo Ye Kim,Jihyong Oh,Munchurl Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Please visit our project page at this https URL

点击查看摘要

Abstract:Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)-based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, the first DiT-based framework to handle motion transfer with multi-object controllability. Our Flow-based Motion Signal (FMS) in MotionGrounder provides a stable motion prior for target video generation, while our Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency between each generated object and its target caption. Our experiments show that MotionGrounder consistently outperforms recent baselines across quantitative, qualitative, and human evaluations.

[CV-45] Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation CVPR2026

【速读】:该论文旨在解决**主体驱动的文本到图像生成(Subject-Driven Text-to-Image Generation)中的“相似性-可控性悖论”(similarity-controllability paradox),即在保持主体身份高保真度的同时实现精确的文本控制往往难以兼顾的问题。其核心解决方案是提出DisCo框架,关键在于通过先解耦(Disentangle)后重构耦合(re-Couple)**的方式分离视觉与文本信息:首先利用参考图像和主体实体词提取主体身份,同时将文本提示简化为仅含修改指令的命令式描述(使用泛指代词指代主体),从而消除语义歧义;随后设计专用奖励信号并结合强化学习,使模型重新融合视觉定义的主体与文本生成的上下文,实现自然且一致的图像合成。该方法有效打破了传统范式中对文本提示的双重角色依赖,显著提升了主体保真度与文本控制精度的协同能力。

链接: https://arxiv.org/abs/2604.00849
作者: Shuang Li,Chao Deng,Hang Chen,Liqun Liu,Zhenyu Hu,Te Cao,Mengge Xue,Yuan Chen,Peng Shu,Huan Yu,Jie Jiang
机构: Tencent(腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 (Main)

点击查看摘要

Abstract:Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject’s identity while editing its context based on a text prompt. A core challenge in this task is the “similarity-controllability paradox”, where enhancing textual control often degrades the subject’s fidelity, and vice-versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, where the subject refers to general pronouns, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.

[CV-46] Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction CVPR’26

【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在视频理解任务中计算成本过高、难以高效部署的问题。现有方法如 Patch Pruning 虽能节省计算资源,但仅限于模型深层进行令牌(token)剪枝,忽视了早期层的压缩潜力,限制了整体效率提升。解决方案的关键在于提出一种新型视频补丁剪枝框架(Video Patch Pruning, VPP),通过引入时序先验知识,在 ViT 的早期网络阶段实现高效稀疏性。其核心创新是一个可微分的时序映射模块,利用深层特征提取出的强前景选择性来精准筛选早期阶段中最相关的补丁,从而在保持性能稳定的同时实现高达 60% 的补丁减少,显著优于传统图像级剪枝方法(通常仅支持约 30% 稀疏度)。

链接: https://arxiv.org/abs/2604.00827
作者: Patrick Glandorf,Thomas Norrenbrock,Bodo Rosenhahn
机构: L3S - Leibniz University Hannover (莱布尼茨汉诺威大学L3S研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR’26 Workshops

点击查看摘要

Abstract:Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch Pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore we propose a fully differentiable module for temporal mapping to accurately select the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding the capabilities of conventional image-based patch pruning, which typically operates around a 30% patch sparsity. VPP excels in the high-sparsity regime, sustaining remarkable performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the Youtube-VIS 2021 dataset.
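摘要中“按相关性分数保留最相关补丁”的剪枝思路可以用如下草图直观理解。注意:论文中的时序映射模块是可微分、可学习的,而此处的 `prune_patches` 函数、分数来源与保留比例均为说明性假设,并非论文的原始实现。

```python
import numpy as np

def prune_patches(tokens, scores, keep_ratio=0.4):
    """按相关性分数保留比例为 keep_ratio 的补丁 token(示意性实现)。
    tokens: (N, D) 补丁嵌入; scores: (N,) 相关性分数
    (论文中该分数由深层特征经可微时序映射得到, 此处仅为假设输入)。"""
    n_keep = max(1, int(keep_ratio * len(tokens)))
    idx = np.argsort(scores)[::-1][:n_keep]  # 分数最高的 n_keep 个补丁
    return tokens[idx], idx

# 示例: 10 个补丁, 保留 40%
tokens = np.arange(20).reshape(10, 2)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 0.0])
kept, idx = prune_patches(tokens, scores, keep_ratio=0.4)
```

这种硬性 top-k 选择本身不可微;论文通过可微模块在早期网络层实现类似的选择效果,上述代码只展示剪枝后计算量下降的来源(后续层仅处理保留的补丁)。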

[CV-47] Continual Vision-Language Learning for Remote Sensing: Benchmarking and Analysis

【速读】:该论文旨在解决当前遥感视觉-语言模型(Remote Sensing Vision-Language Models, RS VLMs)在面对持续涌现的新传感模态和下游任务时,因依赖静态训练数据而导致的灾难性遗忘问题,从而限制其持续适应能力。解决方案的关键在于构建了一个名为CLeaRS的综合性基准,包含10个精心设计的数据子集(超过207k张图像-文本对),覆盖多样化的解释任务、传感模态和应用场景,并定义了三种评估协议:长期 horizon 设置、模态增量设置和任务增量设置,以系统性地评测RS VLMs的持续学习性能。通过该基准对多种视觉-语言模型的广泛测试,揭示了现有方法在任务、指令和模态转换中均存在显著遗忘现象,凸显了开发专为RS VLMs定制的持续学习方法的必要性。

链接: https://arxiv.org/abs/2604.00820
作者: Xingxing Weng,Ruifeng Ni,Chao Pang,XiangYu Hao,Yishan Wang,Xiaokang Zhang,Wei Xu,Gui-Song Xia
机构: Wuhan University; Beijing Normal University; Shanxi Agricultural University; Wuhan University; Wuhan University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 7 figures, 9 tables

点击查看摘要

Abstract:Current remote sensing vision-language models (RS VLMs) demonstrate impressive performance in image interpretation but rely on static training data, limiting their ability to accommodate continuously emerging sensing modalities and downstream tasks. This exposes a fundamental challenge: enabling RS VLMs to continually adapt without catastrophic forgetting. Despite its practical importance, the continual learning capability of RS VLMs remains underexplored, and no dedicated benchmark currently exists. In this work, we present CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing. CLeaRS comprises 10 curated subsets with over 207k image-text pairs, spanning diverse interpretation tasks, sensing modalities, and application scenarios. We further define three evaluation protocols: long-horizon, modality-incremental, and task-incremental settings, to systematically assess continual adaptation. Extensive benchmarking of diverse vision-language models reveals catastrophic forgetting across all settings. Moreover, representative continual learning methods, when adapted to RS VLMs, exhibit limited effectiveness in handling task, instruction, and modality transitions. Our findings underscore the need for developing continual learning methods tailored to RS VLMs.

[CV-48] Multicentric thrombus segmentation using an attention-based recurrent network with gradual modality dropout

【速读】:该论文旨在解决3D脑部影像中微小病灶(如缺血性卒中中的罪犯血栓)的检测与分割问题,此类病灶通常具有尺寸小、对比度低、模态表达不一致等特点,且多中心数据存在域偏移(domain shift)、各向异性(anisotropy)及序列缺失等挑战。其解决方案的关键在于提出一种基于注意力机制的循环分割网络(UpAttLLSTM),该网络通过2.5D递归单元聚合切片间上下文信息,并利用注意力门融合不同模态的互补特征,从而提升对各向异性和类别不平衡的鲁棒性;同时引入渐进式模态丢弃策略,在训练中模拟多中心数据的异质性与缺失情况,实现增强与正则化双重效果,显著改善跨中心泛化能力。

链接: https://arxiv.org/abs/2604.00817
作者: Sofia Vargas-Ibarra,Vincent Vigneron,Hichem Maaref,Sonia Garcia-Salicetti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Detecting and delineating tiny targets in 3D brain scans is a central yet under-addressed challenge in medical imaging. In ischemic stroke, for instance, the culprit thrombus is small, low-contrast, and variably expressed across modalities (e.g., susceptibility-weighted T2 blooming, diffusion restriction on DWI/ADC), while real-world multi-center data introduce domain shifts, anisotropy, and frequent missing sequences. We introduce a methodology that couples an attention-based recurrent segmentation network (UpAttLLSTM) with a training schedule that progressively increases the difficulty of hetero-modal learning via gradual modality dropout. UpAttLLSTM aggregates context across slices via recurrent units (2.5D) and uses attention gates to fuse complementary cues across available sequences, making it robust to anisotropy and class imbalance. Gradual modality dropout systematically simulates site heterogeneity, noise, and missing modalities during training, acting as both augmentation and regularization to improve multi-center generalization. On a monocentric cohort, our approach detects thrombi in 90% of cases with a Dice score of 0.65. In a multi-center setting with missing modalities, it achieves ~80% detection with a Dice score around 0.35. Beyond stroke, the proposed methodology directly transfers to other small-lesion tasks in 3D medical imaging where targets are scarce, subtle, and modality-dependent.
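摘要中的“渐进式模态丢弃”(gradual modality dropout)可以用如下草图理解:训练早期几乎不丢弃模态,随训练推进逐步提高丢弃概率,以模拟多中心数据中的序列缺失。函数名、线性爬升调度与上限概率 `p_max` 均为本文示意性的假设,摘要并未给出论文的具体调度规则。

```python
import random

def gradual_modality_dropout(modalities, epoch, max_epoch, p_max=0.5):
    """以随训练线性增大的概率随机丢弃输入模态(示意性草图)。
    论文的具体调度策略未在摘要中给出, 此处的线性爬升仅为假设。"""
    p = p_max * min(1.0, epoch / max_epoch)  # 丢弃概率从 0 线性升至 p_max
    kept = [m for m in modalities if random.random() >= p]
    # 至少保留一个模态, 保证网络始终有输入
    return kept if kept else [random.choice(modalities)]
```

用法示例:训练循环中每个 batch 调用一次,例如 `gradual_modality_dropout(["DWI", "ADC", "SWI"], epoch, 100)`,被丢弃的模态通道在输入张量中置零即可同时起到增强与正则化的作用。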

[CV-49] DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

【速读】:该论文旨在解决当前端到端自动驾驶系统中因依赖稀疏感知而限制决策能力的问题,尤其针对现有视觉-语言-动作(VLA)模型虽引入语言描述辅助规划但未能充分利用三维空间信息的局限性。其核心解决方案是提出一种新的视觉-几何-动作(VGA)范式,强调稠密三维几何(dense 3D geometry)作为自主驾驶的关键线索,并设计了流式处理的Driving Visual Geometry Transformer(DVGT-2),通过时序因果注意力机制和历史特征缓存实现在线推理,同时采用滑动窗口策略优化计算效率,从而在保证实时性的同时显著提升几何重建精度,并可直接迁移至不同相机配置下的轨迹规划任务,无需微调。

链接: https://arxiv.org/abs/2604.00813
作者: Sicheng Zuo,Zixun Xie,Wenzhao Zheng,Shaoqing Xu,Fang Li,Hanbing Li,Long Chen,Zhi-Xin Yang,Jiwen Lu
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tencent AI Lab (腾讯人工智能实验室); 3. Alibaba Cloud (阿里云); 4. National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Code is available at this https URL

点击查看摘要

Abstract:End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.

[CV-50] Compact Keyframe-Optimized Multi-Agent Gaussian Splatting SLAM

【速读】:该论文旨在解决多智能体RGB-D高斯点云SLAM(Simultaneous Localization and Mapping)系统在受限通信带宽下,因密集地图表示导致实时数据传输效率低的问题。其核心挑战在于如何在不显著降低地图保真度的前提下减少通信负载。解决方案的关键在于:首先,在SLAM系统中引入压缩步骤以移除冗余的3D高斯分布(3D Gaussians),从而在保持渲染质量的同时降低数据量;其次,提出两种中央闭环计算模式——纯渲染深度模式(仅依赖3D高斯分布)和相机深度模式(额外使用轻量级深度图以提升配准精度并进一步修剪高斯分布),实现高效且鲁棒的地图融合与优化。实验表明,该方法可在合成与真实数据集上实现高达85–95%的数据传输量减少,显著推动了3D高斯多智能体SLAM在实际场景中的部署可行性。

链接: https://arxiv.org/abs/2604.00804
作者: Monica M.Q. Li,Pierre-Yves Lajoie,Jialiang Liu,Giovanni Beltrame
机构: Polytechnique Montreal (蒙特利尔理工学院); University of Wisconsin (威斯康星大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Efficient multi-agent 3D mapping is essential for robotic teams operating in unknown environments, but dense representations hinder real-time exchange over constrained communication links. In multi-agent Simultaneous Localization and Mapping (SLAM), systems typically rely on a centralized server to merge and optimize the local maps produced by individual agents. However, sharing these large map representations, particularly those generated by recent methods such as Gaussian Splatting, becomes a bottleneck in real-world scenarios with limited bandwidth. We present an improved multi-agent RGB-D Gaussian Splatting SLAM framework that reduces communication load while preserving map fidelity. First, we incorporate a compaction step into our SLAM system to remove redundant 3D Gaussians, without degrading the rendering quality. Second, our approach performs centralized loop closure computation without an initial guess, operating in two modes: a pure rendered-depth mode that requires no data beyond the 3D Gaussians, and a camera-depth mode that includes lightweight depth images for improved registration accuracy and additional Gaussian pruning. Evaluation on both synthetic and real-world datasets shows up to 85-95% reduction in transmitted data compared to state-of-the-art approaches in both modes, bringing 3D Gaussian multi-agent SLAM closer to practical deployment in real-world scenarios. Code: this https URL

[CV-51] HICT: High-precision 3D CBCT reconstruction from a single X-ray

【速读】:该论文旨在解决基于单张低剂量全景X线片(panoramic X-ray, PX)重建高保真、几何一致的三维锥形束CT(cone-beam CT, CBCT)图像的难题,该问题在临床中具有重要意义但长期受限于几何不一致性与重建精度不足。解决方案的关键在于提出一个两阶段框架HiCT:第一阶段利用视频扩散模型从单张全景X线生成几何一致的多视角投影图;第二阶段通过基于射线的动态注意力网络和X线采样策略,从这些投影中重建高质量CBCT图像。此外,作者构建了大规模配对数据集XCT(包含500对PX-CBCT样本),为方法训练与验证提供了坚实基础。

链接: https://arxiv.org/abs/2604.00792
作者: Wen Ma,Jiaxiang Liu,Zikai Xiao,Ziyang Wang,Feng Yang,Zuozhu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate 3D dental imaging is vital for diagnosis and treatment planning, yet CBCT’s high radiation dose and cost limit its accessibility. Reconstructing 3D volumes from a single low-dose panoramic X-ray is a promising alternative but remains challenging due to geometric inconsistencies and limited accuracy. We propose HiCT, a two-stage framework that first generates geometrically consistent multi-view projections from a single panoramic image using a video diffusion model, and then reconstructs high-fidelity CBCT from the projections using a ray-based dynamic attention network and an X-ray sampling strategy. To support this, we built XCT, a large-scale dataset combining public CBCT data with 500 paired PX-CBCT cases. Extensive experiments show that HiCT achieves state-of-the-art performance, delivering accurate and geometrically consistent reconstructions for clinical use.

[CV-52] An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

【速读】:该论文旨在解决当前外科视觉-语言模型(Vision-Language Models, VLMs)在理解手术视频中细粒度时空动态关系时存在的数据匮乏与评估不足问题。现有外科视觉-语言数据集难以捕捉和评估复杂的、交错的时空交互,且大规模高质量标注成本高昂或依赖大语言模型生成易引入误差。其解决方案的关键在于提出SurgSTU-Pipeline——一个确定性生成流程,通过引入时间连续性和空间连续性过滤机制,可靠地构建用于细粒度时空多模态理解的外科数据集。该方法应用于公开数据集后生成了包含7515个视频片段及15万条细粒度时空问答样本的SurgSTU数据集,实验证明其可显著提升VLMs在手术视频中的时空理解能力,尤其在上下文学习(in-context learning)和微调后表现最优。

链接: https://arxiv.org/abs/2604.00784
作者: Lennart Maack,Alexander Schlaefer
机构: Hamburg University of Technology, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotations or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7515 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A fine-tuned VLM on the SurgSTU training dataset achieves the highest performance among all spatial-temporal tasks, validating the dataset’s efficacy to improve spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.

[CV-53] Using predefined vector systems to speed up neural network multimillion class classification

【速读】:This paper tackles the computational complexity of label prediction in neural networks (NNs): conventional classification has O(n) time complexity, where n is the number of classes, which becomes a bottleneck for very large-scale classification. The key to the solution is to exploit the geometry of the NN latent space (LS): label prediction is recast as an O(1) closest-cluster-center search in a predefined target vector system used for latent space configuration (LSC). The method only requires finding the indexes of a few largest and lowest values in the embedding vector, making inference highly efficient while leaving the original model's training accuracy unchanged. Experiments on multiple datasets show up to 11.6x overall acceleration, along with a unique ability to predict the existence of new classes.

链接: https://arxiv.org/abs/2604.00779
作者: Nikita Gabdullin,Ilya Androsov
机构: Joint Stock "Research and production company “Kryptonite”
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures, 3 tables, 2 algorithms, 1 theorem, 1 lemma

点击查看摘要

Abstract:Label prediction in neural networks (NNs) has O(n) complexity proportional to the number of classes. This holds true for classification using fully connected layers and cosine similarity with some set of class prototypes. In this paper we show that if NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can be significantly reduced. This is achieved by associating label prediction with the O(1) complexity closest cluster center search in a vector system used as target for latent space configuration (LSC). The proposed method only requires finding indexes of several largest and lowest values in the embedding vector making it extremely computationally efficient. We show that the proposed method does not change NN training accuracy computational results. We also measure the time required by different computational stages of NN inference and label prediction on multiple datasets. The experiments show that the proposed method allows to achieve up to 11.6 times overall acceleration over conventional methods. Furthermore, the proposed method has unique properties which allow to predict the existence of new classes.
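The O(1) lookup described in the abstract can be illustrated with a toy construction (an assumption for illustration, not the paper's exact vector system): if each class target is placed at a distinct standard basis vector, a trained embedding reveals its label through the index of its largest component, independent of how many classes exist.

```python
import numpy as np

# Toy vector system (hypothetical): class c's target is the c-th basis
# vector, so label prediction reduces to reading off the index of the
# largest embedding component -- no O(n) comparison against all classes.
def predict_label(embedding):
    return int(np.argmax(embedding))

n_classes, dim = 1000, 1000
targets = np.eye(n_classes, dim)  # predefined target vector system
# Embeddings that landed near their targets, with small noise.
noisy = targets + 0.1 * np.random.default_rng(0).normal(size=targets.shape)
preds = [predict_label(e) for e in noisy]
accuracy = np.mean(np.array(preds) == np.arange(n_classes))
```

The paper's actual scheme uses several largest and lowest indexes rather than a single argmax, but the complexity argument is the same: the lookup cost depends on the embedding dimension, not the class count.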

[CV-54] PrivHAR-Bench: A Graduated Privacy Benchmark Dataset for Video-Based Action Recognition

【速读】:This paper addresses the lack of a standardized evaluation framework in privacy-preserving human activity recognition (HAR) research: existing methods are mostly evaluated in a binary "clear video vs. a single privacy transformation" setting, which limits cross-method comparability and obscures the fine-grained relationship between privacy strength and recognition utility. The key to the solution is PrivHAR-Bench, a multi-tier benchmark dataset that applies a graduated spectrum of visual privacy transformations, from lightweight spatial obfuscation to cryptographic block permutation, to 1,932 source videos, each distributed across 9 tiers of increasing privacy strength, with background-removed variants to isolate human motion features from contextual scene bias. This design makes the privacy-utility trade-off quantifiable under standardized conditions; experiments with R3D-18 demonstrate a measurable and interpretable accuracy degradation curve as privacy strength increases, establishing a controlled, comparable benchmark for privacy-preserving HAR methods.

链接: https://arxiv.org/abs/2604.00761
作者: Samar Ansari
机构: University of Chester (切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Existing research on privacy-preserving Human Activity Recognition (HAR) typically evaluates methods against a binary paradigm: clear video versus a single privacy transformation. This limits cross-method comparability and obscures the nuanced relationship between privacy strength and recognition utility. We introduce \textitPrivHAR-Bench, a multi-tier benchmark dataset designed to standardize the evaluation of the \textitPrivacy-Utility Trade-off in video-based action recognition. PrivHAR-Bench applies a graduated spectrum of visual privacy transformations: from lightweight spatial obfuscation to cryptographic block permutation, to a curated subset of 15 activity classes selected for human articulation diversity. Each of the 1,932 source videos is distributed across 9 parallel tiers of increasing privacy strength, with additional background-removed variants to isolate the contribution of human motion features from contextual scene bias. We provide lossless frame sequences, per-frame bounding boxes, estimated pose keypoints with joint-level confidence scores, standardized group-based train/test splits, and an evaluation toolkit computing recognition accuracy and privacy metrics. Empirical validation using R3D-18 demonstrates a measurable and interpretable degradation curve across tiers, with within-tier accuracy declining from 88.8% (clear) to 53.5% (encrypted, background-removed) and cross-domain accuracy collapsing to 4.8%, establishing PrivHAR-Bench as a controlled benchmark for comparing privacy-preserving HAR methods under standardized conditions. The dataset, generation pipeline, and evaluation code are publicly available.
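The graduated transformations described above can be sketched for a single grayscale frame (an illustrative sketch; the benchmark's actual tiers and parameters are not specified here): light spatial obfuscation via box-filter downsampling, and a keyed block permutation standing in for the cryptographic tier.

```python
import numpy as np

def spatial_blur(frame, k):
    """Lightweight spatial obfuscation: average k x k blocks, then
    replicate them back to the original resolution."""
    h, w = frame.shape
    small = frame[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.kron(small, np.ones((k, k)))

def block_permute(frame, k, seed=0):
    """Cryptographic-style obfuscation: shuffle k x k tiles with a
    keyed RNG. Pixel values survive; spatial layout does not."""
    h, w = frame.shape
    tiles = frame.reshape(h // k, k, w // k, k).swapaxes(1, 2).reshape(-1, k, k)
    order = np.random.default_rng(seed).permutation(len(tiles))
    shuffled = tiles[order].reshape(h // k, w // k, k, k).swapaxes(1, 2)
    return shuffled.reshape(h, w)

frame = np.arange(64.0).reshape(8, 8)  # stand-in for a video frame
tiers = [frame, spatial_blur(frame, 2), spatial_blur(frame, 4), block_permute(frame, 2)]
```

Each later tier destroys more appearance information while preserving coarse motion energy, which is exactly the axis the benchmark's degradation curve measures.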

[CV-55] IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

【速读】:This paper addresses the rapidly growing computational cost of large vision language models (LVLMs) on image and video understanding tasks as the number of visual tokens increases. Existing token pruning methods rely largely on empirical heuristics and overlook the internal mechanism of attention. The key to the solution is a dual-form view of attention: attention is reformulated as an implicit linear layer whose weight matrix is a sum of rank-1 outer products, each generated by one token's key-value pair, so that token pruning reduces to selecting the subset of rank-1 updates that best approximates the original dual weight matrix. From this perspective the authors derive a new metric that jointly quantifies a token's information magnitude and information duplication, and propose a Progressive Chunked Maximal Marginal Relevance algorithm to select the subset efficiently, improving efficiency substantially while preserving performance.

链接: https://arxiv.org/abs/2604.00757
作者: Dong-Jae Lee,Sunghyun Baek,Junmo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training free token pruning framework grounded in the dual form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank 1 outer products, each generated by a single token’s key value pair. Token pruning thus reduces to selecting an optimal subset of these rank 1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token’s information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade off between performance and efficiency, while providing another perspective on existing pruning approaches.
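The dual form is easiest to see for linear (unnormalized) attention, where the output is q @ W with W a sum of rank-1 outer products, one per key-value pair. The sketch below illustrates that view and a greedy pruning stand-in; the Frobenius-norm criterion is a placeholder, not the paper's magnitude/duplication metric.

```python
import numpy as np

def dual_form_attention(queries, keys, values):
    """Linear attention via its dual form: the implicit weight matrix
    is the sum of rank-1 updates k_i v_i^T, one per token."""
    W = sum(np.outer(k, v) for k, v in zip(keys, values))  # (d_k, d_v)
    return queries @ W

def prune_tokens(keys, values, keep):
    """Token pruning as implicit weight pruning: keep the rank-1 updates
    whose removal perturbs W least. Here, greedily keep the largest
    Frobenius norms (illustrative stand-in for the paper's metric)."""
    norms = [np.linalg.norm(np.outer(k, v)) for k, v in zip(keys, values)]
    idx = np.argsort(norms)[::-1][:keep]
    return [keys[i] for i in idx], [values[i] for i in idx]

rng = np.random.default_rng(0)
keys = [rng.normal(size=4) for _ in range(8)]
values = [rng.normal(size=4) for _ in range(8)]
q = rng.normal(size=(1, 4))

full = dual_form_attention(q, keys, values)
pk, pv = prune_tokens(keys, values, keep=6)
pruned = dual_form_attention(q, pk, pv)
err = np.linalg.norm(full - pruned) / np.linalg.norm(full)
```

The dual form is exactly equivalent to the familiar (q K^T) V factorization for linear attention; extending the argument to softmax attention is where the paper's derivation comes in.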

[CV-56] A Benchmark of State-Space Models vs. Transformers and BiLSTM-based Models for Historical Newspaper OCR

【速读】:This paper addresses key challenges in end-to-end optical character recognition (OCR) for historical newspapers, including long text sequences, degraded print quality, and complex layouts. Transformer-based recognizers perform well, but their quadratic time complexity limits paragraph-level transcription efficiency and large-scale deployment. The paper proposes a new architecture based on state-space models (SSMs), introducing the linear-time Mamba model to OCR for the first time, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling for efficient, accurate recognition. Experiments show accuracy comparable to strong neural baselines (e.g., 6.07% character error rate (CER) at the severely degraded paragraph level), 2.05x faster inference, and memory growth of only 1.26x versus 2.30x for Transformers, comparing favorably with mainstream methods such as DAN and Tesseract OCR.

链接: https://arxiv.org/abs/2604.00725
作者: Merveilles Agbeti-messan,Thierry Paquet,Clément Chatelain,Pierrick Tranouez,Stéphane Nicolas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR. We present to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini). Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released 99% verified gold-standard annotations, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26x vs 2.30x growth at 1000 chars), reaching 6.07% CER at the severely degraded paragraph level compared to 5.24% for DAN, while remaining 2.05x faster. We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR. 
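The quadratic-vs-linear scaling argument behind the benchmark can be made concrete with a back-of-the-envelope cost model (an illustrative sketch; real FLOP counts depend on heads, layers, and state size, which are ignored here):

```python
def attention_cost(seq_len, d):
    """Self-attention scales quadratically: every token attends to
    every other token, so cost ~ L^2 * d (constants omitted)."""
    return seq_len * seq_len * d

def ssm_cost(seq_len, d):
    """A state-space scan touches each token once: cost ~ L * d."""
    return seq_len * d

# Doubling a paragraph from 500 to 1000 characters quadruples the
# attention cost but only doubles the SSM cost.
ratio_attn = attention_cost(1000, 256) / attention_cost(500, 256)
ratio_ssm = ssm_cost(1000, 256) / ssm_cost(500, 256)
```

This is why the paper reports efficiency, not accuracy, as the main differentiator at paragraph level: all the neural models reach low error rates, but only the linear-time models keep cost growth manageable on long lines.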

[CV-57] TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

【速读】:This paper addresses the high cost and poor domain transferability caused by current video reasoning models' reliance on large-scale supervised data and multi-stage training pipelines. The core solution is a test-time reinforcement learning framework (TTA-Vid) that adapts the model on video-language data at test time, requiring no labeled data or dedicated training splits. The key innovations are: (1) using a batch-aware frequency-based reward computed across different frame subsets as a pseudo ground truth, guiding the model to progressively update its policy during inference; and (2) a multi-armed bandit strategy that dynamically selects informative frames to further strengthen adaptation. Experiments show that the method generalizes across datasets from a single batch or even a single sample, clearly outperforming state-of-the-art methods trained on large-scale data.

链接: https://arxiv.org/abs/2604.00696
作者: Soumya Shamarao Jahagirdar,Edson Araujo,Anna Kukleva,M. Jehanzeb Mirza,Saurabhchand Bhati,Samuel Thomas,Brian Kingsbury,Rogerio Feris,James R. Glass,Hilde Kuehne
机构: Google(谷歌); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to allow for adapting a pretrained model to incoming video samples at test-time without explicit labels. The proposed test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously: (1) a test-time adaptation that performs step-by-step reasoning at inference time on multiple frame subsets. We then use a batch-aware frequency-based reward computed across different frame subsets as pseudo ground truth to update the model. It shows that the resulting model trained on a single batch or even a single sample from a dataset, is able to generalize at test-time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Additionally, we propose a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.
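The frequency-based reward can be sketched as a majority vote over the model's answers on different frame subsets (an illustrative sketch; the paper's reward is batch-aware and drives a policy update, which is omitted here):

```python
from collections import Counter

def frequency_reward(answers):
    """Frequency-based pseudo ground truth (sketch): the most common
    answer across frame subsets becomes the pseudo label, and each
    rollout is rewarded 1.0 if it agrees, 0.0 otherwise."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return pseudo_label, [1.0 if a == pseudo_label else 0.0 for a in answers]

# Hypothetical answers from the same model on four frame subsets.
answers = ["cat", "cat", "dog", "cat"]
pseudo, rewards = frequency_reward(answers)
```

The intuition is self-consistency: answers that are stable across different views of the evidence are more likely correct, so agreement with the consensus is a usable training signal without any ground-truth labels.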

[CV-58] TP-Seg: Task-Prototype Framework for Unified Medical Lesion Segmentation

【速读】:This paper addresses problems in multi-modality, multi-task medical lesion segmentation, where shared encoders lead to feature entanglement, gradient interference, and weak lesion discrimination. The key to the solution is the TP-Seg framework, built on two core mechanisms: a task-conditioned adapter that balances shared and task-specific representations through a dual-path expert structure, enabling adaptive feature extraction across diverse imaging modalities and lesion types; and a prototype-guided task decoder that introduces learnable task prototypes as semantic anchors and uses cross-attention to finely model task-specific foreground and background semantics, thereby improving generalization, scalability, and clinical applicability across diverse medical segmentation tasks.

链接: https://arxiv.org/abs/2604.00684
作者: Jiawei Xu,Qiangqiang Zhou,Dandan Zhu,Yong Chen,Yugen Yi,Xiaoqi Zhao
机构: Jiangxi Normal University (江西师范大学); East China Normal University (华东师范大学); Yale University (耶鲁大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Building a unified model with a single set of parameters to efficiently handle diverse types of medical lesion segmentation has become a crucial objective for AI-assisted diagnosis. Existing unified segmentation approaches typically rely on shared encoders across heterogeneous tasks and modalities, which often leads to feature entanglement, gradient interference, and suboptimal lesion discrimination. In this work, we propose TP-Seg, a task-prototype framework for unified medical lesion segmentation. On one hand, the task-conditioned adapter effectively balances shared and task-specific representations through a dual-path expert structure, enabling adaptive feature extraction across diverse medical imaging modalities and lesion types. On the other hand, the prototype-guided task decoder introduces learnable task prototypes as semantic anchors and employs a cross-attention mechanism to achieve fine-grained modeling of task-specific foreground and background semantics. Without bells and whistles, TP-Seg consistently outperforms specialized, general and unified segmentation methods across 8 different medical lesion segmentation tasks covering multiple imaging modalities, demonstrating strong generalization, scalability and clinical applicability.

[CV-59] MoonAnything: A Vision Benchmark with Large-Scale Lunar Supervised Data ACM-MM

【速读】:This paper addresses the lack of high-quality datasets providing both geometric and photometric supervision for lunar surface perception, which limits the robustness of learning-based perception systems under complex illumination and low-texture conditions. The key to the solution is MoonAnything, a unified benchmark built on real lunar topography with physically-based rendering, the first to offer large-scale, multi-illumination data with both dense depth maps (geometric supervision) and photorealistic images (photometric supervision). It comprises two complementary subsets: LunarGeo provides stereo images and dense depth maps for 3D reconstruction and pose estimation, while LunarPhoto uses a spatially-varying bidirectional reflectance distribution function (BRDF) model to generate high-fidelity images and multi-illumination renderings under real solar configurations, supporting reflectance estimation and illumination-robust perception. The benchmark provides an unprecedented testbed and training resource for visual perception on the Moon and other airless celestial bodies.

链接: https://arxiv.org/abs/2604.00682
作者: Clémentine Grethen,Yuang Shi,Simone Gasparini,Géraldine Morin
机构: IRIT - Université de Toulouse France (图卢兹大学IRIT研究所); National University of Singapore Singapore (新加坡国立大学); IPAL, IRL2955 Singapore (IPAL, 新加坡IRL2955)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACM MMSys 2026

点击查看摘要

Abstract:Accurate perception of lunar surfaces is critical for modern lunar exploration missions. However, developing robust learning-based perception systems is hindered by the lack of datasets that provide both geometric and photometric supervision. Existing lunar datasets typically lack either geometric ground truth, photometric realism, illumination diversity, or large-scale coverage. In this paper, we introduce MoonAnything, a unified benchmark built on real lunar topography with physically-based rendering, providing the first comprehensive geometric and photometric supervision under diverse illumination with large scale. The benchmark comprises two complementary sub-datasets : i) LunarGeo provides stereo images with corresponding dense depth maps and camera calibration enabling 3D reconstruction and pose estimation; ii) LunarPhoto provides photorealistic images using a spatially-varying BRDF model, along with multi-illumination renderings under real solar configurations, enabling reflectance estimation and illumination-robust perception. Together, these datasets offer over 130K samples with comprehensive supervision. Beyond lunar applications, MoonAnything offers a unique setting and challenging testbed for algorithms under low-textured, high-contrast conditions and applies to other airless celestial bodies and could generalize beyond. We establish baselines using state-of-the-art methods and release the complete dataset along with generation tools to support community extension: this https URL.

[CV-60] CL-VISTA: Benchmarking Continual Learning in Video Large Language Models

【速读】:This paper addresses the limitations of existing benchmarks for evaluating continual learning in video large language models (Video-LLMs): many rely on models without large-scale pre-training, and partitioning a single dataset into sub-tasks leads to high task redundancy and negligible forgetting. The key to the solution is the CL-VISTA benchmark, which curates 8 diverse tasks spanning perception, understanding, and reasoning, inducing substantial distribution shifts that effectively expose catastrophic forgetting. It further establishes a systematic evaluation framework covering three dimensions, performance, computational efficiency, and memory footprint, to comprehensively assess continual learning methods, revealing a fundamental trade-off in current mainstream approaches between mitigating forgetting and preserving generalization.

链接: https://arxiv.org/abs/2604.00677
作者: Haiyang Guo,Yichen Shi,Fei Zhu,Wenzhuo Liu,Hongbo Zhao,Fanhu Zeng,Shijie Ma,Da-Han Wang,Xu-Yao Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Hong Kong Institute of Science Innovation, CAS (香港科学创新研究所,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); School of Computer and Information Engineering, Xiamen University of Technology (厦门理工学院计算机与信息工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to assess whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.
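Two standard continual-learning quantities referenced in discussions of catastrophic forgetting can be computed from the task-accuracy matrix; this is a minimal sketch of the common definitions, not necessarily CL-VISTA's exact protocols.

```python
import numpy as np

def cl_metrics(acc):
    """acc[i][j]: accuracy on task j after training on task i (j <= i).
    Returns final average accuracy and average forgetting: the gap
    between each earlier task's best past accuracy and its final one."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    avg_acc = acc[T - 1].mean()
    forgetting = np.mean([acc[:T - 1, j].max() - acc[T - 1, j] for j in range(T - 1)])
    return avg_acc, forgetting

# Hypothetical 3-task sequence; row i = evaluation after training task i.
acc = [[0.9, 0.0, 0.0],
       [0.7, 0.8, 0.0],
       [0.6, 0.7, 0.85]]
avg_acc, forgetting = cl_metrics(acc)
```

A method that "mitigates forgetting" lowers the second number; whether it also preserves general capability is exactly what CL-VISTA's additional general video understanding assessment probes.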

[CV-61] When AI and Experts Agree on Error: Intrinsic Ambiguity in Dermatoscopic Images

【速读】:This paper investigates the root cause of performance gaps between AI diagnosis systems and human experts on dermatoscopic images, in particular whether images that AI models systematically misclassify reflect algorithmic bias or intrinsic visual ambiguity. The key to the solution is a controlled study in which independent dermatologists evaluated the images that AI systems repeatedly misclassified alongside a control group. These difficult samples turn out to degrade not only AI performance but also human consistency and accuracy: expert agreement with ground-truth labels (Cohen's kappa) collapses from 0.61 to 0.08 on the difficult images, and inter-rater consensus (Fleiss kappa) falls from 0.456 to 0.275. This indicates that the AI failures stem not from algorithmic bias alone but from inherent visual complexity or quality defects of the images themselves, identifying image quality as a core driver of human-AI diagnostic disagreement.

链接: https://arxiv.org/abs/2604.00651
作者: Loris Cino,Pier Luigi Mazzeo,Alessandro Martella,Giulia Radi,Renato Rossi,Cosimo Distante
机构: Consiglio Nazionale delle Ricerche (意大利国家研究委员会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models-a phenomenon statistically proven to exceed random chance. To determine if these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted, with Cohen’s kappa dropping to a mere 0.08 for the difficult images, compared to a 0.61 for the control group. Second, we observed a severe deterioration in expert consensus; inter-rater reliability among physicians fell from moderate concordance (Fleiss kappa = 0.456) on control images to only modest agreement (Fleiss kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available
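Cohen's kappa, the agreement statistic cited throughout the abstract, corrects raw agreement for what two raters would agree on by chance; a minimal implementation on hypothetical labels (the 10-lesion example below is invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for the chance
    agreement implied by each rater's marginal label frequencies."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical example: ground-truth labels vs. one expert's calls.
truth  = ["mel", "mel", "nev", "nev", "nev", "mel", "nev", "nev", "mel", "nev"]
expert = ["mel", "nev", "nev", "nev", "nev", "mel", "nev", "mel", "mel", "nev"]
kappa = cohens_kappa(truth, expert)
```

A kappa of 0.08, as reported for the difficult images, means agreement barely above chance, which is why the paper reads it as intrinsic ambiguity rather than rater sloppiness.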

[CV-62] DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization CVPR2026

【速读】:This paper addresses the information loss and detail dilution caused by undistortion preprocessing in 3D Gaussian Splatting (3DGS) reconstruction from fisheye camera input, along with the resulting floater artifacts at image edges. Conventional pipelines undistort fisheye images before training to fit the native 3DGS framework, but the black borders discard information, and the stretch-and-interpolate resampling reduces pixel density, causing the model to overfit low-frequency regions and produce blur and floaters. The key to the solution is integrating the fisheye camera model directly into the original 3DGS framework, enabling native fisheye input without undistortion preprocessing; the paper further proposes a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, effectively alleviating the extreme Gaussian deformations caused by per-iteration single-random-view optimization and markedly improving reconstruction quality.

链接: https://arxiv.org/abs/2604.00648
作者: Zhengxian Yang,Fei Xie,Xutao Xue,Rui Zhang,Taicheng Huang,Yang Liu,Mengqi Ji,Tao Yu
机构: Tsinghua University (清华大学); Beihang University (北京航空航天大学); JD.com, Beijing, China (京东); Shanghai AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye’s large FOV advantage; 2) Undistortion’s stretch-and-interpolate resampling spreads each pixel’s value over a larger area, diluting detail density – causes 3DGS overfitting these low-frequency zones, producing blur and floating artifacts. In this work, we integrate fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: Distortion increases toward the periphery, and 3DGS’s original per-iteration random-selecting-view optimization ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views-a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.
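The geometric contrast between pinhole and fisheye projection explains both the wide-FOV advantage and why distortion grows toward the periphery. The sketch below uses the equidistant model, one common fisheye parameterization; the abstract does not state which model DirectFisheye-GS adopts, so treat this as illustrative.

```python
import math

def pinhole_radius(theta, f):
    """Pinhole projection: image radius grows as tan(theta), diverging
    as the incidence angle approaches 90 degrees -- rays near the edge
    of a wide FOV cannot land on a finite sensor."""
    return f * math.tan(theta)

def fisheye_radius(theta, f):
    """Equidistant fisheye projection (common model): radius grows
    linearly with the incidence angle, keeping even near-90-degree
    rays on the sensor, at the cost of increasing peripheral distortion."""
    return f * theta

f = 1.0
wide = math.radians(80)  # a ray 80 degrees off the optical axis
```

Undistorting a fisheye image maps the linear-in-theta samples back onto the tan(theta) grid, which is exactly the stretch-and-interpolate step the paper identifies as diluting peripheral detail.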

[CV-63] LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics ICIP2026

【速读】:This paper addresses the high computational complexity of state-of-the-art panoptic segmentation models, which makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. The key to the solution is LiPS, a lightweight architecture that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway, substantially lowering computational cost without a significant loss of segmentation accuracy. Experiments on standard benchmarks show that LiPS approaches the performance of much heavier baselines while delivering up to 4.5x higher throughput and requiring nearly 6.8x fewer computations, markedly improving real-time practicality.

链接: https://arxiv.org/abs/2604.00634
作者: Calvin Galagain,Martyna Poreba,François Goulette,Cyrill Stachniss
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE ICIP 2026. Under review

点击查看摘要

Abstract:Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of efficient-to-compute panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims at providing a strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5x higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.

[CV-64] TALENT: Target-aware Efficient Tuning for Referring Image Segmentation CVPR26

【速读】:This paper studies the "non-target activation" (NTA) problem in parameter-efficient tuning (PET) methods for referring image segmentation (RIS): visual features fail to focus on the text-referred target instance and instead activate co-category yet unrelated objects. The key to the solution is the TALENT framework with two core modules: a Rectified Cost Aggregator (RCA) that efficiently aggregates text-related visual features, and a Target-aware Learning Mechanism (TLM) comprising contextual pairwise consistency learning and target-centric contrastive learning. The former builds a text-referred affinity map from sentence-level text features to optimize the semantic consistency of visual features, while the latter strengthens target localization and suppresses associations with other unrelated objects; working in concert, the two effectively address the NTA problem.

链接: https://arxiv.org/abs/2604.00609
作者: Shuo Jin,Siyue Yu,Bingfeng Zhang,Chao Yao,Meiqin Liu,Jimin Xiao
机构: XJTLU(西交利物浦大学); University of Liverpool(利物浦大学); China University of Petroleum (East China)(中国石油大学(华东)); University of Science and Technology Beijing(北京科技大学); Beijing Jiaotong University(北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR26 Findings

点击查看摘要

Abstract:Referring image segmentation aims to segment specific targets based on a natural text expression. Recently, parameter-efficient tuning (PET) has emerged as a promising paradigm. However, existing PET-based methods often suffer from the fact that visual features can’t emphasize the text-referred target instance but activate co-category yet unrelated objects. We analyze and quantify this problem, terming it the non-target activation' (NTA) issue. To address this, we propose a novel framework, TALENT, which utilizes target-aware efficient tuning for PET-based RIS. Specifically, we first propose a Rectified Cost Aggregator (RCA) to efficiently aggregate text-referred features. Then, to calibrate NTA’ into accurate target activation, we adopt a Target-aware Learning Mechanism (TLM), including contextual pairwise consistency learning and target-centric contrastive learning. The former uses the sentence-level text feature to achieve a holistic understanding of the referent and constructs a text-referred affinity map to optimize the semantic association of visual features. The latter further enhances target localization to discover the distinct instance while suppressing associations with other unrelated ones. The two objectives work in concert and address `NTA’ effectively. Extensive evaluations show that TALENT outperforms existing methods across various metrics (e.g., 2.5% mIoU gains on G-Ref val set). Our codes will be released at: this https URL.

[CV-65] Fluently Lying: Adversarial Robustness Can Be Substrate-Dependent

【速读】:This paper examines an implicit but untested core assumption behind current defenses against adversarial attacks: that when an object detector is attacked, detection count and accuracy (e.g., mAP) degrade in tandem. The authors empirically identify a new failure mode, termed Quality Corruption (QC), in which the detection count remains stable while accuracy collapses, observed on EMS-YOLO, a spiking neural network (SNN) model (retaining more than 70% of detections while mAP drops from 0.528 to 0.042). The key contribution is revealing and naming this decoupled failure mode, which challenges the shared assumption underlying existing defenses calibrated on a single model substrate; standard defense components prove ineffective against QC, implying that adversarial robustness evaluation must account for architecture-dependent sensitivity.

链接: https://arxiv.org/abs/2604.00605
作者: Daye Kang,Hyeongboo Baek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures, 3 tables

点击查看摘要

Abstract:The primary tools used to monitor and defend object detectors under adversarial attack assume that when accuracy degrades, detection count drops in tandem. This coupling was assumed, not measured. We report a counterexample observed on a single model: under standard PGD, EMS-YOLO, a spiking neural network (SNN) object detector, retains more than 70% of its detections while mAP collapses from 0.528 to 0.042. We term this count-preserving accuracy collapse Quality Corruption (QC), to distinguish it from the suppression that dominates untargeted evaluation. Across four SNN architectures and two threat models (l-infinity and l-2), QC appears only in one of the four detectors tested (EMS-YOLO). On this model, all five standard defense components fail to detect or mitigate QC, suggesting the defense ecosystem may rely on a shared assumption calibrated on a single substrate. These results provide, to our knowledge, the first evidence that adversarial failure modes can be substrate-dependent.
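The decoupling the paper reports can be expressed as two retention ratios (an illustrative sketch; the absolute detection counts and the flagging thresholds below are hypothetical, only the mAP figures come from the abstract):

```python
def quality_corruption_score(clean, attacked):
    """Quality Corruption check: count-preserving accuracy collapse.
    Returns (count_retention, map_retention); QC is the regime where
    the first stays high while the second collapses."""
    count_ret = attacked["detections"] / clean["detections"]
    map_ret = attacked["mAP"] / clean["mAP"]
    return count_ret, map_ret

# mAP values from the abstract; detection counts normalized to 100
# (hypothetical) so that the >70% retention figure is visible.
clean = {"detections": 100, "mAP": 0.528}
attacked = {"detections": 71, "mAP": 0.042}
count_ret, map_ret = quality_corruption_score(clean, attacked)
is_qc = count_ret > 0.7 and map_ret < 0.1  # illustrative thresholds
```

A defense that only monitors detection count would see count_ret near 0.71 and raise no alarm, which is the paper's point about the shared assumption.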

[CV-66] KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering

【速读】:This paper addresses two core problems in medical visual question answering (Med-VQA): existing methods fail to fully integrate domain-specific medical knowledge, making it hard to accurately associate lesion features in medical images with key diagnostic criteria; and classification-based methods rely on predefined answer sets, so they adapt poorly to the diversity of free-form answers and tend to overlook fine-grained semantic information within them. The key to the solution is a knowledge graph enhanced cross-Mamba interaction framework (KG-CMI), whose core innovations include a knowledge graph embedding (KGE) module that structures professional medical knowledge into the model, establishing explicit associations between lesion features and disease knowledge, and a free-form answer enhanced multi-task learning (FAMT) module that exploits auxiliary information in open-ended questions to improve the model's understanding of free-form answers, yielding clear performance gains across multiple datasets.

链接: https://arxiv.org/abs/2604.00601
作者: Xianyao Zheng,Hong Yu,Hui Cui,Changming Sun,Xiangyu Li,Ran Su,Leyi Wei,Jia Zhou,Junbo Wang,Qiangguo Jin
机构: Northwestern Polytechnical University (西北工业大学); Tianjin Central Hospital of Gynecology Obstetrics (天津妇产医院); La Trobe University (拉特罗布大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61); Harbin Institute of Technology (哈尔滨工业大学); Tianjin University (天津大学); Macao Polytechnic University (澳门理工学院); Tianjin Chest Hospital (天津胸科医院); Yangtze River Delta Research Institute of Northwestern Polytechnical University (西北工业大学长三角研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits its ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced multi-task learning (FAMT) module. The KG-CMI learns cross-modal feature representations for images and texts by effectively integrating professional medical knowledge through a graph, establishing associations between lesion features and disease knowledge. Moreover, FAMT leverages auxiliary knowledge from open-ended questions, improving the model’s capability for open-ended Med-VQA. Experimental results demonstrate that KG-CMI outperforms existing state-of-the-art methods on three Med-VQA datasets, i.e., VQA-RAD, SLAKE, and OVQA. Additionally, we conduct interpretability experiments to further validate the framework’s effectiveness.

[CV-67] Towards Viewpoint-Robust End-to-End Autonomous Driving with 3D Foundation Model Priors CVPR

【速读】:This paper addresses the degraded robustness of trajectory planning in autonomous driving systems under camera viewpoint changes, especially viewpoints unseen during training. The key to the solution is an augmentation-free method that leverages geometric priors from a 3D foundation model: per-pixel 3D positions derived from depth estimates are injected into the model as positional embeddings, and intermediate geometric features are fused via cross-attention, improving the model's adaptability to camera viewpoint changes.

链接: https://arxiv.org/abs/2604.00597
作者: Hiroki Hashimoto,Hiromichi Goto,Hiroyuki Sugai,Hiroshi Kera,Kazuhiko Kawamoto
机构: Chiba University(千叶大学); SUZUCA.AI; National Institute of Informatics(信息科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR Workshop on Simulation for Autonomous Driving 2026

点击查看摘要

Abstract:Robust trajectory planning under camera viewpoint changes is important for scalable end-to-end autonomous driving. However, existing models often depend heavily on the camera viewpoints seen during training. We investigate an augmentation-free approach that leverages geometric priors from a 3D foundation model. The method injects per-pixel 3D positions derived from depth estimates as positional embeddings and fuses intermediate geometric features through cross-attention. Experiments on the VR-Drive camera viewpoint perturbation benchmark show reduced performance degradation under most perturbation conditions, with clear improvements under pitch and height perturbations. Gains under longitudinal translation are smaller, suggesting that more viewpoint-agnostic integration is needed for robustness to camera viewpoint changes.
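The per-pixel 3D positions injected as positional embeddings come from unprojecting a depth map through the camera intrinsics; a minimal pinhole-model sketch (the intrinsics and depth values below are hypothetical):

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Turn a depth map into per-pixel 3D positions using pinhole
    intrinsics: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # (h, w, 3) camera-frame points

depth = np.full((4, 4), 2.0)  # toy depth map: flat wall 2 m away
pts = unproject_depth(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
```

Because these 3D positions live in a camera-independent metric frame (up to the camera pose), they carry viewpoint information that pure 2D pixel embeddings lack, which is the intuition behind the robustness gain.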

[CV-68] FecalFed: Privacy-Preserving Poultry Disease Detection via Federated Learning CVPR2026

【速读】:该论文旨在解决家禽疾病(尤其是高致病性禽流感)早期检测中因农场数据隐私顾虑和机构数据孤岛导致的计算机视觉模型规模化部署难题,同时应对现有开源农业数据集普遍存在的未标注数据污染问题。其关键解决方案是提出一个名为FecalFed的隐私保护联邦学习框架:首先构建并公开了一个经严格去重处理的粪便图像数据集poultry-fecal-fl(含8,770张唯一图像,覆盖4类疾病),揭示并消除了主流公共数据库中高达46.89%的数据重复率;其次,在高度非独立同分布(non-IID)条件下(Dirichlet α=0.5)验证了该框架的有效性——相比单农场训练仅达64.86%准确率,采用服务器端自适应优化(FedAdam)与Swin-Small架构的联邦方法实现90.31%准确率,接近集中式训练上限(95.10%),且边缘优化的Swin-Tiny模型仍保持89.74%竞争力,从而为农场级禽类疾病监测提供了高效、隐私优先的技术范式。

链接: https://arxiv.org/abs/2604.00559
作者: Tien-Yu Chi
机构: National Taiwan University (国立台湾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CVPR 2026 Workshop on Vision for Agriculture

点击查看摘要

Abstract:Early detection of highly pathogenic avian influenza (HPAI) and endemic poultry diseases is critical for global food security. While computer vision models excel at classifying diseases from fecal imaging, deploying these systems at scale is bottlenecked by farm data privacy concerns and institutional data silos. Furthermore, existing open-source agricultural datasets frequently suffer from severe, undocumented data contamination. In this paper, we introduce FecalFed, a privacy-preserving federated learning framework for poultry disease classification. We first curate and release poultry-fecal-fl, a rigorously deduplicated dataset of 8,770 unique images across four disease classes, revealing and eliminating a 46.89% duplication rate in popular public repositories. To simulate realistic agricultural environments, we evaluate FecalFed under highly heterogeneous, non-IID conditions (Dirichlet α=0.5). While isolated single-farm training collapses under this data heterogeneity, yielding only 64.86% accuracy, our federated approach recovers performance without centralizing sensitive data. Specifically, utilizing server-side adaptive optimization (FedAdam) with a Swin-Small architecture achieves 90.31% accuracy, closely approaching the centralized upper bound of 95.10%. Furthermore, we demonstrate that an edge-optimized Swin-Tiny model maintains highly competitive performance at 89.74%, establishing a highly efficient, privacy-first blueprint for on-farm avian disease monitoring.
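摘要中的 Dirichlet α=0.5 非独立同分布划分是联邦学习实验里模拟客户端数据异构性的常见手段:对每个类别按 Dirichlet 先验抽取比例,再把该类样本切分给各客户端,α 越小分布越偏斜。下面是该做法的一个通用示意草图(函数名与参数均为笔者假设,非论文官方代码):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with a per-class Dirichlet prior.

    Smaller alpha -> more heterogeneous (non-IID) client label
    distributions, mirroring the Dirichlet alpha = 0.5 setting above.
    """
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet([alpha] * n_clients)        # per-class shares
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

labels = np.repeat(np.arange(4), 100)   # 4 disease classes, 100 samples each
parts = dirichlet_partition(labels, n_clients=5)
assert sum(len(p) for p in parts) == len(labels)  # every sample assigned once
```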

[CV-69] STAR: Mitigating Cascading Errors in Spatial Reasoning via Turn-point Alignment and Segment-level DPO ICME2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂拓扑结构下的空间推理任务中,现有方法如可视化思维(Visualization-of-Thought, VoT)易产生级联错误的问题。其解决方案的关键在于提出一种基于拓扑锚点的两阶段框架STAR:第一阶段通过监督微调(Supervised Fine-Tuning)使模型内化空间语义并修剪冗余路径;第二阶段采用空间感知的分段直接偏好优化(Spatial-aware Segment-level Direct Preference Optimization, SDPO),以提升长程导航中的自我修正能力。实验表明STAR在开源模型中达到最优性能:其32B版本超越DeepSeek-V3(29.27% vs. 25.00%),并达到GPT-4性能的82.4%。

链接: https://arxiv.org/abs/2604.00558
作者: Pukun Zhao,Longxiang Wang,Chen Chen,Peicheng Wang,Fanqing Zhou,Runze Li,Haojian Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, 4 tables, Accepted by ICME 2026

点击查看摘要

Abstract:Structured spatial navigation is a core benchmark for Large Language Models (LLMs) spatial reasoning. Existing paradigms like Visualization-of-Thought (VoT) are prone to cascading errors in complex topologies. To solve this, we propose STAR, a two-stage framework grounded on topological anchors, and introduce the RedMaze-23K dataset with human-inspired turnpoint annotations. The first stage uses supervised fine-tuning to help models internalize spatial semantics and prune redundant paths. The second adopts Spatial-aware Segment-level Direct Preference Optimization (SDPO) to refine self-correction in long-horizon navigation. Experiments show STAR achieves state-of-the-art performance among open-source models: its 32B variant outperforms DeepSeek-V3 (29.27% vs. 25.00%) and reaches 82.4% of GPT-4’s performance.
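摘要中的分段级DPO(SDPO)建立在标准DPO损失之上:对一对"偏好/拒绝"的路径片段,比较策略模型与冻结参考模型下的对数概率差。下面演示该损失的计算方式(把DPO套用到导航转折点片段是笔者的理解性示意,并非论文的精确公式):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO loss for one chosen (w) / rejected (l) pair.

    Inputs are summed log-probs of each path segment under the policy
    and a frozen reference model; loss = -log(sigmoid(beta * margin)).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 策略比参考模型更偏好正确片段 -> margin > 0 -> 损失更低
good = dpo_loss(-1.0, -5.0, -2.0, -2.0)      # margin = 0.1 * 4 = 0.4
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # margin = 0 -> loss = log 2
```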

[CV-70] Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning

【速读】:该论文旨在解决模仿学习(Imitation Learning)在机器人操作任务中泛化能力受限的问题,其根本原因在于专家示范(expert demonstrations)的多样性不足,而跨环境收集示范又成本高昂且难以实现。解决方案的关键在于通过扩展摄像机视角(scaling camera views)来利用场景内在的多样性,无需额外人工干预:在示范采集阶段,使用多个同步摄像机视角从每条专家轨迹生成伪示范(pseudo-demonstrations),从而丰富训练分布并提升视觉表示对视角变化的不变性;同时引入多视角动作聚合方法,使单视角策略在部署时也能受益于多视角信息。该方法显著提升了数据效率和泛化性能,且仅需少量硬件扩展,可无缝集成至现有模仿学习算法中。

链接: https://arxiv.org/abs/2604.00557
作者: Yichen Xie,Yixiao Wang,Shuqi Zhao,Cheng-En Wu,Masayoshi Tomizuka,Jianwen Xie,Hao-Shu Fang
机构: UC Berkeley (加州大学伯克利分校); Lambda, Inc (Lambda公司); MIT (麻省理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo-demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera-space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single-view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real-world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single-view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is this https URL.

[CV-71] TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection CVPR26

【速读】:该论文旨在解决共显著目标检测(Co-salient Object Detection, CoSOD)中现有训练依赖方法受限于封闭数据集、泛化能力不足的问题。其解决方案的关键在于提出一种无需训练的新型方法TF-SSD,该方法通过结合Segment Anything Model (SAM)与DINO视觉基础模型(Vision Foundation Models, VFMs)的优势:首先利用SAM生成全面的原始候选掩码池,再通过一个基于SAM的高质量掩码生成器过滤冗余掩码;进一步引入基于DINO注意力图的图像内显著性滤波机制以增强语义感知能力,并设计跨图像原型选择器,利用跨图像原型间的相似性评分筛选出具有高一致性的显著掩码作为最终预测结果。此协同机制有效提升了模型在未见场景下的泛化性能与显著性理解能力。

链接: https://arxiv.org/abs/2604.00549
作者: Zhijin He,Shuo Jin,Siyue Yu,Shuwei Wu,Bingfeng Zhang,Li Yu,Jimin Xiao
机构: XJTLU; University of Liverpool; China University of Petroleum (East China); Nanjing University of Information Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR26

点击查看摘要

Abstract:Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they still remain constrained by the closed-set datasets and exhibit limited generalization. However, few studies explore the potential of Vision Foundation Models (VFMs) to address CoSOD, which demonstrate a strong generalized ability and robust saliency understanding. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO’s attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Codes are available at this https URL.

[CV-72] Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations CVPR2026

【速读】:该论文旨在解决当前前馈重建模型(Feed-forward Reconstruction Models, FFRMs)在训练过程中对多视角几何标注(如3D点云和相机位姿)的过度依赖问题,这一依赖限制了其在大规模场景下的可扩展性。解决方案的关键在于提出一种弱监督范式Reliev3R,该方法无需昂贵的多视角几何数据和计算密集的结构光恢复(Structure-from-Motion)预处理,而是直接利用预训练模型零样本预测得到的单目相对深度和图像稀疏对应关系来提供监督信号。Reliev3R的核心创新在于设计了一种歧义感知的相对深度损失和基于三角几何的重投影损失,从而有效促进多视角几何一致性建模,使模型在仅用较少标注数据的情况下即可从头训练,并达到与全监督模型相当的重建性能。

链接: https://arxiv.org/abs/2604.00548
作者: Youyu Chen,Junjun Jiang,Yueru Luo,Kui Jiang,Xianming Liu,Xu Yan,Dave Zhenyu Chen
机构: Harbin Institute of Technology (哈尔滨工业大学); Huawei (华为); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency. Training from scratch with the less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervisions and scalable FFRMs.
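摘要中"基于三角几何的重投影损失"的核心运算是:把一个3D点经相对位姿与内参投影到另一视角,与稀疏对应的2D点比较像素残差。下面是该运算的最小示意(内参、位姿与点坐标均为演示假设;论文中3D点与位姿如何在无标注条件下获得,此处不做还原):

```python
import numpy as np

def reprojection_residual(X, K, R, t, uv):
    """Pixel residual of 3D point X projected into a second view.

    K: intrinsics; (R, t): relative pose; uv: matched 2D correspondence.
    Penalizing this residual over sparse matches is one standard way to
    supervise multi-view geometric consistency.
    """
    x = K @ (R @ X + t)           # project into the second camera
    return x[:2] / x[2] - uv      # normalized pixel error

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
X = np.array([0.0, 0.0, 2.0])                    # point on the optical axis
res = reprojection_residual(X, K, np.eye(3), np.zeros(3),
                            np.array([320.0, 240.0]))
# 位姿为恒等时,该点恰好重投影到主点,残差为零
```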

[CV-73] Neuropsychiatric Deviations From Normative Profiles: An MRI-Derived Marker for Early Alzheimer's Disease Detection

【速读】:该论文旨在解决当前神经精神症状(Neuropsychiatric Symptoms, NPS)评估工具无法区分其是正常老化表现还是阿尔茨海默病(Alzheimer’s Disease, AD)早期标志的问题,从而限制了NPS作为AD早期检测指标的临床应用价值。解决方案的关键在于提出了一种基于深度学习的规范性建模框架,利用结构磁共振成像(structural MRI)数据训练一个三维卷积神经网络(3D Convolutional Neural Network),学习大脑解剖结构与神经精神问卷量表(Neuropsychiatric Inventory Questionnaire, NPIQ)评分之间的映射关系;通过计算预测得分与实际观察得分之间的差异(即DNPI,Divergence from NPIQ scores),量化个体NPS负担的异常程度,该指标显著关联未来AD转化风险,并具备与脑脊液Aβ42相当的预测效能(AUC=0.74 vs 0.75)。

链接: https://arxiv.org/abs/2604.00545
作者: Synne Hjertager Osenbroch,Lisa Ramona Rosvold,Yao Lu,Alvaro Fernandez-Quilez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and to be presented (ORAL) in ISBI 2026

点击查看摘要

Abstract:Neuropsychiatric symptoms (NPS) such as depression and apathy are common in Alzheimer’s disease (AD) and often precede cognitive decline. NPS assessments hold promise as early detection markers due to their correlation with disease progression and their non-invasive nature. Yet current tools cannot distinguish whether NPS are part of aging or early signs of AD, limiting their utility. We present a deep learning-based normative modelling framework to identify atypical NPS burden from structural MRI. A 3D convolutional neural network was trained on cognitively stable participants from the Alzheimer’s Disease Neuroimaging Initiative, learning the mapping between brain anatomy and Neuropsychiatric Inventory Questionnaire (NPIQ) scores. Deviations between predicted and observed scores defined the Divergence from NPIQ scores (DNPI). Higher DNPI was associated with future AD conversion (adjusted OR=2.5; p < 0.01) and achieved predictive accuracy comparable to cerebrospinal fluid Aβ42 (AUC=0.74 vs 0.75). Our approach supports scalable, non-invasive strategies for early AD detection.

[CV-74] TRiGS: Temporal Rigid-Body Motion for Scalable 4D Gaussian Splatting WWW

【速读】:该论文旨在解决现有4D Gaussian Splatting (4DGS) 方法在动态场景重建中因采用分段线性速度近似和短时窗建模所导致的严重时间碎片化问题,进而引发对象长期时间身份丢失及高斯点数量无限制增长,阻碍其扩展至长视频序列。解决方案的关键在于提出TRiGS,一种基于统一连续几何变换的新颖4D表示方法,通过整合SE(3)变换、分层贝塞尔残差(hierarchical Bezier residuals)与可学习局部锚点(learnable local anchors),实现对单个基元(primitive)刚体运动的几何一致建模,从而保持时间连续性和对象身份,并有效抑制内存无界增长。

链接: https://arxiv.org/abs/2604.00538
作者: Suwoong Yeom,Joonsik Nam,Seunggyu Choi,Lucas Yunkyu Lee,Sangmin Kim,Jaesik Park,Joonsoo Kim,Kugjin Yun,Kyeongbo Kong,Sukju Kang
机构: Samsung Research(三星研究院); Samsung Electronics(三星电子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent 4D Gaussian Splatting (4DGS) methods achieve impressive dynamic scene reconstruction but often rely on piecewise linear velocity approximations and short temporal windows. This disjointed modeling leads to severe temporal fragmentation, forcing primitives to be repeatedly eliminated and regenerated to track complex nonlinear dynamics. This makeshift approximation eliminates the long-term temporal identity of objects and causes an inevitable proliferation of Gaussians, hindering scalability to extended video sequences. To address this, we propose TRiGS, a novel 4D representation that utilizes unified, continuous geometric transformations. By integrating SE(3) transformations, hierarchical Bezier residuals, and learnable local anchors, TRiGS models geometrically consistent rigid motions for individual primitives. This continuous formulation preserves temporal identity and effectively mitigates unbounded memory growth. Extensive experiments demonstrate that TRiGS achieves high fidelity rendering on standard benchmarks while uniquely scaling to extended video sequences (e.g., 600 to 1200 frames) without severe memory bottlenecks, significantly outperforming prior works in temporal stability.
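摘要中用于残差运动建模的贝塞尔曲线是标准构件,可用de Casteljau算法连续求值:对控制点反复做线性插值直到只剩一个点。以下草图演示三次贝塞尔轨迹的求值(层级化组织与SE(3)变换部分未还原,控制点为虚构示例):

```python
import numpy as np

def bezier(control_pts, t):
    """Evaluate a Bezier curve at times t via de Casteljau's algorithm.

    control_pts: (n+1, 3) control points; t: (T,) values in [0, 1].
    Low-order curves like this give smooth, time-continuous 3D residuals.
    """
    pts = np.broadcast_to(control_pts[:, None, :],
                          (len(control_pts), len(t), 3)).copy()
    while len(pts) > 1:  # repeated linear interpolation between neighbors
        pts = (1 - t)[None, :, None] * pts[:-1] + t[None, :, None] * pts[1:]
    return pts[0]        # (T, 3) trajectory samples

ctrl = np.array([[0, 0, 0], [1, 2, 0], [2, 2, 0], [3, 0, 0]], float)
traj = bezier(ctrl, np.linspace(0, 1, 5))
# 曲线端点与首末控制点重合, 中间点被控制点平滑牵引
```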

[CV-75] MATHENA: Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy

【速读】:该论文旨在解决口腔全景X光片(Orthopantomogram, OPG)中多任务协同诊断问题,包括牙齿检测、龋齿分割(Caries Segmentation, CarSeg)、异常检测(Anomaly Detection, AD)和牙发育阶段分类(Dental Developmental Staging, DDS)。其核心挑战在于如何统一建模不同任务间的复杂空间关系与共享特征表示。解决方案的关键是提出MATHENA框架,该框架基于Mamba的线性复杂度状态空间模型(State Space Model, SSM),构建了两级结构:第一级MATHE模块采用四方向视觉状态空间(Vision State Space, VSS)块实现O(N)全局上下文建模,精准定位每颗牙齿并生成裁剪图像;第二级HENA模块为轻量级Mamba-UNet,具有三头架构,其中龋齿分割作为上游任务先训练并冻结共享特征,再用于下游异常检测和牙发育阶段分类的线性探针微调,从而实现稳定高效的多任务学习。

链接: https://arxiv.org/abs/2604.00537
作者: Kyeonghun Kim,Jaehyung Park,Youngung Han,Anna Jung,Seongbin Park,Sumin Lee,Jiwon Yang,Jiyoon Han,Subeen Lee,Junsu Lim,Hyunsu Go,Eunseob Choi,Hyeonseok Jung,Soo Yong Kim,Woo Kyoung Jeong,Won Jae Lee,Pa Hong,Hyuk-Jae Lee,Ken Ying-Kai Liao,Nam-Joon Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Dental diagnosis from Orthopantomograms (OPGs) requires coordination of tooth detection, caries segmentation (CarSeg), anomaly detection (AD), and dental developmental staging (DDS). We propose Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy (MATHENA), a unified framework leveraging Mamba’s linear-complexity State Space Models (SSM) to address all four tasks. MATHENA integrates MATHE, a multi-resolution SSM-driven detector with four-directional Vision State Space (VSS) blocks for O(N) global context modeling, generating per-tooth crops. These crops are processed by HENA, a lightweight Mamba-UNet with a triple-head architecture and Global Context State Token (GCST). In the triple-head architecture, CarSeg is first trained as an upstream task to establish shared representations, which are then frozen and reused for downstream AD fine-tuning and DDS classification via linear probing, enabling stable, efficient learning. We also curate PARTHENON, a benchmark comprising 15,062 annotated instances from ten datasets. MATHENA achieves 93.78% mAP@50 in tooth detection, 90.11% Dice for CarSeg, 88.35% for AD, and 72.40% ACC for DDS.

[CV-76] FreqPhys: Repurposing Implicit Physiological Frequency Prior for Robust Remote Photoplethysmography

【速读】:该论文旨在解决远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)在实际应用中因运动伪影和光照波动导致生理信号易被噪声淹没的问题。现有方法多依赖时域建模,难以有效分离微弱的生理特征。其解决方案的关键在于提出一种频域引导的rPPG框架FreqPhys,通过引入生理频率先验信息提升信号恢复的鲁棒性:首先使用生理带通滤波模块抑制带外干扰,再结合生理频谱调制与自适应频谱选择增强脉搏相关频段成分并抑制带内残留噪声;进一步利用跨域表示学习模块融合频域先验与深层时域特征以捕捉时空依赖关系,并最终采用频域感知的条件扩散过程逐步重建高质量rPPG信号。

链接: https://arxiv.org/abs/2604.00534
作者: Wei Qian,Dan Guo,Jinxing Zhou,Bochao Zou,Zitong Yu,Meng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) enables contactless physiological monitoring by capturing subtle skin-color variations from facial videos. However, most existing methods predominantly rely on time-domain modeling, making them vulnerable to motion artifacts and illumination fluctuations, where weak physiological clues are easily overwhelmed by noise. To address these challenges, we propose FreqPhys, a frequency-guided rPPG framework that explicitly leverages physiological frequency priors for robust signal recovery. Specifically, FreqPhys first applies a Physiological Bandpass Filtering module to suppress out-of-band interference, and then performs Physiological Spectrum Modulation together with adaptive spectral selection to emphasize pulse-related frequency components while suppress residual in-band noise. A Cross-domain Representation Learning module further fuses these spectral priors with deep time-domain features to capture informative spatial–temporal dependencies. Finally, a frequency-aware conditional diffusion process progressively reconstructs high-fidelity rPPG signals. Extensive experiments on six benchmarks demonstrate that FreqPhys yields significant improvements over state-of-the-art approaches, particularly under challenging motion conditions. It highlights the importance of explicitly modeling physiological frequency priors. The source code will be released.
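摘要中的"生理带通滤波"可以理解为在频域只保留心率可行频段(约0.7–3 Hz,对应42–180 bpm),压制带外的光照漂移与运动干扰。下面是一个最小示意(频段边界为rPPG文献常见取值,并非论文的确切模块参数):

```python
import numpy as np

def physiological_bandpass(signal, fs, lo=0.7, hi=3.0):
    """Zero out spectral components outside the plausible pulse band.

    lo/hi in Hz bound the heart-rate band (~42-180 bpm); everything
    outside is treated as out-of-band interference and suppressed.
    """
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(spec, n=len(signal))

fs = 30.0                                   # 30 fps facial video
t = np.arange(0, 10, 1 / fs)                # 10 s window
pulse = np.sin(2 * np.pi * 1.2 * t)         # 72 bpm pulse component
drift = 0.5 * np.sin(2 * np.pi * 0.1 * t)   # slow illumination drift
clean = physiological_bandpass(pulse + drift, fs)
# 滤波后信号基本只剩脉搏分量, 漂移被移除
```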

[CV-77] AceTone: Bridging Words and Colors for Conditional Image Grading CVPR2026

【速读】:该论文旨在解决传统色彩分级(color grading)方法在应对多样化创作意图和人类审美偏好时的泛化能力不足问题,现有方法通常依赖局部区域重着色或固定滤波器组,难以实现语义可控且风格一致的色彩调整。其解决方案的关键在于提出AceTone,一个统一框架下的多模态条件色彩分级方法,将色彩调整建模为生成式颜色变换任务,通过文本提示或参考图像条件生成3D-LUT(Lookup Table),并引入基于VQ-VAE的离散令牌化器压缩LUT向量以提升效率与保真度(ΔE < 2),同时构建大规模数据集AceTone-800K并结合视觉-语言模型与强化学习优化感知保真度与美学一致性,从而显著优于现有方法,在文本引导和参考图像引导任务中均实现LPIPS指标提升最高达50%,并通过人评验证其结果在视觉美感与风格一致性上的优越性。

链接: https://arxiv.org/abs/2604.00530
作者: Tianren Ma,Mingxiang Liao,Xijin Zhang,Qixiang Ye
机构: University of Chinese Academy of Sciences (中国科学院大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project Page: this http URL

点击查看摘要

Abstract:Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a 3×32³ LUT vector to 64 discrete tokens with ΔE < 2 fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone’s results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
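摘要中模型的输出是3D-LUT(32³个格点、每点3个颜色值)。下面演示把一个3D-LUT应用到RGB图像上的基本方式(为简短起见用最近邻查表,实际调色通常用三线性插值;代码与论文实现无关,仅说明LUT的作用机制):

```python
import numpy as np

def apply_lut_nearest(image, lut):
    """Apply a 3D color LUT of shape (S, S, S, 3) to an RGB image in [0, 1].

    Each pixel's (r, g, b) indexes the LUT grid; nearest-neighbor lookup
    keeps the sketch short (production grading uses trilinear interpolation).
    """
    S = lut.shape[0]
    idx = np.clip(np.round(image * (S - 1)).astype(int), 0, S - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]

S = 32                                        # matches a 32^3 LUT
grid = np.linspace(0, 1, S)
r, g, b = np.meshgrid(grid, grid, grid, indexing="ij")
identity = np.stack([r, g, b], axis=-1)       # identity LUT: no color change
img = np.random.default_rng(0).random((4, 4, 3))
out = apply_lut_nearest(img, identity)
assert np.max(np.abs(out - img)) <= 0.5 / (S - 1)  # only quantization error
```

任何风格化调色都可表示为对这张恒等LUT的逐格点修改,这正是论文中令牌化与生成的对象。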

[CV-78] Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

【速读】:该论文旨在解决3D视觉定位(3D Visual Grounding, 3D-VG)任务中现有方法依赖预处理点云导致的静态流程问题,即传统方法将定位简化为提案匹配,难以有效处理复杂空间语义。其解决方案的关键在于提出一个动态代理框架“Think, Act, Build (TAB)”,该框架通过解耦任务:利用2D视觉语言模型(VLM)解析复杂空间语义,同时借助确定性多视角几何重建3D结构;特别地,引入语义锚定几何扩展机制(Semantic-Anchored Geometric Expansion),在参考视频片段中锚定目标后,结合相机参数将2D视觉线索映射至3D坐标,从而实现从原始RGB-D流直接生成目标的3D表示,克服了因严格VLM语义跟踪导致的多视角覆盖不足问题。

链接: https://arxiv.org/abs/2604.00528
作者: Haibo Wang,Zihao Lin,Zhiyang Xu,Lifu Huang
机构: University of California, Davis (加州大学戴维斯分校); Virginia Tech (弗吉尼亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose “Think, Act, Build (TAB)”, a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to “Build” the target’s 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.

[CV-79] Learnability-Guided Diffusion for Dataset Distillation CVPR2026

【速读】:该论文旨在解决现有数据蒸馏方法中存在冗余训练信号的问题,即生成的合成数据集中的样本往往包含重叠信息而非互补知识,导致训练效率低下。其解决方案的关键在于提出一种基于可学习性(learnability)驱动的数据蒸馏框架,通过分阶段增量构建合成数据集:从初始小样本集出发,利用当前模型的可学习性得分识别其能从中学习的新信息,并据此生成具有高训练效用且与参考模型一致性的样本,从而形成自适应课程学习路径。该方法显著降低冗余度39.1%,并提升各训练阶段的样本特异性,最终在ImageNet-1K、ImageNette和ImageWoof等基准上取得最优性能。

链接: https://arxiv.org/abs/2604.00519
作者: Jeffrey A. Chan-Santiago,Mubarak Shah
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted to CVPR 2026

点击查看摘要

Abstract:Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%). Our code is available on our project page this https URL.

[CV-80] Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition

【速读】:该论文旨在解决现有可穿戴传感器辅助动物行为识别(Animal Activity Recognition, AAR)方法中,尽管整体分类性能较好,但特定行为类别的识别准确率仍不理想的问题。这一问题主要源于采样率设置不当和类别不平衡(class imbalance)导致的分类器偏向多数类。为实现农场动物各类行为的高精度识别,作者提出一种新型的个体行为感知网络(Individual-Behavior-Aware Network, IBA-Net),其核心创新在于两个模块:一是基于Mixture-of-Experts(MoE)的特征定制模块(MFC),通过自适应融合多采样率数据,提取针对不同行为特性的定制化特征;二是神经坍缩驱动的分类器校准模块(NC3),在分类阶段引入固定等角紧框架(ETF)作为分类器,最大化类别间分类向量夹角,从而缓解类别不平衡带来的负面影响,提升少数类行为的识别性能。

链接: https://arxiv.org/abs/2604.00517
作者: Axiu Mao,Meilu Zhu,Lei Shen,Xiaoshuai Wang,Tomas Norton,Kai Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 26 pages, 14 figures

点击查看摘要

Abstract:With the rapid advancements in deep learning techniques, wearable sensor-aided animal activity recognition (AAR) has demonstrated promising performance, thereby improving livestock management efficiency as well as animal health and welfare monitoring. However, existing research often prioritizes overall performance, overlooking the fact that classification accuracies for specific animal behavioral categories may remain unsatisfactory. This issue typically stems from suboptimal sampling rates or class imbalance problems. To address these challenges and achieve high classification accuracy across all individual behaviors in farm animals, we propose a novel Individual-Behavior-Aware Network (IBA-Net). This network enhances the recognition of each specific behavior by simultaneously customizing features and calibrating the classifier. Specifically, considering that different behaviors require varying sampling rates to achieve optimal performance, we design a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module. This module adaptively fuses data from multiple sampling rates, capturing customized features tailored to various animal behaviors. Additionally, to mitigate classifier bias toward majority classes caused by class imbalance, we develop a Neural Collapse-driven Classifier Calibration (NC3) module. This module introduces a fixed equiangular tight frame (ETF) classifier during the classification stage, maximizing the angles between pair-wise classifier vectors and thereby improving the classification performance for minority classes. To validate the effectiveness of IBA-Net, we conducted experiments on three public datasets covering goat, cattle, and horse activity recognition. The results demonstrate that our method consistently outperforms existing approaches across all datasets.
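摘要中NC3模块使用固定的等角紧框架(ETF)分类器:K个单位类向量两两夹角余弦均为 -1/(K-1),即最大等角分离。以下是单纯形ETF的标准构造草图(维度与类数为演示取值,非论文超参数):

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Build a fixed simplex equiangular tight frame classifier (dim x K).

    All pairwise cosines equal -1/(K-1), the maximal equal separation;
    requires dim >= K. The frame is fixed, not learned.
    """
    K = num_classes
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))   # orthonormal columns
    M = np.sqrt(K / (K - 1)) * (U @ (np.eye(K) - np.ones((K, K)) / K))
    return M / np.linalg.norm(M, axis=0)                 # unit-norm class vectors

W = simplex_etf(num_classes=5, dim=16)
cos = W.T @ W
# 非对角元素全部等于 -1/(K-1) = -0.25
```

由于分类向量固定且等角,少数类不会因梯度失衡而被挤压到与多数类相近的方向,这正是其缓解类别不平衡的几何直觉。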

[CV-81] MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning

【速读】:该论文旨在解决三维(3D)医学影像(如CT扫描)深度学习模型训练中因标注数据稀缺而导致的性能瓶颈问题。现有方法常依赖在自然图像上预训练,但存在显著域偏移(domain shift),而当前主流自监督学习(SSL)框架又未能充分利用CT数据的3D结构特性,通常将3D扫描视为独立的2D切片处理,从而丢失了轴向一致性与3D空间上下文信息。为克服这一局限,作者提出MAESIL(autoencoder for enhanced self-supervised medical image learning),其核心创新在于引入“超patch”(superpatch)——一种基于3D块的输入单元,在保持3D结构信息的同时兼顾计算效率;并通过双掩码策略(dual-masking strategy)驱动3D掩码自编码器(3D masked autoencoder)学习更全面的空间表征,显著提升了重建质量(PSNR和SSIM指标)。

链接: https://arxiv.org/abs/2604.00514
作者: Kyeonghun Kim,Hyeonseok Jung,Youngung Han,Junsu Lim,YeonJu Jean,Seongbin Park,Eunseob Choi,Hyunsu Go,SeoYoung Ju,Seohyoung Park,Gyeongmin Kim,MinJu Kwon,KyungSeok Yuh,Soo Yong Kim,Ken Ying-Kai Liao,Nam-Joon Kim,Hyuk-Jae Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures. Accepted at ICEIC 2026

点击查看摘要

Abstract:Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that fundamentally discards critical axial coherence and the 3D structural context. To address this limitation, we propose the Masked Autoencoder for Enhanced Self-supervised Medical Image Learning (MAESIL), a novel self-supervised learning framework designed to capture 3D structural information efficiently. The core innovation is the ‘superpatch’, a 3D chunk-based input unit that balances 3D context preservation with computational efficiency. Our framework partitions the volume into superpatches and employs a 3D masked autoencoder with a dual-masking strategy to learn comprehensive spatial representations. We validated our approach on three diverse large-scale public CT datasets. Our experimental results show that MAESIL demonstrates significant improvements over existing methods such as AE, VAE and VQ-VAE in key reconstruction metrics such as PSNR and SSIM. This establishes MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks.
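摘要中的"超patch"即把3D体数据切成不重叠的立方块,再按MAE方式随机掩码、只把可见块送入编码器。以下为该切分与掩码步骤的最小示意(块大小与掩码比例为笔者演示假设,非论文超参数):

```python
import numpy as np

def make_superpatches(volume, size):
    """Partition a CT volume (D, H, W) into non-overlapping 3D superpatches.

    Returns an array of shape (N, size, size, size); trailing voxels that
    do not fill a whole chunk are cropped off for simplicity.
    """
    D, H, W = volume.shape
    s = size
    v = volume[: D - D % s, : H - H % s, : W - W % s]
    v = v.reshape(D // s, s, H // s, s, W // s, s)
    return v.transpose(0, 2, 4, 1, 3, 5).reshape(-1, s, s, s)

vol = np.arange(8 * 8 * 8, dtype=float).reshape(8, 8, 8)
patches = make_superpatches(vol, size=4)              # 2*2*2 = 8 superpatches
mask = np.random.default_rng(0).random(len(patches)) < 0.75  # MAE-style mask
visible = patches[~mask]                              # only these feed the encoder
```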

[CV-82] RT-GS: Gaussian Splatting with Reflection and Transmittance Primitives

【速读】:该论文旨在解决高斯点绘(Gaussian Splatting, GS)在重建复杂材质时无法同时准确建模镜面反射(specular reflection)和半透明表面透射(transmittance)的问题,而这两种光学现象对实现真实感的新视角合成至关重要。现有方法未能从物理角度正确模拟这些交互过程,导致重建结果在包含镜面或透明物体的场景中表现不佳。解决方案的关键在于提出RT-GS框架,其核心创新是将微表面材质模型(microfacet material model)与可微分光线追踪(differentiable ray tracing)相结合,并通过分离的高斯原型分别表示反射和透射成分,从而实现镜面反射与透射效应的联合建模与渲染,显著提升了复杂环境中透明与反射结构的重建质量。

链接: https://arxiv.org/abs/2604.00509
作者: Kunnong Zeng,Chensheng Peng,Yichen Xie,Masayoshi Tomizuka,Cem Yuksel
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gaussian Splatting is a powerful tool for reconstructing diffuse scenes, but it struggles to simultaneously model specular reflections and the appearance of objects behind semi-transparent surfaces. These specular reflections and transmittance are essential for realistic novel view synthesis, and existing methods do not properly incorporate the underlying physical processes to simulate them. To address this issue, we propose RT-GS, a unified framework that integrates a microfacet material model and ray tracing to jointly model specular reflection and transmittance in Gaussian Splatting. We accomplish this by using separate Gaussian primitives for reflections and transmittance, which allow modeling distant reflections and reconstructing objects behind transparent surfaces concurrently. We utilize a differentiable ray tracing framework to obtain the specular reflection and transmittance appearance. Our experiments demonstrate that our method successfully produces reflections and recovers objects behind transparent surfaces in complex environments, achieving significant qualitative improvements over prior methods where these specular light interactions are prominent.

[CV-83] RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection CVPR2026

【Quick Read】: This paper addresses two core problems in weakly supervised human-object interaction (HOI) detection: first, lacking localization signals, prior methods rely on an external object detector to generate candidate pairs and perform pairwise reasoning, which incurs heavy computation and scales poorly; second, false positives from non-interactive combinations hinder accurate instance-level HOI reasoning. The key to the solution is the Relational Grounding Transformer (RegFormer), which uses spatially grounded signals to guide the reasoning process and promotes locality-aware interaction learning, so that humans, objects, and their interactions can be identified directly under image-level supervision, enabling efficient transfer from image-level reasoning to precise instance-level reasoning without additional training.

Link: https://arxiv.org/abs/2604.00507
Authors: Jihwan Park,Chanhyeong Yang,Jinyoung Park,Taehoon Song,Hyunwoo J. Kim
Affiliations: KAIST; LG Energy Solution
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR2026

Click to view abstract

Abstract:Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for the reasoning process and promotes locality-aware interaction learning. By learning localized interaction cues, our module distinguishes humans, objects, and their interactions, enabling direct transfer from image-level interaction reasoning to precise and efficient instance-level reasoning without additional training. Our extensive experiments and analyses demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models. Our code is available at this https URL.

[CV-84] PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

【Quick Read】: This paper targets the performance bottlenecks in open-set object detection (OSOD) caused by the difficulty of aligning text representations with complex visual concepts and by the scarcity of image-text pairs for rare categories, which lead to poor results in specialized domains or on complex objects. The key to the solution is PET-DINO, a universal detector supporting both text and visual prompts, with two core innovations: an Alignment-Friendly Visual Prompt Generation (AFVPG) module built on an advanced text-prompted detector, which mitigates the limitations of text-representation guidance and shortens the development cycle; and two prompt-enriched training strategies, iteration-level Intra-Batch Parallel Prompting (IBP) and training-level Dynamic Memory-Driven Prompting (DMD), which model multiple prompt routes simultaneously and improve alignment and generalization across diverse real-world usage scenarios.

Link: https://arxiv.org/abs/2604.00503
Authors: Weifu Fu,Jinyang Li,Bin-Bin Gao,Jialin Li,Yuhuan Lin,Hanqiu Deng,Wenbing Tao,Yong Liu,Chengjie Wang
Affiliations: YouTu Lab, Tencent; Huazhong University of Science and Technology; Kling Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: this https URL.

[CV-85] PC-SAM: Patch-Constrained Fine-Grained Interactive Road Segmentation in High-Resolution Remote Sensing Images

【Quick Read】: This paper addresses two limitations of current remote-sensing road segmentation: fully automatic models still produce false positives and false negatives, and they lack support for fine-grained local refinement of specific regions. Existing methods struggle with road recognition in complex scenes and do not let users interactively adjust or refine local areas. The key to the solution is the PC-SAM framework, which uses a carefully designed fine-tuning strategy to constrain the influence of each point prompt to its corresponding image patch, overcoming the original Segment Anything Model's (SAM) inability to perform fine-grained local corrections on remote-sensing imagery and unifying fully automatic segmentation with interactive local refinement.

Link: https://arxiv.org/abs/2604.00495
Authors: Chengcheng Lv,Rushi Li,Mincheng Wu,Xiufang Shi,Zhenyu Wen,Shibo He
Affiliations: Zhejiang University of Technology; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Road masks obtained from remote sensing images effectively support a wide range of downstream tasks. In recent years, most studies have focused on improving the performance of fully automatic segmentation models for this task, achieving significant gains. However, current fully automatic methods are still insufficient for identifying certain challenging road segments and often produce false positive and false negative regions. Moreover, fully automatic segmentation does not support local segmentation of regions of interest or refinement of existing masks. Although the SAM model is widely used as an interactive segmentation model and performs well on natural images, it shows poor performance in remote sensing road segmentation and cannot support fine-grained local refinement. To address these limitations, we propose PC-SAM, which integrates fully automatic road segmentation and interactive segmentation within a unified framework. By carefully designing a fine-tuning strategy, the influence of point prompts is constrained to their corresponding patches, overcoming the inability of the original SAM to perform fine local corrections and enabling fine-grained interactive mask refinement. Extensive experiments on several representative remote sensing road segmentation datasets demonstrate that, when combined with point prompts, PC-SAM significantly outperforms state-of-the-art fully automatic models in road mask segmentation, while also providing flexible local mask refinement and local road segmentation. The code will be available at this https URL.

[CV-86] ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction

【Quick Read】: This paper extends the auto-regressive next-scale prediction paradigm from 2D images to 3D object generation, a direction that has been largely unexplored. The core challenge is generating 3D content efficiently with controllable levels of detail and high visual fidelity. The key to the solution is an auto-regressive Gaussian splatting (ARGS) framework that introduces a Gaussian simplification strategy and reverses it to guide next-scale generation; a hierarchical tree structure compresses the generation process to $\mathcal{O}(\log n)$ steps, where $n$ is the number of points, and a tree-based transformer predicts the tree structure auto-regressively, letting leaf nodes attend to their internal ancestors to enhance structural consistency.

Link: https://arxiv.org/abs/2604.00494
Authors: Quanyuan Ruan,Kewei Shi,Jiabao Lei,Xifeng Gao,Xiaoguang Han
Affiliations: South China University of Technology; The University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; Lightspeed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Auto-regressive frameworks for next-scale prediction of 2D images have demonstrated strong potential for producing diverse and sophisticated content by progressively refining a coarse input. However, extending this paradigm to 3D object generation remains largely unexplored. In this paper, we introduce auto-regressive Gaussian splatting (ARGS), a framework for making next-scale predictions in parallel for generation according to levels of detail. We propose a Gaussian simplification strategy and reverse the simplification to guide next-scale generation. Benefiting from the use of hierarchical trees, the generation process requires only $\mathcal{O}(\log n)$ steps, where $n$ is the number of points. Furthermore, we propose a tree-based transformer to predict the tree structure auto-regressively, allowing leaf nodes to attend to their internal ancestors to enhance structural consistency. Extensive experiments demonstrate that our approach effectively generates multi-scale Gaussian representations with controllable levels of detail, visual fidelity, and a manageable time consumption budget.
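As a back-of-envelope check of the logarithmic step count (the branching factor 8 below is an assumption for illustration, not from the paper), the number of coarse-to-fine prediction steps needed to cover n points is just the depth of a balanced tree:

```python
def num_generation_steps(n_points: int, branching: int = 8) -> int:
    """Depth of a balanced tree with `branching` children per node whose
    leaves cover n_points, i.e. the number of coarse-to-fine prediction
    steps; this grows as O(log n) rather than O(n)."""
    steps, capacity = 0, 1
    while capacity < n_points:
        capacity *= branching
        steps += 1
    return steps

steps = [num_generation_steps(n) for n in (8, 64, 100_000)]  # [1, 2, 6]
```

Even 100,000 points need only 6 next-scale predictions at this branching factor, which is why the tree-based generation stays cheap as point counts grow.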

[CV-87] A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

【Quick Read】: This paper addresses the lack of interpretability in current medical-imaging AI systems for chest X-ray (CXR) interpretation: existing models output only final diagnostic predictions without making explicit how visual evidence translates into radiographic findings and diagnostic conclusions. The key to the solution is CheXOne, a reasoning-enabled vision-language model that jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces linking visual evidence, radiographic findings, and final predictions. Trained on 14.7 million annotated samples with a two-stage framework combining instruction tuning and reinforcement learning to improve reasoning quality, it outperforms existing medical and general-domain foundation models in zero-shot evaluations across multiple dimensions; a clinical reader study further confirms that its generated reports match or exceed resident-written reports in quality while improving the efficiency of report writing and image interpretation.

Link: https://arxiv.org/abs/2604.00493
Authors: Yabin Zhang,Chong Wang,Yunhe Gao,Jiaming Liu,Maya Varma,Justin Xu,Sophie Ostmeier,Jin Long,Sergios Gatidis,Seena Dehkharghani,Arne Michalson,Eun Kyoung Hong,Christian Bluethgen,Haiwei Henry Guo,Alexander Victor Ortiz,Stephan Altmayer,Sandhya Bodapati,Joseph David Janizek,Ken Chang,Jean-Benoit Delbrouck,Akshay S. Chaudhari,Curtis P. Langlotz
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Codes: this https URL Models: this https URL

Click to view abstract

Abstract:Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

[CV-88] All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models CVPR2026

【Quick Read】: This paper addresses a limitation of reinforcement learning (RL) for improving the reasoning ability of vision-language models (VLMs): GRPO (Group Relative Policy Optimization) is prone to diversity collapse of reasoning strategies, with the model prematurely converging to a few reasoning paths, falling into local optima, and scaling poorly. The key to the solution is Multi-Group Policy Optimization (MUPO), which is designed to incentivize the exploration of divergent reasoning paths across multiple solutions, effectively mitigating diversity collapse, with its effectiveness validated on established benchmarks.

Link: https://arxiv.org/abs/2604.00479
Authors: Xinyu Tian,Shu Zou,Zhaoyuan Yang,Mengqi He,Peter Tu,Jing Zhang
Affiliations: Australian National University; Shanghai AI Lab; GE Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2026

Click to view abstract

Abstract:Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet narrow reasoning, while base models, despite less refined along individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: this https URL
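For context on the mechanism MUPO builds on: GRPO scores each sampled response relative to the other rollouts in its own group. A minimal sketch of that group-relative advantage (not the authors' implementation; MUPO's multi-group extension is not shown):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each sampled response's reward
    against the mean/std of its own group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# four rollouts for one prompt, two of them rewarded
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are defined only relative to the group, once all rollouts give the same answer the advantages vanish, which is one intuition for the diversity collapse the paper analyzes.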

[CV-89] Automated Detection of Multiple Sclerosis Lesions on 7-tesla MRI Using U-net and Transformer-based Segmentation

【Quick Read】: This paper addresses automated white-matter lesion (WML) segmentation for multiple sclerosis (MS) on ultra-high-field 7-tesla (7T) MRI, where segmentation tools developed on lower-field (1.5-3T) data transfer poorly due to differences in contrast and artifacts. The key to the solution is twofold: reference masks are first built with expert manual revision, and two transformer-based deep models (3D UNETR and SegFormer) are then trained on 7T FLAIR images at multiple resolutions, in particular retaining the native 0.5×0.5×0.5 mm³ high-resolution data to improve small-lesion detection. Experiments show the proposed approach matches the more recent LST-AI in overall overlap while recovering additional small lesions missed by classical methods, and clearly outperforms the classical LST-LPA tool (voxel-wise Dice improves from 0.39 to 0.61, lesion-wise Dice from 0.02 to 0.20), validating training strategies tailored to 7T characteristics.

Link: https://arxiv.org/abs/2604.00469
Authors: Michael Maynord,Minghui Liu,Cornelia Fermüller,Seongjin Choi,Yuxin Zeng,Shishir Dahal,Daniel M. Harrison
Affiliations: University of Maryland; University of Maryland School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages, 3 figures, 3 tables. Inference code and model weights available at this https URL

Click to view abstract

Abstract:Ultra-high field 7-tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5-3T imaging, suggesting that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST-LPA and the more recent LST-AI ensemble, both originally developed on lower-field data. We then trained 3D UNETR and SegFormer transformer-based models on 7T FLAIR at multiple resolutions (0.5×0.5×0.5 mm³, 1.0×1.0×1.0 mm³, and 1.5×1.5×2.0 mm³) and evaluated all methods using voxel-wise and lesion-wise metrics from the BraTS 2023 framework. On the held-out test set at native 0.5×0.5×0.5 mm³ resolution, 7T-trained transformers achieved competitive overlap with LST-AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact-related false positives. On a held-out 7T test set, our best transformer model (SegFormer) achieved a voxel-wise Dice of 0.61 and lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small-lesion detection. By releasing our 7T-trained models, we aim to provide a reproducible, ready-to-use resource for automated lesion quantification in ultra-high field MS research (this https URL).
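The voxel-wise Dice figures quoted above follow the standard overlap formula, 2|A∩B| / (|A| + |B|); a minimal version for flattened binary masks:

```python
def dice(pred, target):
    """Voxel-wise Dice between two binary masks given as flat 0/1 sequences.
    Returns 1.0 by convention when both masks are empty."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total else 1.0

score = dice([1, 1, 0, 0], [1, 0, 0, 0])  # 2*1 / (2+1) = 0.666...
```

Lesion-wise Dice applies the same formula at the level of detected lesion instances rather than individual voxels, which is why it is far more sensitive to missed small lesions.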

[CV-90] Out of Sight Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation CVPR2026

【Quick Read】: This paper exposes the architectural vulnerability of tracking-by-query-propagation (TBP) multi-object tracking (MOT) methods to adversarial attacks: while TBP methods achieve long-range temporal modeling through end-to-end (E2E) designs, their reliance on query propagation makes them susceptible to malicious interference. The key to the solution is the FADE attack framework, with two targeted strategies: (i) Temporal Query Flooding, which exhausts the tracker's limited query budget with spatio-temporally consistent spurious queries, forcing valid tracks to terminate; and (ii) Temporal Memory Corruption, which directly attacks the query updater's memory, severing temporal links via state de-correlation and erasing the feature identity of matched tracks. The authors further build a differentiable pipeline that optimizes the attacks for physical-world realizability using simulations of advanced perception-sensor spoofing; experiments show FADE severely degrades state-of-the-art TBP trackers, causing numerous identity switches and track terminations.

Link: https://arxiv.org/abs/2604.00452
Authors: Halima Bouzidi,Haoyu Liu,Yonatan Gizachew Achamyeleh,Praneetsai Vasu Iddamsetty,Mohammad Abdullah Al Faruque
Affiliations: University of California, Irvine, CA, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at CVPR 2026 (main track)

Click to view abstract

Abstract:Recent Tracking-by-Query-Propagation (TBP) methods have advanced Multi-Object Tracking (MOT) by enabling end-to-end (E2E) pipelines with long-range temporal modeling. However, this reliance on query propagation introduces unexplored architectural vulnerabilities to adversarial attacks. We present FADE, a novel attack framework designed to exploit these specific vulnerabilities. FADE employs two attack strategies targeting core TBP mechanisms: (i) Temporal Query Flooding: Generates spurious temporally consistent track queries to exhaust the tracker’s limited query budget, forcing it to terminate valid tracks. (ii) Temporal Memory Corruption: Directly attacks the query updater’s memory by severing temporal links via state de-correlation and erasing the learned feature identity of matched tracks. Furthermore, we introduce a differentiable pipeline to optimize these attacks for physical-world realizability by leveraging simulations of advanced perception sensor spoofing. Experiments on MOT17 and MOT20 benchmarks demonstrate that FADE is highly effective against state-of-the-art TBP trackers, causing significant identity switches and track terminations.

[CV-91] Learning Humanoid Navigation from Human Data

【Quick Read】: This paper addresses autonomous navigation for humanoid robots in diverse, unseen environments, in particular how to achieve robustness and generalization without any robot data or fine-tuning. The key to the solution is the end-to-end EgoNav framework: a diffusion model predicts distributions of future trajectories conditioned on the past trajectory; a 360° visual memory fusing color, depth, and semantics is combined with video features from a frozen DINOv3 backbone to capture appearance cues invisible to depth sensors; and a hybrid sampling scheme enables real-time inference in only 10 denoising steps, with a receding-horizon controller selecting the best path from the predicted distribution. Without extra training, complex behaviors such as waiting for doors to open, avoiding crowds, and recognizing glass walls emerge naturally in unseen scenes.

Link: https://arxiv.org/abs/2604.00416
Authors: Weizhuo Wang,Yanjie Ze,C. Karen Liu,Monroe Kennedy III
Affiliations: Stanford University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures

Click to view abstract

Abstract:We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on past trajectory, a 360 deg visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: this https URL
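The receding-horizon selection step can be pictured as scoring each trajectory sampled from the diffusion prior and committing only to the first waypoint of the cheapest one, then re-planning. The toy 1-D costs below are illustrative assumptions, not the paper's actual objective:

```python
def select_first_waypoint(candidates, collision_cost, goal_cost, w=1.0):
    """One receding-horizon step: score every sampled trajectory and return
    the first waypoint of the cheapest one (re-planned at the next step)."""
    best = min(candidates, key=lambda t: collision_cost(t) + w * goal_cost(t))
    return best[0]

# toy 1-D world: goal at x = 5, obstacle near x = 2
cands = [[1, 2, 3], [0.5, 1.0, 1.5], [1, 3, 5]]
step = select_first_waypoint(
    cands,
    collision_cost=lambda t: sum(10.0 for x in t if abs(x - 2) < 0.5),
    goal_cost=lambda t: abs(t[-1] - 5),
)  # picks [1, 3, 5]: collision-free and ends at the goal
```

Because only the first step of the winner is executed, the controller can exploit the multi-modal coverage of the predicted distribution while staying reactive to new observations.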

[CV-92] The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation CVPR2026

【Quick Read】: This paper addresses referring video object segmentation under motion-centric language expressions (MeViS-Text), which requires jointly understanding target appearance, temporal behavior, and object interactions. The key to the solution is a fully training-free three-stage pipeline: first, Gemini-3.1 Pro decomposes the target event into instance-level grounding targets and generates discriminative descriptions; second, a SAM3-agent produces a precise seed mask on a key frame, which the official SAM3 tracker propagates through the whole video; finally, Qwen3.5-Plus with behavior-level verification refines ambiguous or semantically inconsistent predictions. The method achieves the best results on the PVUW 2026 MeViS-Text test set, with a Final score of 0.909064 and a JF score of 0.7897.

Link: https://arxiv.org/abs/2604.00404
Authors: Xusheng He,Canyang Wu,Jinrong Zhang,Weili Guan,Jianlong Wu,Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 1st Place Solution for the 5th PVUW MeViS-Text Challenge (CVPR 2026 Workshop)

Click to view abstract

Abstract:This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a JF score of 0.7897. The code is available at this https URL.

[CV-93] COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

【Quick Read】: This paper addresses the performance degradation of trajectory-prediction models deployed across geographic regions, in particular the domain gap that arises when models trained on Western road environments (e.g., the U.S.) are transferred to regions such as South Korea, with different traffic patterns, infrastructure, and driving behaviors. The key to the solution is a systematic comparison of four transfer-learning strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. It finds that fine-tuning only the decoder while freezing the encoder maintains high prediction accuracy while greatly improving training efficiency, reducing prediction error by over 66% compared with training from scratch, and thus offers an efficient, practical transfer strategy for deploying trajectory-prediction models in new geographic domains.

Link: https://arxiv.org/abs/2604.00402
Authors: Seohyoung Park,Jaeyeol Lim,Seoyoung Ju,Kyeonghun Kim,Nam-Joon Kim,Hyuk-Jae Lee
Affiliations: Ewha Womans University; Seoul National University; Sangmyung University; OUTTA; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures. Accepted at ICEIC 2026

Click to view abstract

Abstract:Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.
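The winning strategy, freezing the encoder and fine-tuning only the decoder, amounts to flipping per-parameter trainability flags (in PyTorch one would set `p.requires_grad = False` on encoder parameters). A framework-free sketch with hypothetical parameter names:

```python
def freeze_encoder(param_names):
    """Map each named parameter to a trainable flag: everything under the
    `encoder.` prefix is frozen, the decoder stays trainable."""
    return {name: not name.startswith("encoder.") for name in param_names}

# hypothetical parameter names for illustration only
names = ["encoder.layer1.weight", "encoder.attn.bias", "decoder.head.weight"]
flags = freeze_encoder(names)
trainable = sorted(n for n, keep in flags.items() if keep)
```

Freezing the encoder keeps the scene representation learned from the large source-domain dataset intact while the much smaller decoder adapts to target-domain driving behavior, which is one plausible reading of why this trade-off works well with limited Korean data.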

[CV-94] Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

【Quick Read】: This paper addresses the performance drop of brain-metastases (BM) segmentation models across institutions, caused by heterogeneity in scanners, imaging protocols, and patient populations. The key to the solution is a preprocessing pipeline that combines a variational autoencoder with a maximum mean discrepancy (VAE-MMD) loss, enhanced with skip connections and self-attention for better feature alignment and integrated with an nnU-Net segmentation network, effectively reducing cross-institutional distribution differences without target-domain labels and markedly improving segmentation accuracy and generalization.

Link: https://arxiv.org/abs/2604.00397
Authors: Yuchen Yang,Shuangyang Zhong,Haijun Yu,Langcuomu Suo,Hongbin Han,Florian Putz,Yixing Huang
Affiliations: Peking University Health Science Center; Beijing Key Laboratory of Intelligent Neuromodulation and Brain Disorder Treatment; University Hospital Erlangen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 figures and 1 table

Click to view abstract

Abstract:Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a singular institution often exhibit suboptimal performance at various sites due to disparities in scanner hardware, imaging protocols, and patient demographics. The goal of this work is to create a domain adaptation framework that will allow for BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases: Stanford, UCSF, UCLM, and PKG, evaluated by domain classifier’s accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. The combined method raised the mean F1 by 11.1% (0.700 to 0.778), the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers compared to the baseline nnU-Net. Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without necessitating target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation. 
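The MMD term in VAE-MMD penalizes distribution mismatch between sites. A minimal biased estimator with an RBF kernel (the 1-D toy samples and the bandwidth gamma are illustrative assumptions, not the paper's setup):

```python
import math

def mmd2(xs, ys, gamma=1.0):
    """Biased squared-MMD estimate between two samples with an RBF kernel:
    E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    k = lambda a, b: math.exp(-gamma * (a - b) ** 2)
    kxx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

same = mmd2([0.0, 1.0], [0.0, 1.0])      # identical samples -> 0
shifted = mmd2([0.0, 1.0], [5.0, 6.0])   # shifted domain -> clearly positive
```

Minimizing such a term over latent features from different institutions pushes their feature distributions together, which is consistent with the reported drop in domain-classifier accuracy from 0.91 to chance level.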

[CV-95] VLM-in-the-Loop: A Plug-In Quality Assurance Module for ECG Digitization Pipelines

【Quick Read】: This paper addresses the severe performance collapse of ECG digitization on real-world clinical images despite strong benchmark results. The core solution is VLM-in-the-Loop, a plug-in quality-assurance module that wraps any digitization backend with closed-loop feedback from a vision-language model (VLM) through a standardized interface, requiring no modification to the underlying digitizer. The key innovation is tool grounding: anchoring the VLM's assessment in quantitative evidence from domain-specific signal-analysis tools, which markedly improves verdict consistency and fidelity separation, with gains that replicate across multiple mainstream VLMs and digitization backends, indicating a pattern-level rather than model-specific improvement.

Link: https://arxiv.org/abs/2604.00396
Authors: Jiachen Li,Shihao Li,Soovadeep Bakshi,Wei Li,Dongmei Chen
Affiliations: The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:ECG digitization could unlock billions of archived clinical records, yet existing methods collapse on real-world images despite strong benchmark numbers. We introduce **VLM-in-the-Loop**, a plug-in quality assurance module that wraps any digitization backend with closed-loop VLM feedback via a standardized interface, requiring no modification to the underlying digitizer. The core mechanism is **tool grounding**: anchoring VLM assessment in quantitative evidence from domain-specific signal analysis tools. In a controlled ablation on 200 records with paired ground truth, tool grounding raises verdict consistency from 71% to 89% and doubles fidelity separation (ΔPCC 0.03 → 0.08), with the effect replicating across three VLMs (Claude Opus 4, GPT-4o, Gemini 2.5 Pro), confirming a pattern-level rather than model-specific gain. Deployed across four backends, the module improves every one: 29.4% of borderline leads improved on our pipeline; 41.2% of failed limb leads recovered on ECG-Digitiser; valid leads per image doubled on Open-ECG-Digitizer (2.5 → 5.8). On 428 real clinical HCM images, the integrated system reaches 98.0% Excellent quality. Both the plug-in architecture and tool-grounding mechanism are domain-parametric, suggesting broader applicability wherever quality criteria are objectively measurable.

[CV-96] Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge CVPR2026

【Quick Read】: This paper addresses a weakness of SAM3, the current state-of-the-art for complex video object segmentation: its performance drops sharply on tiny and semantic-dominated targets, rooted in its insufficient comprehension of these target types. The key to the solution is TEP, a training-free framework that introduces tracking-enhanced prompts generated with external tracking models and multimodal large language models (MLLMs), improving SAM3's perception and understanding of difficult targets and achieving the top score of 56.91% on the PVUW Challenge 2026 test set.

Link: https://arxiv.org/abs/2604.00395
Authors: Jinrong Zhang,Canyang Wu,Xusheng He,Weili Guan,Jianlong Wu,Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen, China; Shenzhen Loop Area Institute, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 1st Place Solution for the 5th PVUW MOSE Challenge (CVPR 2026 Workshop)

Click to view abstract

Abstract:In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method’s capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic-dominated objects. The root cause of this limitation lies in SAM3’s insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompts. As a training-free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking-enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.

[CV-97] Mine-JEPA: In-Domain Self-Supervised Learning for Mine-Like Object Classification in Side-Scan Sonar CVPR2026

【Quick Read】: This paper addresses side-scan sonar (SSS) mine classification, whose core challenges are extreme data scarcity and a large domain gap from natural images. The key to the solution is Mine-JEPA, the first in-domain self-supervised learning (SSL) pretraining pipeline for SSS, which uses SIGReg, a regularization-based SSL loss, to pretrain on only 1,170 unlabeled sonar images. Experiments show Mine-JEPA outperforms fine-tuned DINOv3 (a vision foundation model pretrained on 1.7B natural images) on both binary (mine vs. non-mine) and 3-class classification, and achieves competitive performance with far fewer parameters using a compact ViT-Tiny backbone. The authors also find that applying in-domain SSL to foundation models actually degrades performance by 10-13 percentage points, suggesting that carefully designed in-domain SSL beats naively adapting large foundation models in data-scarce settings.

Link: https://arxiv.org/abs/2604.00383
Authors: Taeyoun Kwon,Youngwon Choi,Hyeonyu Kim,Myeongkyun Cho,Junhyeok Choi,Moon Hwan Kim
Affiliations: Maum AI Inc.; Seoul National University; KAIST; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures, 6 tables. Accepted at CVPR 2026 MACVi Workshop

Click to view abstract

Abstract:Side-scan sonar (SSS) mine classification is a challenging maritime vision problem characterized by extreme data scarcity and a large domain gap from natural images. While self-supervised learning (SSL) and general-purpose vision foundation models have shown strong performance in general vision and several specialized domains, their use in SSS remains largely unexplored. We present Mine-JEPA, the first in-domain SSL pipeline for SSS mine classification, using SIGReg, a regularization-based SSL loss, to pretrain on only 1,170 unlabeled sonar images. In the binary mine vs. non-mine setting, Mine-JEPA achieves an F1 score of 0.935, outperforming fine-tuned DINOv3 (0.922), a foundation model pretrained on 1.7B images. For 3-class mine-like object classification, Mine-JEPA reaches 0.820 with synthetic data augmentation, again outperforming fine-tuned DINOv3 (0.810). We further observe that applying in-domain SSL to foundation models degrades performance by 10–13 percentage points, suggesting that stronger pretrained models do not always benefit from additional domain adaptation. In addition, Mine-JEPA with a compact ViT-Tiny backbone achieves competitive performance while using 4x fewer parameters than DINOv3. These results suggest that carefully designed in-domain self-supervised learning is a viable alternative to much larger foundation models in data-scarce maritime sonar imagery.

[CV-98] mmAnomaly: Leveraging Visual Context for Robust Anomaly Detection in the Non-Visual World with mmWave Radar

【速读】:该论文旨在解决毫米波雷达(mmWave radar)在异常检测中因信号反射受材料特性、杂波和多径干扰影响而导致的复杂非高斯畸变问题,现有方法缺乏上下文感知能力,常将良性信号变化误判为异常。其解决方案的关键在于提出mmAnomaly框架,通过融合毫米波雷达与RGBD输入,利用快速ResNet分类器提取场景几何和材质等语义线索,并基于条件潜在扩散模型生成与视觉上下文一致的预期毫米波频谱;随后通过双输入对比模块识别实测与合成频谱间的空间偏差以精确定位异常。该设计实现了对复杂环境下的鲁棒异常检测与可解释性定位。

链接: https://arxiv.org/abs/2604.00382
作者: Tarik Reza Toha,Shao-Jung (Louie) Lu,Mahathir Monjur,Shahriar Nirjon
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Accepted at the 24th ACM/IEEE International Conference on Embedded Artificial Intelligence and Sensing Systems (SenSys 2026)

点击查看摘要

Abstract:mmWave radar enables human sensing in non-visual scenarios-e.g., through clothing or certain types of walls-where traditional cameras fail due to occlusion or privacy limitations. However, robust anomaly detection with mmWave remains challenging, as signal reflections are influenced by material properties, clutter, and multipath interference, producing complex, non-Gaussian distortions. Existing methods lack contextual awareness and misclassify benign signal variations as anomalies. We present mmAnomaly, a multi-modal anomaly detection framework that combines mmWave radar with RGBD input to incorporate visual context. Our system extracts semantic cues-such as scene geometry and material properties-using a fast ResNet-based classifier, and uses a conditional latent diffusion model to synthesize the expected mmWave spectrum for the given visual context. A dual-input comparison module then identifies spatial deviations between real and generated spectra to localize anomalies. We evaluate mmAnomaly on two multi-modal datasets across three applications: concealed weapon localization, through-wall intruder localization, and through-wall fall localization. The system achieves up to 94% F1 score and sub-meter localization error, demonstrating robust generalization across clothing, occlusions, and cluttered environments. These results establish mmAnomaly as an accurate and interpretable framework for context-aware anomaly detection in mmWave sensing.

[CV-99] UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration

【速读】:该论文旨在解决Under-display Camera (UDC) 图像恢复中因光衍射和散射导致的空间非均匀退化问题,该问题显著削弱了图像的高频细节。现有基于点扩散函数(PSF)的物理建模方法和频域分离网络虽能有效恢复低频结构并保持色彩一致性,但在处理复杂空间变化退化时仍难以恢复精细纹理。其解决方案的关键在于提出一种轻量级的不确定性感知上下文记忆网络(Uncertainty-aware Context-Memory Network, UCMNet),通过引入不确定性驱动损失学习空间不确定性图,量化由衍射和散射引起的局部不确定性,并以此作为先验引导记忆库从上下文库中检索区域自适应的上下文信息,从而实现对UDC成像中非均匀退化特性的精准建模,最终在多个基准上达到SOTA性能且参数减少30%。

链接: https://arxiv.org/abs/2604.00381
作者: Daehyun Kim,Youngmin Kim,Yoon Ju Oh,Tae Hyun Kim
机构: Hanyang University (汉阳大学); Agency for Defense Development (韩国国防发展局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: We propose UCMNet, an uncertainty-aware adaptive framework that restores high-frequency details in regions with varying levels of degradation in under-display camera images

点击查看摘要

Abstract:Under-display cameras (UDCs) allow for full-screen designs by positioning the imaging sensor underneath the display. Nonetheless, light diffraction and scattering through the various display layers result in spatially varying and complex degradations, which significantly reduce high-frequency details. Current PSF-based physical modeling techniques and frequency-separation networks are effective at reconstructing low-frequency structures and maintaining overall color consistency. However, they still face challenges in recovering fine details when dealing with complex, spatially varying degradation. To solve this problem, we propose a lightweight Uncertainty-aware Context-Memory Network (UCMNet) for UDC image restoration. Unlike previous methods that apply uniform restoration, UCMNet performs uncertainty-aware adaptive processing to restore high-frequency details in regions with varying degradations. The estimated uncertainty maps, learned through an uncertainty-driven loss, quantify spatial uncertainty induced by diffraction and scattering, and guide the Memory Bank to retrieve region-adaptive context from the Context Bank. This process enables effective modeling of the non-uniform degradation characteristics inherent to UDC imaging. Leveraging this uncertainty as a prior, UCMNet achieves state-of-the-art performance on multiple benchmarks with 30% fewer parameters than previous models. Project page: this https URL.

[CV-100] Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition

【速读】:该论文旨在解决RGB-D(彩色与深度)多模态数据中关键局部特征的自适应选择与有效利用问题,以提升室内场景识别的准确性。其解决方案的关键在于提出了一种具有自适应节点选择机制的动态图模型:该模型通过构建动态图来建模物体与场景之间的关系,并采用自适应方法从RGB和深度模态中提取关键局部特征用于图结构建模;同时,这些节点按三个层次分组以表示物体间的远近关系,并根据注意力权重动态更新图结构,最终融合优化后的RGB与深度特征实现更精准的室内场景识别。

链接: https://arxiv.org/abs/2604.00372
作者: Qiong Liu,Ruofei Xiong,Xingzhen Chen,Muyao Peng,You Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modality of color and depth, i.e., RGB-D, is of great importance in recent research of indoor scene recognition. In this kind of data representation, depth map is able to describe the 3D structure of scenes and geometric relations among objects. Previous works showed that local features of both modalities are vital for promotion of recognition accuracy. However, the problem of adaptive selection and effective exploitation on these key local features remains open in this field. In this paper, a dynamic graph model is proposed with adaptive node selection mechanism to solve the above problem. In this model, a dynamic graph is built up to model the relations among objects and scene, and a method of adaptive node selection is proposed to take key local features from both modalities of RGB and depth for graph modeling. After that, these nodes are grouped by three different levels, representing near or far relations among objects. Moreover, the graph model is updated dynamically according to attention weights. Finally, the updated and optimized features of RGB and depth modalities are fused together for indoor scene recognition. Experiments are performed on public datasets including SUN RGB-D and NYU Depth v2. Extensive results demonstrate that our method has superior performance when comparing to state-of-the-arts methods, and show that the proposed method is able to exploit crucial local features from both modalities of RGB and depth.

[CV-101] Neural Reconstruction of LiDAR Point Clouds under Jamming Attacks via Full-Waveform Representation and Simultaneous Laser Sensing

【速读】:该论文旨在解决激光雷达(LiDAR)在自动驾驶感知中易受干扰攻击(jamming attack)的问题,此类攻击通过注入高频激光脉冲使LiDAR传感器完全失效。解决方案的关键在于利用现代LiDAR系统中未被充分挖掘的中间全波形(full-waveform)数据表示,提出PULSAR-Net模型,该模型采用带有轴向空间注意力机制的新型U-Net架构,能够从全波形中区分攻击信号与真实目标回波,并重建受干扰场景下的原始点云。此外,研究还构建了基于物理规律的数据生成管道,以合成真实感强的带干扰全波形数据,从而实现仅用合成数据训练即可在真实静态和动态场景下分别达到92%和73%的重建准确率。

链接: https://arxiv.org/abs/2604.00371
作者: Ryo Yoshida,Takami Sato,Wenlun Zhang,Yuki Hayakawa,Shota Nagai,Takahiro Kado,Taro Beppu,Ibuki Fujioka,Yunshan Zhong,Kentaro Yoshioka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR sensors are critical for autonomous driving perception, yet remain vulnerable to spoofing attacks. Jamming attacks inject high-frequency laser pulses that completely blind LiDAR sensors by overwhelming authentic returns with malicious signals. We discover that while point clouds become randomized, the underlying full-waveform data retains distinguishable signatures between attack and legitimate signals. In this work, we propose PULSAR-Net, capable of reconstructing authentic point clouds under jamming attacks by leveraging previously underutilized intermediate full-waveform representations and simultaneous laser sensing in modern LiDAR systems. PULSAR-Net adopts a novel U-Net architecture with axial spatial attention mechanisms specifically designed to identify attack-induced signals from authentic object returns in the full-waveform representation. To address the lack of full-waveform representations in existing LiDAR datasets under jamming attacks, we introduce a physics-aware dataset generation pipeline that synthesizes realistic full-waveform representations under jamming attacks. Despite being trained exclusively on synthetic data, PULSAR-Net achieves reconstruction rates of 92% and 73% for vehicles obscured by jamming attacks in real-world static and driving scenarios, respectively.

[CV-102] A Dual-Stream Transformer Architecture for Illumination-Invariant TIR-LiDAR Person Tracking

【速读】:该论文旨在解决自主移动机器人在复杂光照条件下(如完全黑暗或强背光)进行鲁棒人体跟踪的问题。传统基于RGB-D的跟踪方法在此类环境中性能显著下降,而现有方案缺乏适用于多模态传感器融合的标注数据支持。解决方案的关键在于提出一种热红外与深度(Thermal-Infrared and Depth, TIR-D)跟踪架构,利用SLAM能力机器人标配的LiDAR和热红外(TIR)相机,并通过一种顺序知识迁移策略,将大规模热红外训练模型的结构先验逐步迁移到TIR-D域;同时采用“细粒度差异学习率策略”(Fine-grained Differential Learning Rate Strategy),在保留预训练特征提取能力的同时,快速适应几何深度信息,从而实现全天候鲁棒的人体跟踪性能。

链接: https://arxiv.org/abs/2604.00363
作者: Yuki Minase,Kanji Tanaka
机构: University of Fukui(福井大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, technical report

点击查看摘要

Abstract:Robust person tracking is a critical capability for autonomous mobile robots operating in diverse and unpredictable environments. While RGB-D tracking has shown high precision, its performance severely degrades under challenging illumination conditions, such as total darkness or intense backlighting. To achieve all-weather robustness, this paper proposes a novel Thermal-Infrared and Depth (TIR-D) tracking architecture that leverages the standard sensor suite of SLAM-capable robots, namely LiDAR and TIR cameras. A major challenge in TIR-D tracking is the scarcity of annotated multi-modal datasets. To address this, we introduce a sequential knowledge transfer strategy that evolves structural priors from a large-scale thermal-trained model into the TIR-D domain. By employing a differential learning rate strategy – referred to as "Fine-grained Differential Learning Rate Strategy" – we effectively preserve pre-trained feature extraction capabilities while enabling rapid adaptation to geometric depth cues. Experimental results demonstrate that our proposed TIR-D tracker achieves superior performance, with an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7%, significantly outperforming conventional RGB-transfer and single-modality baselines. Our approach provides a practical and resource-efficient solution for robust human-following in all-weather robotics applications.

[CV-103] VADMamba: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale Space

【速读】:该论文旨在解决现有视频异常检测(Video Anomaly Detection, VAD)方法对辅助输入(如光流)依赖过强及多任务融合策略限制单任务适用性的问题。其关键解决方案是提出VADMamba++,采用灰度到RGB的单通道到三通道重建范式,强制模型从灰度结构中推断颜色信息,从而通过结构与色度线索间的双重不一致性更有效地揭示异常;同时设计融合Mamba、CNN与Transformer的混合建模骨干网络以捕捉多样化的正常模式并抑制异常表征,并引入任务内融合评分机制,整合显式未来帧预测误差与隐式量化特征误差,在仅使用帧级输入的严格单任务设置下显著提升检测精度与效率。

链接: https://arxiv.org/abs/2604.00360
作者: Jihao Lyu,Minghua Zhao,Jing Hu,Yifei Chen,Shuangli Du,Cheng Shi
机构: Xi’an University of Technology (西安理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:VADMamba pioneered the introduction of Mamba to Video Anomaly Detection (VAD), achieving high accuracy and fast inference through hybrid proxy tasks. Nevertheless, its heavy reliance on optical flow as auxiliary input and inter-task fusion scoring constrains its applicability to a single proxy task. In this paper, we introduce VADMamba++, an efficient VAD method based on the Gray-to-RGB paradigm that enforces a Single-Channel to Three-Channel reconstruction mapping, designed for a single proxy task and operating without auxiliary inputs. This paradigm compels inferring color appearances from grayscale structures, allowing anomalies to be more effectively revealed through dual inconsistencies between structure and chromatic cues. Specifically, VADMamba++ reconstructs grayscale frames into the RGB space to simultaneously discriminate structural geometry and chromatic fidelity, thereby enhancing sensitivity to explicit visual anomalies. We further design a hybrid modeling backbone that integrates Mamba, CNN, and Transformer modules to capture diverse normal patterns while suppressing the appearance of anomalies. Furthermore, an intra-task fusion scoring strategy integrates explicit future-frame prediction errors with implicit quantized feature errors, further improving accuracy under a single task setting. Extensive experiments on three benchmark datasets demonstrate that VADMamba++ outperforms state-of-the-art methods while meeting performance and efficiency, especially under a strict single-task setting with only frame-level inputs.

[CV-104] Label-efficient underwater species classification with semi-supervised learning on frozen foundation model embeddings

【速读】:该论文旨在解决水下影像中物种分类任务因专家标注成本高昂而导致的标注数据稀缺问题,以及现有监督模型在新场景下泛化能力差的问题。其解决方案的关键在于利用冻结的DINOv3 ViT-B视觉基础模型(foundation model)嵌入表示,通过基于最近邻的自训练机制将少量标注种子样本传播至大量未标注数据中,从而实现标签效率显著提升。该方法无需微调、无需领域特定的数据工程或水下适配模型,在仅使用不到5%标注数据的情况下即可接近全监督ConvNeXt模型性能,且在完全标注条件下部分物种表现超越基线,验证了冻结嵌入空间中类别可分性(以ROC-AUC衡量)在极端标签稀缺时依然保持良好,为海洋生物识别提供了一个即插即用的高效基准方案。

链接: https://arxiv.org/abs/2604.00313
作者: Thomas Manuel Rost
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated species classification from underwater imagery is bottlenecked by the cost of expert annotation, and supervised models trained on one dataset rarely transfer to new conditions. We investigate whether semi-supervised methods operating on frozen foundation model embeddings can close this annotation gap with minimal labeling effort. Using DINOv3 ViT-B embeddings with no fine-tuning, we propagate a small set of labeled seeds through unlabeled data via nearest-neighbor-based self-training and evaluate on the AQUA20 benchmark (20 marine species). With fewer than 5% of the training labels, self-training on frozen embeddings closes much of the gap to a fully supervised ConvNeXt baseline trained on the entire labeled dataset; at full supervision, the gap narrows to a few percentage points, with several species exceeding the supervised baseline. Class separability in the embedding space, measured by ROC-AUC, is high even at extreme label scarcity, indicating that the frozen representations capture discriminative structure well before decision boundaries can be reliably estimated. Our approach requires no training, no domain-specific data engineering, and no underwater-adapted models, establishing a practical, immediately deployable baseline for label-efficient marine species recognition. All results are reported on the held-out test set over 100 random seed initializations.
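摘要中描述的"基于最近邻的自训练"(在冻结嵌入上把少量标注种子逐轮传播到未标注样本)可以用如下纯 NumPy 的最小示意代码说明。注意:每轮按置信度接收前一半样本的筛选策略是本文之外的假设性设计,论文的具体实现细节可能不同:

```python
import numpy as np

def self_train_nn(emb, labels, n_rounds=10):
    """在冻结嵌入上做最近邻自训练的假设性示意。
    emb: (N, D) 冻结的嵌入矩阵;labels: (N,) 整数标签,-1 表示未标注。
    每轮把"最近已标注邻居"的标签传播给置信度较高的未标注样本。"""
    labels = labels.copy()
    for _ in range(n_rounds):
        lab_idx = np.where(labels >= 0)[0]
        unl_idx = np.where(labels < 0)[0]
        if len(unl_idx) == 0:
            break
        # 余弦相似度:未标注样本 vs. 已标注样本池
        a = emb[unl_idx] / np.linalg.norm(emb[unl_idx], axis=1, keepdims=True)
        b = emb[lab_idx] / np.linalg.norm(emb[lab_idx], axis=1, keepdims=True)
        sim = a @ b.T
        nn1 = sim.argmax(axis=1)                      # 最近的已标注邻居
        conf = sim[np.arange(len(unl_idx)), nn1]
        keep = conf >= np.quantile(conf, 0.5)         # 假设:每轮只接收置信度较高的一半
        labels[unl_idx[keep]] = labels[lab_idx][nn1[keep]]
    return labels
```

嵌入由基础模型(如 DINOv3)冻结产出,本函数只做标签传播,与摘要中"无需训练、无需微调"的设定一致。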

[CV-105] SANA-I2I: A Text-Free Flow Matching Framework for Paired Image-to-Image Translation with a Case Study in Fetal MRI Artifact Reduction

【速读】:该论文旨在解决医学影像中因运动伪影导致图像质量下降的问题,特别是在胎儿磁共振成像(fetal MRI)中,真实配对数据难以获取且传统方法依赖文本引导的生成模型限制了图像翻译的精度与效率。解决方案的关键在于提出SANA-I2I框架,这是一种无需文本条件的高分辨率图像到图像生成方法,通过在潜在空间中学习基于配对源-目标图像的条件流匹配(conditional flow-matching)模型,构建一个映射目标图像分布的条件速度场(conditional velocity field),从而实现监督式图像翻译。该方法摒弃了语言提示依赖,仅利用合成数据模拟真实运动伪影进行训练,在保持解剖结构完整性的同时显著抑制伪影,且在极少推理步骤下即达到竞争性性能,凸显了其在医学图像处理中的高效性与适用性。

链接: https://arxiv.org/abs/2604.00298
作者: Italo Felix Santos,Gilson Antonio Giraldi,Heron Werner Junior
机构: National Laboratory for Scientific Computing - LNCC (国家科学计算实验室); Biodesign Laboratory Dasa (生物设计实验室达萨)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose SANA-I2I, a text-free high-resolution image-to-image generation framework that extends the SANA family by removing textual conditioning entirely. In contrast to SanaControlNet, which combines text and image-based control, SANA-I2I relies exclusively on paired source-target images to learn a conditional flow-matching model in latent space. The model learns a conditional velocity field that maps a target image distribution to another one, enabling supervised image translation without reliance on language prompts. We evaluate the proposed approach on the challenging task of fetal MRI motion artifact reduction. To enable paired training in this application, where real paired data are difficult to acquire, we adopt a synthetic data generation strategy based on the method proposed by Duffy et al., which simulates realistic motion artifacts in fetal magnetic resonance imaging (MRI). Experimental results demonstrate that SANA-I2I effectively suppresses motion artifacts while preserving anatomical structure, achieving competitive performance few inference steps. These results highlight the efficiency and suitability of our proposed flow-based, text-free generative models for supervised image-to-image tasks in medical imaging.
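摘要中"学习一个把目标分布映射到另一个分布的条件速度场"以及"少步推理"可以用条件流匹配最常见的线性插值路径做一个最小示意(这是对流匹配一般原理的示意性实现,并非论文在潜空间中的具体模型):

```python
import numpy as np

def cfm_target(x_src, x_tgt, t):
    """线性插值路径 x_t = (1-t)*x_src + t*x_tgt 上的流匹配回归目标:
    该路径的速度恒为 x_tgt - x_src,训练时让 v_theta(x_t, t) 回归它。"""
    x_t = (1.0 - t) * x_src + t * x_tgt
    v_star = x_tgt - x_src
    return x_t, v_star

def euler_sample(v_fn, x_src, n_steps=4):
    """少步 Euler 积分:从源(带伪影)潜变量出发,沿速度场前进到 t=1。"""
    x, dt = x_src.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_fn(x, i * dt)
    return x
```

当速度场被学到接近恒定的 x_tgt - x_src 时,即使只有很少的积分步也能逼近目标,这与摘要中"few inference steps"的观察一致。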

[CV-106] The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)中存在的模态间隙(modality gap)问题,即图像和文本在共享嵌入空间中虽然被映射到同一空间,但其几何分布仍存在显著分离,从而限制了跨模态任务如图像描述生成和联合聚类的性能。现有后处理方法仅能缓解全局中心偏移(centroid offset),无法解决底层分布不匹配(distributional mismatch)这一核心问题。论文的关键创新在于提出一种三阶段课程学习框架(Three-Phase Curriculum for Cross-Modal Alignment, TPC-CMA),通过显式分解模态间隙为“中心差距”(Centroid Gap)与“分布差距”(Distribution Gap),并设计联合优化策略同时缩小两者;同时引入梯度感知调度机制,在训练过程中逐步引入对齐信号以实现稳定优化。实验表明,该方法显著提升了跨模态一致性,尤其在强对齐条件下,聚类ARI提升达62.9%,图像描述CIDEr提升57.1%。

链接: https://arxiv.org/abs/2604.00279
作者: Hongyuan Liu,Qinli Yang,Wen Li,Zhong Zhang,Jiaming Liu,Wei Han,Zhili Qin,Jinxia Guo,Junming Shao
机构: University of Electronic Science and Technology of China (电子科技大学); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality (R^2 = 0.986), whereas the commonly used Raw Gap is misleading (R^2 = 0.691). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly reduces both components. The proposed CMA jointly mitigates centroid offsets and reshapes the distributional structure, while a three-phase curriculum with gradient-aware scheduling progressively introduces alignment during training to enable stable optimization. Experiments demonstrate that our method significantly improves cross-modal alignment. With α_target = 0.05, the modality gap is reduced by 66.6% with only 4.84% accuracy drop. Under stronger alignment (α_target = 0.5), the gap is reduced by 82.3%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1% over the original model. Our code and pre-trained models will be made publicly available upon acceptance.

[CV-107] Excite Attend and Segment (EASe): Domain-Agnostic Fine-Grained Mask Discovery with Feature Calibration and Self-Supervised Upsampling

【速读】:该论文旨在解决现有无监督分割方法在复杂多组件形态场景中难以实现细粒度结构识别的问题,此类场景下依赖粗粒度的块级(patch-level)特征表示会抑制必要的细节信息。其解决方案的关键在于提出一个名为EASe(Excite, Attend and Segment)的无监督域无关语义分割框架,该框架通过两项核心技术实现:一是Semantic-Aware Upsampling with Channel Excitation (SAUCE),用于激发低分辨率基础模型(foundation model, FM)特征通道并进行选择性校准,同时跨空间编码图像与FM特征以恢复全分辨率语义表征;二是无需训练的Cue-Attentive Feature Aggregator (CAFE),利用SAUCE注意力得分作为语义分组信号,将聚合特征分割为多粒度掩码。EASe直接在像素级特征表示上操作,从而实现高精度的细粒度密集语义掩码发现。

链接: https://arxiv.org/abs/2604.00276
作者: Deepank Singh,Anurag Nihal,Vedhus Hoskere
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised segmentation approaches have increasingly leveraged foundation models (FM) to improve salient object discovery. However, these methods often falter in scenes with complex, multi-component morphologies, where fine-grained structural detail is indispensable. Many state-of-the-art unsupervised segmentation pipelines rely on mask discovery approaches that utilize coarse, patch-level representations. These coarse representations inherently suppress the fine-grained detail required to resolve such complex morphologies. To overcome this limitation, we propose Excite, Attend and Segment (EASe), an unsupervised domain-agnostic semantic segmentation framework for easy fine-grained mask discovery across challenging real-world scenes. EASe utilizes novel Semantic-Aware Upsampling with Channel Excitation (SAUCE) to excite low-resolution FM feature channels for selective calibration and attends across spatially-encoded image and FM features to recover full-resolution semantic representations. Finally, EASe segments the aggregated features into multi-granularity masks using a novel training-free Cue-Attentive Feature Aggregator (CAFE) which leverages SAUCE attention scores as a semantic grouping signal. EASe, together with SAUCE and CAFE, operate directly at pixel-level feature representations to enable accurate fine-grained dense semantic mask discovery. Our evaluation demonstrates superior performance of EASe over previous state-of-the-arts (SOTAs) across major standard benchmarks and diverse datasets with complex morphologies. Code is available at this https URL

[CV-108] OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在将印刷电路板(Printed Circuit Board, PCB)原理图转化为可机器读取的、包含组件属性、连接关系与几何信息的空间加权网表图(spatially weighted netlist graphs)方面能力不足的问题,而这类图表示正是电子设计自动化(Electronic Design Automation, EDA)流程的核心。解决方案的关键在于提出OmniSch——首个全面评估LMMs在原理图理解与空间网表图构建能力的基准数据集,涵盖1,854张真实世界原理图及四项任务:(1)原理图实体视觉定位;(2)图到图推理以理解元素间拓扑关系;(3)几何推理以生成连接权重;(4)工具增强型代理推理用于视觉搜索。实验揭示了现有LMMs在细粒度定位、布局到图解析、全局连通性推理和视觉探索效率等方面的显著差距,为后续研究提供了明确方向。

链接: https://arxiv.org/abs/2604.00270
作者: Taiting Lu,Kaiyuan Lin,Yuxin Tian,Yubo Wang,Muchuan Wang,Sharique Khatri,Akshit Kartik,Yixi Wang,Amey Santosh Rane,Yida Wang,Yifan Yang,Yi-Chao Chen,Yincheng Jin,Mahanth Gowda
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, despite such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding topological relationship among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps of current LMMs in interpreting schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning and inefficient visual exploration.

[CV-109] Omni-MMSI: Toward Identity-attributed Social Interaction Understanding CVPR2026

【速读】:该论文旨在解决多模态社会交互理解(Multi-Modal Social Interaction Understanding, MMSI)任务中,现有模型在处理原始音频、视觉和语音输入时缺乏可靠的身份归属能力(identity attribution),导致对社交互动的推理不准确的问题。其关键解决方案是提出Omni-MMSI-R,一个基于参考引导(reference-guided)的流水线方法,通过工具辅助生成带有身份标注的社会线索,并结合链式思维(chain-of-thought)进行社会推理;同时构建了参与者级别的参考对(participant-level reference pairs)并在此基础上标注推理注释,从而显著提升模型在真实场景下对社会交互的理解性能。

链接: https://arxiv.org/abs/2604.00267
作者: Xinpeng Li,Bolin Lai,Hardy Chen,Shijian Deng,Cihang Xie,Yuyin Zhou,James Matthew Rehg,Yapeng Tian
机构: University of Texas at Dallas (德克萨斯大学达拉斯分校); Georgia Institute of Technology (佐治亚理工学院); University of California, Santa Cruz (加州大学圣克鲁兹分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: this https URL.

[CV-110] Benchmarking Interaction Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation

【速读】:该论文旨在解决现有协同实例对象导航(Collaborative Instance Object Navigation, CoIN)基准在评估交互能力方面的不足问题,特别是缺乏对导航与协作提问任务的独立、可复现的量化评估机制。现有方法主要关注导航成功率,忽视了对话式交互在解决视觉相似目标歧义中的关键作用。解决方案的关键在于提出首个可复现的基准QAsk-Nav,其核心创新包括:(i) 一种轻量级独立评分的提问协议,用于分离评估交互质量;(ii) 增强的导航协议,包含多样且高质量的目标描述;(iii) 开源数据集,包含28,000条经人工校验的推理与提问轨迹,支持模型交互能力训练与分析。基于此基准,作者进一步开发了Light-CoNav模型,实现了参数量减少3倍、推理速度提升70倍的同时,在未见物体和环境上的泛化性能优于当前最优CoIN方法。

链接: https://arxiv.org/abs/2604.00265
作者: Edoardo Zorzi,Francesco Taioli,Yiming Wang,Marco Cristani,Alessandro Farinelli,Alberto Castellini,Loris Bazzani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose Question-Asking Navigation (QAsk-Nav), the first reproducible benchmark for Collaborative Instance Object Navigation (CoIN) that enables an explicit, separate assessment of embodied navigation and collaborative question asking. CoIN tasks an embodied agent with reaching a target specified in free-form natural language under partial observability, using only egocentric visual observations and interactive natural-language dialogue with a human, where the dialogue can help to resolve ambiguity among visually similar object instances. Existing CoIN benchmarks are primarily focused on navigation success and offer no support for consistent evaluation of collaborative interaction. To address this limitation, QAsk-Nav provides (i) a lightweight question-asking protocol scored independently of navigation, (ii) an enhanced navigation protocol with realistic, diverse, high-quality target descriptions, and (iii) an open-source dataset, that includes 28,000 quality-checked reasoning and question-asking traces for training and analysis of interactive capabilities of CoIN models. Using the proposed QAsk-Nav benchmark, we develop Light-CoNav, a lightweight unified model for collaborative navigation that is 3x smaller and 70x faster than existing modular methods, while outperforming state-of-the-art CoIN approaches in generalization to unseen objects and environments. Project page at this https URL

[CV-111] PRISM: Differentiable Analysis-by-Synthesis for Fixel Recovery in Diffusion MRI

【速读】:该论文旨在解决扩散磁共振成像(Diffusion MRI)微结构拟合中因非凸优化和逐体素(voxelwise)处理导致的纤维峰恢复受限问题,尤其是在纤维交叉角度较小时难以准确分辨的问题。其解决方案的关键在于提出PRISM框架——一个可微分的“分析-合成”(analysis-by-synthesis)方法,该方法在空间块(spatial patches)上端到端地拟合显式多 compartment 前向模型,包含脑脊液(CSF)、灰质、最多K个白质纤维分量(stick-and-zeppelin)及受限扩散分量,并通过排斥力和稀疏性先验实现纤维方向显式估计与软模型选择。此外,PRISM支持快速均方误差(MSE)目标函数和Rician负对数似然(NLL)损失,后者可联合学习噪声标准差σ而无需先验信息,显著提升在低信噪比(SNR=30)下对20°交叉纤维的分辨能力,同时引入轻量级伪影校准模块增强鲁棒性。

链接: https://arxiv.org/abs/2604.00250
作者: Mohamed Abouagour,Atharva Shah,Eleftherios Garyfallidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Diffusion MRI microstructure fitting is nonconvex and often performed voxelwise, which limits fiber peak recovery in narrow crossings. This work introduces PRISM, a differentiable analysis-by-synthesis framework that fits an explicit multi-compartment forward model end-to-end over spatial patches. The model combines cerebrospinal fluid (CSF), gray matter, up to K white-matter fiber compartments (stick-and-zeppelin), and a restricted compartment, with explicit fiber directions and soft model selection via repulsion and sparsity priors. PRISM supports a fast MSE objective and a Rician negative log-likelihood (NLL) that jointly learns sigma without oracle information. A lightweight nuisance calibration module (smooth bias field and per-measurement scale/offset) is included for robustness and regularized to identity in clean-data tests. On synthetic crossing-fiber data (SNR=30; five methods, 16 crossing angles), PRISM achieves 3.5 degrees best-match angular error with 95% recall, which is 1.9x lower than the best baseline (MSMT-CSD, 6.8 degrees, 83% recall); in NLL mode with learned sigma, error drops to 2.3 degrees with 99% recall, resolving crossings down to 20 degrees. On the DiSCo1 phantom (NLL mode), PRISM improves connectivity correlation over CSD baselines at all four tracking angles (best r=.934 at 25 degrees vs. .920 for MSMT-CSD). Whole-brain HCP fitting (~741k voxels, MSE mode) completes in ~12 min on a single GPU with near-identical results across random seeds.

[CV-112] UCell: rethinking generalizability and scaling of bio-medical vision models

【速读】:该论文旨在解决生物医学研究中因高质量标注数据稀缺而导致的模型规模受限问题,即在小数据场景下如何提升小型模型的性能与泛化能力。其解决方案的关键在于设计了一种参数高效的小型神经网络架构——UCell,通过在前向计算图中引入递归结构(recursive structure),显著提升了模型的表达能力,使其在单细胞分割任务上达到比自身大10–20倍的大型模型相当甚至更优的性能,并且无需依赖自然图像上的大规模预训练,仅需显微成像数据即可从头训练,从而摆脱对商业数据源的依赖。此外,实验验证了UCell在少量样本下的强适应性,支持多种少样本和一次性微调场景。

链接: https://arxiv.org/abs/2604.00243
作者: Nicholas Kuang,Vanessa Scalon,Ji Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:The modern deep learning field is a scale-centric one. Larger models have been shown to consistently perform better than smaller models of similar architecture. In many sub-domains of biomedical research, however, the model scaling is bottlenecked by the amount of available training data, and the high cost associated with generating and validating additional high quality data. Despite the practical hurdle, the majority of the ongoing research still focuses on building bigger foundation models, whereas the alternative of improving the ability of small models has been under-explored. Here we experiment with building models with 10-30M parameters, tiny by modern standards, to perform the single-cell segmentation task. An important design choice is the incorporation of a recursive structure into the model's forward computation graph, leading to a more parameter-efficient architecture. We found that for the single-cell segmentation, on multiple benchmarks, our small model, UCell, matches the performance of models 10-20 times its size, and with a similar generalizability to unseen out-of-domain data. More importantly, we found that UCell can be trained from scratch using only a set of microscopy imaging data, without relying on massive pretraining on natural images, and therefore decouples the model building from any external commercial interests. Finally, we examined and confirmed the adaptability of UCell by performing a wide range of one-shot and few-shot fine tuning experiments on a diverse set of small datasets. Implementation is available at this https URL
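摘要中"在前向计算图中引入递归结构,从而获得更高参数效率"的思想,本质上是权重共享:同一组参数被反复应用以加深计算而不增加参数量。下面是一个基于这一描述的假设性示意(并非 UCell 的实际结构):

```python
import numpy as np

def weight_tied_forward(x, W, b, n_iter=4):
    """递归(权重共享)前向传播的假设性示意:
    同一组参数 (W, b) 带残差地重复应用 n_iter 次,
    计算深度随 n_iter 增长,而参数量保持不变。"""
    for _ in range(n_iter):
        x = x + np.tanh(W @ x + b)  # 一次精化步,参数跨步共享
    return x

def param_count(W, b):
    """参数量与递归次数无关。"""
    return W.size + b.size
```

相比之下,非共享的网络每加一层都要新增一份 (W, b),这正是小数据场景下递归结构更"省参数"的原因。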

[CV-113] QUEST: A robust attention formulation using query-modulated spherical attention ICLR2026

[Quick Read]: This paper addresses training instabilities in standard Transformers caused by unbounded growth of query and key vector norms, especially when easy-to-learn spurious patterns exist in the data. The key to the solution is a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), which constrains key vectors to a hyperspherical latent space while still allowing each token to flexibly control the sharpness of the attention distribution, yielding stable training, improved performance, and robustness to data corruptions and adversarial attacks.

Link: https://arxiv.org/abs/2604.00199
Authors: Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten
Affiliations: Linköping University; Qualcomm Auto Ltd Sweden Filial
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2026

Click to view abstract

Abstract:The Transformer model architecture has become one of the most widely used in deep learning and the attention mechanism is at its core. The standard attention formulation uses a softmax operation applied to a scaled dot product between query and key vectors. We explore the role played by norms of the queries and keys, which can cause training instabilities when they arbitrarily increase. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space, while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be easily used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method’s generality. We show that (1) QUEST trains without instabilities and (2) produces models with improved performance (3) that are robust to data corruptions and adversarial attacks.
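
A minimal sketch of the mechanism the abstract describes: keys constrained to the unit hypersphere, with a per-token sharpness term modulating the attention temperature, so logits stay bounded and cannot blow up with key norms. The scalar `sharpness` input is an assumption; the paper's exact query-modulated parameterization is not given in the abstract.

```python
import math

def normalize(v):
    # Project a vector onto the unit hypersphere.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def quest_style_attention(query, keys, values, sharpness):
    """Keys on the unit sphere; `sharpness` is a per-token (query-modulated)
    temperature controlling how peaked the attention distribution is."""
    q = normalize(query)
    ks = [normalize(k) for k in keys]          # hyperspherical constraint
    scores = [sharpness * sum(a * b for a, b in zip(q, k)) for k in ks]
    weights = softmax(scores)                  # bounded logits: no norm blow-up
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

Because cosine similarities are bounded in [-1, 1], the attention entropy is controlled entirely by `sharpness` rather than by arbitrarily growing key norms.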

[CV-114] Sit-to-Stand Transitions Detection and Duration Measurement Using Smart Lacelock Sensor

[Quick Read]: This paper targets fall-risk assessment and mobility monitoring in older adults with declining balance, using the sit-to-stand (SiSt) transition, a key functional movement reflecting lower-limb strength, musculoskeletal health, and fall risk, as an objective quantitative marker. The key to the solution is an automated detection and duration-measurement method built on the Smart Lacelock sensor, a lightweight shoe-mounted device integrating a load cell, accelerometer, and gyroscope to capture multimodal motion signals. Four machine learning classifiers were evaluated on SiSt tasks performed by 16 older adults under the Short Physical Performance Battery (SPPB) protocol; the bagged tree classifier performed best (accuracy 0.98, F1 score 0.8), and the duration of correctly detected SiSt transitions was measured with a mean absolute error of only 0.047 s (SD 0.07 s), demonstrating an accurate, wearable, non-intrusive system suitable for real-world fall-risk screening and long-term mobility tracking.

Link: https://arxiv.org/abs/2604.00175
Authors: Md Rafi Islam, Md Rejwanul Haque, Elizabeth Choma, Shannon Hayes, Siobhan McMahon, Xiangrong Shen, Edward Sazonov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 11 figures

Click to view abstract

Abstract:Postural stability during movement is fundamental to independent living, fall prevention, and overall health, particularly among older adults who experience age-related declines in balance, muscle strength, and mobility. Among daily functional activities, the Sit-to-Stand (SiSt) transition is a critical indicator of lower-limb strength, musculoskeletal health, and fall risk, making it an essential parameter for assessing functional capacity and monitoring physical decline in aging populations. This study presents a methodology SiSt transition detection and duration measurement using the Smart Lacelock sensor, a lightweight, shoe-mounted device that integrates a load cell, accelerometer, and gyroscope for motion analysis. The methodology was evaluated in 16 older adults (age: mean: 76.84, SD: 3.45 years) performing SiSt tasks within the Short Physical Performance Battery (SPPB) protocol. Features extracted from multimodal signals were used to train and evaluate four machine learning classifiers using a 4-fold participant-independent cross-validation to classify SiSt transitions and measure their duration. The bagged tree classifier achieved an accuracy of 0.98 and an F1 score of 0.8 in classifying SiSt transition. The mean absolute error in duration measurement of the correctly classified transitions was 0.047, and the SD was 0.07 seconds. These findings highlight the potential of the Smart Lacelock sensor for real-world fall-risk assessment and mobility monitoring in older adults.
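
The detection-then-duration pipeline can be sketched as follows. The window/hop sizes and the mean/std/range features are illustrative assumptions, not the paper's actual feature set; duration is recovered from the longest contiguous run of windows classified as a SiSt transition.

```python
def window_features(signal, fs, win_s=0.5, hop_s=0.25):
    """Compute illustrative per-window features (mean, std, range) over a
    sliding window; a classifier would consume these to label each window."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        w = signal[start:start + win]
        mean = sum(w) / len(w)
        var = sum((x - mean) ** 2 for x in w) / len(w)
        feats.append((mean, var ** 0.5, max(w) - min(w)))
    return feats

def transition_duration(labels, hop_s=0.25, win_s=0.5):
    """Duration (seconds) of the longest contiguous run of windows labeled 1
    (i.e., classified as sit-to-stand), mapped back to the time axis."""
    best = run = 0
    for lab in labels:
        run = run + 1 if lab == 1 else 0
        best = max(best, run)
    return 0.0 if best == 0 else (best - 1) * hop_s + win_s
```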

[CV-115] Suppressing Non-Semantic Noise in Masked Image Modeling Representations CVPR2026

[Quick Read]: This paper addresses the problem that Masked Image Modeling (MIM) objectives cause learned representations to retain non-semantic information, which hurts downstream inference performance. The key to the solution is a model-agnostic semantic invariance score, computed via Principal Component Analysis (PCA) on real and synthetic non-semantic images, and a suppression method built on it, Semantically Orthogonal Artifact Projection (SOAP), which uses a single linear head to directly suppress non-semantic information in patch representations. As a post-hoc method requiring no additional training, SOAP yields consistent zero-shot improvements across a range of MIM-based models.

Link: https://arxiv.org/abs/2604.00172
Authors: Martine Hjelkrem-Tan, Marius Aasan, Rwiddhi Chakraborty, Gabriel Y. Arteaga, Changkyu Choi, Adín Ramírez Rivera
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in CVPR 2026

Click to view abstract

Abstract:Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.
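
The suppression step the abstract describes — a single linear head that removes non-semantic directions from patch representations — reduces to projecting each feature onto the orthogonal complement of an artifact subspace. The sketch below assumes the artifact directions (e.g., top PCA components fitted on non-semantic images) are already available and orthonormal; fitting them is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(feature, artifact_dirs):
    """Suppress the span of `artifact_dirs` from a patch feature.
    This is the linear-head form of the projection: f <- f - U U^T f,
    where the rows of U are assumed orthonormal artifact directions."""
    out = list(feature)
    for u in artifact_dirs:
        c = dot(out, u)                       # component along artifact dir
        out = [x - c * ux for x, ux in zip(out, u)]
    return out

# Removing the first coordinate axis as a (hypothetical) artifact direction:
cleaned = project_out([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0]])
```

Because `f - U Uᵀ f` is a fixed linear map, the whole operation can be folded into one linear layer attached to any frozen model, matching the "zero training, single linear head" claim.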

[CV-116] Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

[Quick Read]: This paper addresses the difficulty vision-language models (VLMs) face in establishing reliable text anchors in practice, i.e., accurately grounding queried text to its corresponding spatial region in the image. Existing general-purpose and OCR-specific VLMs perform poorly on fine-grained text-region grounding, limiting their reasoning in real-world visual question answering (VQA). The key to the solution is the Q-Mask framework, whose core is a Causal Query-driven Mask Decoder (CQMD): through chain-of-thought (CoT)-style causal visual decoding, it first generates query-conditioned visual masks to localize the text, and only then produces the final OCR output. This locate-then-recognize visual CoT paradigm makes text-anchor construction explicit, substantially improving text localization accuracy and understanding stability.

Link: https://arxiv.org/abs/2604.00161
Authors: Longwei Xu, Feng Feng, Shaojie Zhang, Xin Chen, Hang Li, Anan Du, Hailong Yu, Pei Fu, Zhenbo Luo, Jian Luan
Affiliations: Xiaomi Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.

[CV-117] RawGen: Learning Camera Raw Image Generation

[Quick Read]: This paper addresses the scarcity of high-quality raw-image datasets for low-level vision, which are tied to specific camera hardware and limited in scale. The key to the solution is RawGen, the first diffusion-based text-to-raw generation framework, which synthesizes physically consistent linear representations (such as CIE XYZ or camera-specific raw) for arbitrary target cameras and also supports sRGB-to-raw inverse-ISP (Image Signal Processing) reconstruction. RawGen introduces specialized processing in latent and pixel space, leverages the generative priors of large-scale sRGB diffusion models, and builds a many-to-one inverse-ISP dataset in which multiple sRGB renditions produced with diverse ISP parameters are anchored to a common scene-referred target, enabling it to model and invert unknown, diverse ISP pipelines. It clearly outperforms traditional methods that assume a fixed ISP, and its scalable, text-driven synthetic data can augment training for downstream low-level vision tasks.

Link: https://arxiv.org/abs/2604.00093
Authors: Dongyoung Kim, Junyong Lee, Abhijith Punnappurath, Mahmoud Afifi, Sangmin Han, Alex Levinshtein, Michael S. Brown
Affiliations: Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Cameras capture scene-referred linear raw images, which are processed by onboard image signal processors (ISPs) into display-referred 8-bit sRGB outputs. Although raw data is more faithful for low-level vision tasks, collecting large-scale raw datasets remains a major bottleneck, as existing datasets are limited and tied to specific camera hardware. Generative models offer a promising way to address this scarcity – however, existing diffusion frameworks are designed to synthesize photo-finished sRGB images rather than physically consistent linear representations. This paper presents RawGen, to our knowledge the first diffusion-based framework enabling text-to-raw generation for arbitrary target cameras, alongside sRGB-to-raw inversion. RawGen leverages the generative priors of large-scale sRGB diffusion models to synthesize physically meaningful linear outputs, such as CIE XYZ or camera-specific raw representations, via specialized processing in latent and pixel spaces. To handle unknown and diverse ISP pipelines and photo-finishing effects in diffusion-model training data, we build a many-to-one inverse-ISP dataset where multiple sRGB renditions of the same scene generated using diverse ISP parameters are anchored to a common scene-referred target. Fine-tuning a conditional denoiser and specialized decoder on this dataset allows RawGen to obtain camera-centric linear reconstructions that effectively invert the rendering pipeline. We demonstrate RawGen’s superior performance over traditional inverse-ISP methods that assume a fixed ISP. Furthermore, we show that augmenting training pipelines with RawGen’s scalable, text-driven synthetic data can benefit downstream low-level vision tasks.

[CV-118] Generalizable Dense Reward for Long-Horizon Robotic Tasks

[Quick Read]: This paper addresses the performance degradation of existing robotic foundation policies on long-horizon tasks due to distribution shift and error accumulation, as well as the reliance of reinforcement learning (RL) on manual reward engineering when generalizing across tasks. The key to the solution is VLLR, a dense reward framework with two components: first, large language models (LLMs) decompose tasks into verifiable subtasks and vision-language models (VLMs) estimate task progress to initialize the value function, avoiding prohibitive inference cost during full training; second, an intrinsic reward based on policy self-certainty provides per-step guidance throughout PPO fine-tuning. Experiments show that VLM-based value initialization primarily improves task-completion efficiency, while the self-certainty reward substantially raises success rates, particularly on out-of-distribution tasks.

Link: https://arxiv.org/abs/2604.00055
Authors: Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya
Affiliations: Carnegie Mellon University; Amazon Robotics; UT Austin
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Click to view abstract

Abstract:Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. While reinforcement learning (RL) can finetune these models, it cannot work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress to initialize the value function for a brief warm-up phase, avoiding prohibitive inference cost during full training; and self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations can be found in this https URL
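
The abstract does not define self-certainty precisely; one plausible instantiation is the negative normalized entropy of the policy's action distribution, used as a per-step intrinsic bonus during PPO fine-tuning. The sketch below is an assumption along those lines, with a hypothetical weighting coefficient `beta`.

```python
import math

def self_certainty_reward(action_probs):
    """Hypothetical self-certainty signal: one minus the normalized entropy
    of the policy's action distribution, so confident (peaked) policies get
    a higher intrinsic reward. The paper's exact definition may differ."""
    eps = 1e-12
    h = -sum(p * math.log(p + eps) for p in action_probs if p > 0.0)
    h_max = math.log(len(action_probs))
    return 1.0 - h / h_max  # in [0, 1]: 1 = fully certain

def shaped_reward(extrinsic, action_probs, beta=0.1):
    # Extrinsic task reward plus weighted intrinsic certainty bonus.
    return extrinsic + beta * self_certainty_reward(action_probs)
```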

[CV-119] Semantic Audio-Visual Navigation in Continuous Environments CVPR2026

[Quick Read]: This paper addresses the limitations of existing audio-visual navigation (AVN) methods in continuous environments: reliance on precomputed room impulse responses (RIRs) restricts agents to discrete grid positions, producing spatially discontinuous observations. To establish a more realistic setting, the authors introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), in which agents move freely in 3D space and perceive temporally and spatially coherent audio-visual streams, while targets may intermittently go silent or stop emitting sound entirely, causing loss of goal information. The key to the solution is MAGNet, a multimodal Transformer-based architecture that jointly encodes spatial and semantic goal representations and fuses historical context with self-motion cues, enabling memory-augmented goal reasoning and improved navigation success under dynamic sound-source conditions.

Link: https://arxiv.org/abs/2603.19660
Authors: Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao, Kehan Chen, Gongping Huang
Affiliations: Wuhan University; Zhongguancun Academy; Shandong Jianzhu University; Nankai University; Tsinghua University; CASIA; UCAS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: This paper has been accepted to CVPR 2026

Click to view abstract

Abstract:Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at this https URL.

[CV-120] AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation

[Quick Read]: This paper addresses the computational constraints of deploying large foundation models for chest X-ray (CXR) segmentation in clinical settings. The key to the solution is AdaLoRA-QAT, a two-stage fine-tuning framework whose core innovations are: (1) adaptive low-rank encoder adaptation for parameter-efficient fine-tuning, and (2) selective mixed-precision INT8 quantization-aware training, which substantially compresses the model while preserving structural fidelity. On large-scale CXR datasets the method achieves a 95.6% Dice score, matching full-precision SAM decoder fine-tuning, while reducing trainable parameters by 16.6x and compressing the model by 2.24x; a Wilcoxon signed-rank test confirms that quantization does not significantly degrade segmentation accuracy. The framework thus balances accuracy, efficiency, and clinical reliability, enabling compact, deployable foundation models.

Link: https://arxiv.org/abs/2604.01167
Authors: Prantik Deb, Srimanth Dhondy, N. Ramakrishna, Anu Kapoor, Raju S. Bapi, Tapabrata Chakraborti
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ISBI 2026 (Oral Presentation)

Click to view abstract

Abstract:Chest X-ray (CXR) segmentation is an important step in computer-aided diagnosis, yet deploying large foundation models in clinical settings remains challenging due to computational constraints. We propose AdaLoRA-QAT, a two-stage fine-tuning framework that combines adaptive low-rank encoder adaptation with full quantization-aware training. Adaptive rank allocation improves parameter efficiency, while selective mixed-precision INT8 quantization preserves structural fidelity crucial for clinical reliability. Evaluated across large-scale CXR datasets, AdaLoRA-QAT achieves 95.6% Dice, matching full-precision SAM decoder fine-tuning while reducing trainable parameters by 16.6\times and yielding 2.24\times model compression. A Wilcoxon signed-rank test confirms that quantization does not significantly degrade segmentation accuracy. These results demonstrate that AdaLoRA-QAT effectively balances accuracy, efficiency, and structural trust-worthiness, enabling compact and deployable foundation models for medical image segmentation. Code and pretrained models are available at: this https URL

[CV-121] AI-assisted Human-in-the-Loop Web Platform for Structural Characterization in Hard drive design

[Quick Read]: This paper addresses the tension between automation and flexibility in nanoscale characterization of semiconductor materials: rigid automated pipelines are brittle to sample variability, while purely manual analysis is slow and subjective. The key to the solution is a tunable human-AI-assisted workflow framework whose core combines gradient-based peak detection with interactive correction modules, allowing human expert input at the design stage to ensure adaptability while keeping execution fully automated during analysis. Implemented as a web interface that processes TEM/EMD files directly, the framework integrates denoising and interface-tracking algorithms and outputs layer-thickness and interface-roughness statistics with nanometer precision, providing a reusable, scalable, standardized analysis pipeline that bridges human insight and machine precision.

Link: https://arxiv.org/abs/2604.00359
Authors: Utkarsh Pratiush, Huaixun Huyan, Maryam Zahiri Azar, Esmeralda Yitamben, Allen Bourez, Sergei V Kalinin, Vasfi Burak Ozdol
Affiliations: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Scanning transmission electron microscopy (STEM) has become a cornerstone instrument for semiconductor materials metrology, enabling nanoscale analysis of complex multilayer structures that define device performance. Developing effective metrology workflows for such systems requires balancing automation with flexibility; rigid pipelines are brittle to sample variability, while purely manual approaches are slow and subjective. Here, we present a tunable human-AI-assisted workflow framework that enables modular and adaptive analysis of STEM images for device characterization. As an illustrative example, we demonstrate a workflow for automated layer thickness and interface roughness quantification in multilayer thin films. The system integrates gradient-based peak detection with interactive correction modules, allowing human input at the design stage while maintaining fully automated execution across samples. Implemented as a web-based interface, it processes TEM/EMD files directly, applies noise reduction and interface tracking algorithms, and outputs statistical roughness and thickness metrics with nanometer precision. This architecture exemplifies a general approach toward adaptive, reusable metrology workflows - bridging human insight and machine precision for scalable, standardized analysis in semiconductor manufacturing. The code is made available at this https URL

[CV-122] Feature-level Site Leakage Reduction for Cross-Hospital Chest X-ray Transfer via Self-Supervised Learning

[Quick Read]: This paper addresses cross-hospital performance degradation of chest X-ray models by quantifying and understanding the impact of site leakage on transfer-learning methods. Prior work often assumes feature invariance across hospitals without measuring it directly. The key to the solution is a post hoc linear probe that quantifies how well the acquisition site can be predicted from frozen backbone features $f$ and projection features $z$, revealing how different transfer strategies, such as multi-site self-supervised learning (SSL) and feature-level adversarial site confusion, actually behave. Experiments show that multi-site SSL substantially improves pneumonia-classification AUC on the target hospital RSNA (from 0.6736 ± 0.0148 to 0.7804 ± 0.0197) while reducing site leakage, whereas adversarial confusion reduces the leakage metric but does not reliably improve performance and increases variance. This indicates that pursuing feature invariance alone may not improve generalization, and that leakage measurement is needed to evaluate transfer methods correctly.

Link: https://arxiv.org/abs/2604.00263
Authors: Ayoub Louaye Bouaziz, Lokmane Chebouba
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at The 7th International Conference on Computing Systems and Applications [Algiers, 2026]

Click to view abstract

Abstract:Cross-hospital failure in chest X-ray models is often attributed to domain shift, yet most work assumes invariance without measuring it. This paper studies how to measure site leakage directly and how that measurement changes conclusions about transfer methods. We study multi-site self-supervised learning (SSL) and feature-level adversarial site confusion for cross-hospital transfer. We pretrain a ResNet-18 on NIH and CheXpert without pathology labels. We then freeze the encoder and train a linear pneumonia classifier on NIH only, evaluating transfer to RSNA. We quantify site leakage using a post hoc linear probe that predicts acquisition site from frozen backbone features f and projection features z . Across 3 random seeds, multi-site SSL improves RSNA AUC from 0.6736 \pm 0.0148 (ImageNet initialization) to 0.7804 \pm 0.0197. Adding adversarial site confusion on f reduces measured leakage but does not reliably improve AUC and increases variance. On f , site probe accuracy drops from 0.9890 \pm 0.0021 (SSL-only) to 0.8504 \pm 0.0051 (CanonicalF), where chance is 0.50. On z , probe accuracy drops from 0.8912 \pm 0.0092 to 0.7810 \pm 0.0250. These results show that measuring leakage changes how transfer methods should be interpreted: multi-site SSL drives transfer, while adversarial confusion exposes the limits of invariance assumptions.

[CV-123] Pupil Design for Computational Wavefront Estimation

[Quick Read]: This paper addresses how to accurately recover the incident wavefront from a single intensity measurement, a key technical challenge for emerging applications in adaptive optics, holography, computational microscopy, and non-line-of-sight imaging. The key to the solution is a quantitative asymmetry metric for pupil design: through large-scale simulations and optical-bench experiments, the authors show that increasing pupil asymmetry substantially enhances wavefront recoverability, and they systematically analyze the trade-offs with light throughput and performance under noise.

Link: https://arxiv.org/abs/2604.00225
Authors: Ali Almuallem, Nicholas Chimitt, Bole Ma, Qi Guo, Stanley H. Chan
Affiliations: Purdue University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Establishing a precise connection between imaged intensity and the incident wavefront is essential for emerging applications in adaptive optics, holography, computational microscopy, and non-line-of-sight imaging. While prior work has shown that breaking symmetries in pupil design enables wavefront recovery from a single intensity measurement, there is little guidance on how to design a pupil that improves wavefront estimation. In this work we introduce a quantitative asymmetry metric to bridge this gap and, through an extensive empirical study and supporting analysis, demonstrate that increasing asymmetry enhances wavefront recoverability. We analyze the trade-offs in pupil design, and the impact on light throughput along with performance in noise. Both large-scale simulations and optical bench experiments are carried out to support our findings.

[CV-124] Brain MR Image Synthesis with Multi-contrast Self-attention GAN

[Quick Read]: This paper addresses the impracticality of acquiring all MRI contrasts (e.g., T1c, T1n, T2, T2FLAIR) for neuro-oncological assessment due to time, cost, and patient discomfort, which limits comprehensive tumour evaluation. The key to the solution is 3D-MC-SAGAN, a unified 3D multi-contrast synthesis framework that generates the missing modalities (T2f, T1n, T1c) from a single T2 input. Its core innovations are a multi-scale 3D encoder-decoder with a novel Memory-Bounded Hybrid Attention (MBHA) block to capture long-range dependencies efficiently, and a frozen 3D U-Net segmentation module that imposes a segmentation-consistency constraint to preserve lesion morphology. A composite objective combining adversarial, reconstruction, perceptual, structural-similarity, contrast-classification, and segmentation-guided losses jointly optimizes global realism and tumour-structure fidelity, achieving state-of-the-art quantitative performance on brain MRI datasets while maintaining tumour-segmentation accuracy comparable to fully acquired multi-modal inputs.

Link: https://arxiv.org/abs/2604.00070
Authors: Zaid A. Abod, Furqan Aziz
Affiliations: University of Leicester
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Note: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Accurate and complete multi-modal Magnetic Resonance Imaging (MRI) is essential for neuro-oncological assessment, as each contrast provides complementary anatomical and pathological information. However, acquiring all modalities (e.g., T1c, T1n, T2, T2f) for every patient is often impractical due to time, cost, and patient discomfort, potentially limiting comprehensive tumour evaluation. We propose 3D-MC-SAGAN (3D Multi-Contrast Self-Attention generative adversarial network), a unified 3D multi-contrast synthesis framework that generates high-fidelity missing modalities from a single T2 input while explicitly preserving tumour characteristics. The model employs a multi-scale 3D encoder-decoder generator with residual connections and a novel Memory-Bounded Hybrid Attention (MBHA) block to capture long-range dependencies efficiently, and is trained with a WGAN-GP critic and an auxiliary contrast-conditioning branch to produce T2f, T1n, and T1c volumes within a single unified network. A frozen 3D U-Net-based segmentation module introduces a segmentation-consistency constraint to preserve lesion morphology. The composite objective integrates adversarial, reconstruction, perceptual, structural similarity, contrast-classification, and segmentation-guided losses to align global realism with tumour-preserving structure. Extensive evaluation on 3D brain MRI datasets demonstrates that 3D-MC-SAGAN achieves state-of-the-art quantitative performance and generates visually coherent, anatomically plausible contrasts with improved distribution-level realism. Moreover, it maintains tumour segmentation accuracy comparable to fully acquired multi-modal inputs, highlighting its potential to reduce acquisition burden while preserving clinically meaningful information.

Artificial Intelligence

[AI-0] The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

[Quick Read]: This paper addresses the lack of a unified mathematical framework explaining what determines forecast skill in AI weather prediction: existing theory covers specific architectural choices while neglecting training methodology, loss-function design, and data diversity. The key to the solution is an analysis framework for the complete learning pipeline, rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory, that for the first time treats architecture, loss function, training strategy, and data distribution jointly. Key results include: a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales; a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical-harmonic coordinates; and out-of-distribution extrapolation bounds proving that data-driven models systematically underestimate extreme events, with bias growing linearly in record exceedance. Empirically, inference across ten architecturally diverse AI weather models validates these predictions, providing mathematically evaluable prescriptive guidance for future model design.

Link: https://arxiv.org/abs/2604.01215
Authors: Piyush Garg, Diana R. Gergel, Andrew E. Shao, Galen J. Yacalis
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Click to view abstract

Abstract:AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.

[AI-1] CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

[Quick Read]: This paper addresses a common shortcoming of current LLM-guided scientific algorithm discovery systems: they optimize code-level artifacts alone, under-representing scientific structure and yielding results that are weakly grounded in correctness and originality. The key to the solution is CliffSearch, an agentic evolutionary framework whose core innovation is implementing the evolution operators (selection, crossover, mutation, and review) entirely as LLM agents, with the loop built around three principles: (1) each node is a structured scientific artifact (theory+code or code-only mode); (2) reviewer judgments of correctness and originality serve as selection gates on par with the task metric; and (3) mutation is split into exploration and correction pathways, the former importing knowledge from adjacent scientific domains to increase novelty, the latter performing targeted repair using reviewer feedback over theory, code, benchmark results, and runtime errors. This design yields a reproducible discovery workflow under controlled exploration conditions that prioritizes scientific interpretability and correctness over raw candidate throughput.

Link: https://arxiv.org/abs/2604.01210
Authors: Youssef Mroueh, Carlos Fonseca, Brian Belgodere, David Cox
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but often under-represent scientific structure by optimizing code-only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at this https URL .

[AI-2] Therefore I am. I Think

[Quick Read]: This paper asks whether large language reasoning models think first and then decide, or decide first and then generate their reasoning, a question central to understanding their internal cognitive mechanisms. The key finding is that detectable decision information is already encoded in early activations: a simple linear probe decodes decisions such as tool calls from pre-generation activation vectors with high confidence, in some cases before a single reasoning token is produced. Activation-steering experiments further show that perturbing the decision direction substantially changes model behavior (flipping between 7% and 79% of examples depending on model and benchmark), and when the decision is steered, the chain of thought typically rationalizes the flip rather than resisting it, indicating that an implicit decision representation exists before formal reasoning begins.

Link: https://arxiv.org/abs/2604.01202
Authors: Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, Rajagopal Venkatesaramani
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We consider the question: when a large language reasoning model makes a choice, did it think first and then decide to, or decide first and then think? In this paper, we present evidence that detectable, early-encoded decisions shape chain-of-thought in reasoning models. Specifically, we show that a simple linear probe successfully decodes tool-calling decisions from pre-generation activations with very high confidence, and in some cases, even before a single reasoning token is produced. Activation steering supports this causally: perturbing the decision direction leads to inflated deliberation, and flips behavior in many examples (between 7 - 79% depending on model and benchmark). We also show through behavioral analysis that, when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it. Together, these results suggest that reasoning models can encode action choices before they begin to deliberate in text.
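
The linear-probe methodology can be sketched end-to-end on synthetic data: if a decision is linearly encoded in pre-generation activations, a logistic-regression probe recovers it. The activation dimensions, training hyperparameters, and data below are all illustrative assumptions, not the paper's setup.

```python
import math
import random

def train_linear_probe(acts, labels, epochs=200, lr=0.5):
    """Minimal logistic-regression probe (the paper's probe is linear;
    the SGD training details here are illustrative)."""
    dim = len(acts[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(acts, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the logistic loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_predict(w, b, x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

# Synthetic "activations": the decision (label) is linearly encoded in
# the first dimension, mimicking an early decision direction.
random.seed(0)
acts = [[random.gauss(1.0 if i % 2 else -1.0, 0.3), random.gauss(0.0, 1.0)]
        for i in range(40)]
labels = [i % 2 for i in range(40)]
w, b = train_linear_probe(acts, labels)
acc = sum(probe_predict(w, b, x) == y for x, y in zip(acts, labels)) / 40
```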

[AI-3] Adversarial Moral Stress Testing of Large Language Models

[Quick Read]: This paper addresses the difficulty of evaluating the ethical robustness of large language models (LLMs) deployed in real software systems, particularly behavioral instability under sustained adversarial user interaction. Existing safety benchmarks rely on single-round evaluation and aggregate metrics (such as toxicity scores and refusal rates), which miss rare but high-impact ethical failures and progressive degradation across multi-turn interaction. The key to the solution is a stress-based evaluation framework, Adversarial Moral Stress Testing (AMST), which applies structured stress transformations to prompts and quantifies multi-round stability through distribution-aware robustness metrics covering variance, tail risk, and temporal behavioral drift. AMST not only exposes substantial robustness differences across models but also shows that robustness depends more on distributional stability and tail behavior than on average performance, providing a scalable, model-agnostic means of evaluating and monitoring LLM-enabled systems for trustworthy deployment.

Link: https://arxiv.org/abs/2604.01108
Authors: Saeid Jamshidi, Foutse Khomh, Arghavan Moradi Dakhel, Amin Nikanjam, Mohammad Hamdaqa, Kawser Wazed Nafi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remains challenging, particularly under sustained adversarial user interaction. Existing safety benchmarks typically rely on single-round evaluations and aggregate metrics, such as toxicity scores and refusal rates, which offer limited visibility into behavioral instability that may arise during realistic multi-turn interactions. As a result, rare but high-impact ethical failures and progressive degradation effects may remain undetected prior to deployment. This paper introduces Adversarial Moral Stress Testing (AMST), a stress-based evaluation framework for assessing ethical robustness under adversarial multi-round interactions. AMST applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics that capture variance, tail risk, and temporal behavioral drift across interaction rounds. We evaluate AMST on several state-of-the-art LLMs, including LLaMA-3-8B, GPT-4o, and DeepSeek-v3, using a large set of adversarial scenarios generated under controlled stress conditions. The results demonstrate substantial differences in robustness profiles across models and expose degradation patterns that are not observable under conventional single-round evaluation protocols. In particular, robustness has been shown to depend on distributional stability and tail behavior rather than on average performance alone. Additionally, AMST provides a scalable and model-agnostic stress-testing methodology that enables robustness-aware evaluation and monitoring of LLM-enabled software systems operating in adversarial environments.

[AI-4] Approximating Pareto Frontiers in Stochastic Multi-Objective Optimization via Hashing and Randomization

【速读】:该论文旨在解决随机多目标优化(Stochastic Multi-Objective Optimization, SMOO)问题,其核心挑战在于在不确定环境中权衡多个可能冲突的目标,并准确识别帕累托前沿(Pareto frontier),而传统方法因嵌入的概率推理(如边缘概率、后验概率或期望计算)常面临计算复杂度高或近似精度不足的问题。解决方案的关键是提出XOR-SMOO算法,该算法通过仅需多项式对数次数地调用SAT oracle,在概率 1-\delta 下获得 \gamma-近似帕累托前沿(即真实前沿的乘法因子误差不超过 \gamma),从而在 #P-难问题中实现紧致且常数因子的近似保证,显著提升了求解器的实用性与可靠性。

链接: https://arxiv.org/abs/2604.01098
作者: Jinzhao Li,Nan Jiang,Yexiang Xue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Stochastic Multi-Objective Optimization (SMOO) is critical for decision-making trading off multiple potentially conflicting objectives in uncertain environments. SMOO aims at identifying the Pareto frontier, which contains all mutually non-dominating decisions. The problem is highly intractable due to the embedded probabilistic inference, such as computing the marginal, posterior probabilities, or expectations. Existing methods, such as scalarization, sample average approximation, and evolutionary algorithms, either offer arbitrarily loose approximations or may incur prohibitive computational costs. We propose XOR-SMOO, a novel algorithm that with probability 1-\delta , obtains \gamma -approximate Pareto frontiers ( \gamma < 1 ) for SMOO by querying an SAT oracle poly-log times in \gamma and \delta . A \gamma -approximate Pareto frontier is only below the true frontier by a fixed, multiplicative factor \gamma . Thus, XOR-SMOO solves highly intractable SMOO problems (#P-hard) with only queries to SAT oracles while obtaining tight, constant factor approximation guarantees. Experiments on real-world road network strengthening and supply chain design problems demonstrate that XOR-SMOO outperforms several baselines in identifying Pareto frontiers that have higher objective values, better coverage of the optimal solutions, and the solutions found are more evenly distributed. Overall, XOR-SMOO significantly enhanced the practicality and reliability of SMOO solvers.
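
帕累托支配与 γ-近似前沿这两个核心定义,可以用如下 Python 玩具示例直观说明(此处目标取越大越好;候选点均为假设数据,并非论文的 XOR-SMOO 算法本身):

```python
def dominates(a, b):
    # a 支配 b:每个目标都不劣于 b,且至少一个目标严格更优(目标越大越好)
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(points):
    # 保留所有不被任何其他点支配的点
    return [p for p in points if not any(dominates(q, p) for q in points)]

def is_gamma_approximate(frontier, true_frontier, gamma):
    # 对真实前沿上的每个点,近似前沿中存在一点,逐目标不低于其 gamma 倍
    return all(
        any(all(x >= gamma * y for x, y in zip(p, t)) for p in frontier)
        for t in true_frontier
    )

points = [(1, 5), (2, 4), (3, 3), (2, 2), (4, 1)]
front = pareto_frontier(points)
print(front)  # (2, 2) 被 (3, 3) 支配,其余互不支配
approx = [(0.9, 4.6), (1.9, 3.8), (2.8, 2.8), (3.7, 0.95)]
print(is_gamma_approximate(approx, front, gamma=0.9))
```

其中 `is_gamma_approximate` 对应摘要中“近似前沿与真实前沿之间仅相差固定乘法因子 γ”这一判据。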

[AI-5] VibeGuard: A Security Gate Framework for AI-Generated Code

【速读】:该论文旨在解决生成式 AI(Generative AI)在软件开发中广泛应用所引入的新型安全漏洞问题,特别是“vibe coding”模式下因自动化代码生成和打包配置错误导致的敏感信息泄露(如源码映射文件暴露)及供应链风险。现有静态分析与密钥扫描工具无法覆盖此类漏洞,因其本质源于非传统逻辑缺陷,而是由构建流程中的配置漂移或 artifact 卫生问题引发。解决方案的关键在于提出 VibeGuard——一个预发布安全门控机制,专门针对五类盲区进行检测:artifact 卫生、打包配置漂移、源码映射暴露、硬编码密钥和供应链风险;通过在受控实验中对八个合成项目(七种漏洞场景,一种干净对照)实现 100% 召回率、89.47% 精确率(F1=94.44%),验证了其有效性和可部署性,从而为依赖 AI 生成代码的团队提供纵深防御的工作流支撑。

链接: https://arxiv.org/abs/2604.01052
作者: Ying Xie
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:“Vibe coding,” in which developers delegate code generation to AI assistants and accept the output with little manual review, has gained rapid adoption in production settings. On March 31, 2026, Anthropic’s Claude Code CLI shipped a 59.8 MB source map file in its npm package, exposing roughly 512,000 lines of proprietary TypeScript. The tool had itself been largely vibe-coded, and the leak traced to a misconfigured packaging rule rather than a logic bug. Existing static-analysis and secret-scanning tools did not cover this failure mode, pointing to a gap between the vulnerabilities AI tends to introduce and the vulnerabilities current tooling is built to find. We present VibeGuard, a pre-publish security gate that targets five such blind spots: artifact hygiene, packaging-configuration drift, source-map exposure, hardcoded secrets, and supply-chain risk. In controlled experiments on eight synthetic projects (seven vulnerable, one clean control), VibeGuard achieved 100% recall, 89.47% precision (F1 = 94.44%), and correct pass/fail gate decisions on all eight projects across three policy levels. We discuss how these results inform a defense-in-depth workflow for teams that rely on AI code generation.
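
此类发布前安全门控的基本形态可以用几十行 Python 勾勒:下面的示意只覆盖论文五类盲区中的两类(source map 泄露与硬编码密钥),文件布局、正则规则与判定阈值均为本文假设,仅用于说明“扫描—汇总—门控”的流程:

```python
import pathlib
import re
import tempfile

# 发布前门控的极简示意:扫描待发布目录,汇总发现,再给出 PASS/FAIL。
# 正则与文件类型白名单均为本文假设的玩具规则。
SECRET_RE = re.compile(
    r"(?:api[_-]?key|secret|token)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]",
    re.IGNORECASE,
)

def scan_package(root):
    findings = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_dir():
            continue
        if path.suffix == ".map":
            # source map 不应随包发布(对应 Claude Code CLI 的泄露场景)
            findings.append(("source-map-exposure", path.name))
        elif path.suffix in {".js", ".ts", ".json", ".py"}:
            if SECRET_RE.search(path.read_text(errors="ignore")):
                findings.append(("hardcoded-secret", path.name))
    return findings

def gate(findings):
    return "FAIL" if findings else "PASS"

with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "cli.js").write_text('const key = "x"; // ok')
    (root / "cli.js.map").write_text("{}")  # 本应被打包配置排除
    (root / "config.js").write_text('apiKey = "abcdefghijklmnop1234"')
    findings = scan_package(root)
    print(findings, gate(findings))
```

真实系统还需覆盖打包配置漂移与供应链依赖等检查,这里仅演示门控决策如何由逐文件发现聚合而来。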

[AI-6] Adversarial Attacks in AI-Driven RAN Slicing: SLA Violations and Recovery

【速读】:该论文旨在解决对抗性攻击对基于深度强化学习(Deep Reinforcement Learning, DRL)的无线接入网(Radio Access Network, RAN)切片决策的影响问题,特别是预算受限的敌手通过选择性干扰切片传输来诱导资源分配偏差,进而导致服务等级协议(Service Level Agreement, SLA)违规。其解决方案的关键在于量化此类攻击引发的稳态SLA违规程度及DRL代理在攻击后恢复行为,揭示了即使攻击停止,DRL策略仍需经历显著恢复期才能收敛至正常性能水平,凸显了AI驱动RAN切片系统在安全性方面的脆弱性与恢复机制设计的重要性。

链接: https://arxiv.org/abs/2604.01049
作者: Deemah H. Tashman,Soumaya Cherkaoui
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Next-generation (NextG) cellular networks are designed to support emerging applications with diverse data rate and latency requirements, such as immersive multimedia services and large-scale Internet of Things deployments. A key enabling mechanism is radio access network (RAN) slicing, which dynamically partitions radio resources into virtual resource blocks to efficiently serve heterogeneous traffic classes, including enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC). In this paper, we study the impact of adversarial attacks on AI-driven RAN slicing decisions, where a budget-constrained adversary selectively jams slice transmissions to bias deep reinforcement learning (DRL)-based resource allocation, and quantify the resulting service level agreement (SLA) violations and post-attack recovery behavior. Our results indicate that budget-constrained adversarial jamming can induce severe and slice-dependent steady-state SLA violations. Moreover, the DRL agent’s reward converges toward the clean baseline only after a non-negligible recovery period.

[AI-7] Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中系统指令(System Instructions)泄露的安全风险问题,尤其关注攻击者通过重构查询方式(如编码或结构化输出任务)绕过拒绝式指令防御机制,从而提取敏感信息(如API密钥、内部策略等)的漏洞。解决方案的关键在于提出了一种自动化评估框架,验证了当前基于拒绝响应的防护策略在面对结构化序列化请求时存在显著失效风险,并进一步提出一种无需模型重训练的缓解策略:利用思维链(Chain-of-Thought)推理模型对系统指令进行一次性语义重塑(one-shot instruction reshaping),通过细微的措辞和结构调整显著降低攻击成功率。

链接: https://arxiv.org/abs/2604.01039
作者: Anubhab Sahu,Diptisha Samanta,Reza Soosahabi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.

[AI-8] Fast and Accurate Probing of In-Training LLMs' Downstream Performances

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在训练过程中对下游任务性能进行评估时面临的计算成本过高与评估效率低下问题。传统生成式评估方法因延迟高(约1小时)难以支撑高效迭代,而仅依赖训练损失(如困惑度)又无法准确反映下游任务表现,导致评估结果不可靠。解决方案的关键在于提出一种轻量级的“探针”(probe)机制:利用训练中模型检查点(checkpoints)的内部表示作为输入,直接预测其在下游任务上的成功概率(pass@1),从而实现高效且准确的在训练期间评估。该方法显著降低计算延迟至约3分钟,同时保持较高预测准确性(平均AUROC达0.75),并具备跨检查点的良好泛化能力,为LLM开发提供了可扩展、敏捷且数据驱动的评估新范式。

链接: https://arxiv.org/abs/2604.01025
作者: Zhichen Liu,Tianle Lun,Zhibin Wen,Hao An,Yulin Ou,Jianhui Xu,Hao Zhang,Wenyi Fang,Yang Zheng,Yang Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The paradigm of scaling Large Language Models (LLMs) in both parameter size and test time has pushed the boundaries of AI capabilities, but at the cost of making the traditional generative evaluation paradigm prohibitively expensive, therefore making the latency of LLM’s in-training downstream performance evaluation unbearable. However, simple metrics like training loss (perplexity) are not always correlated with downstream performance, as sometimes their trends diverge from the actual task outcomes. This dilemma calls for a method that is computationally efficient and sufficiently accurate in measuring model capabilities. To address this challenge, we introduce a new in-training evaluation paradigm that uses a lightweight probe for monitoring downstream performance. The probes take the internal representations of LLM checkpoints (during training) as input and directly predict the checkpoint’s performance on downstream tasks measured by success probability (i.e., pass@1). We design several probe architectures, validating their effectiveness using the OLMo3-7B’s checkpoints across a diverse set of downstream tasks. The probes can accurately predict a checkpoint’s performance (with avg. AUROC 0.75), have decent generalizability across checkpoints (earlier predicts later), and reduce the computation latency from \sim 1 hr (using conventional generative evaluation method) to \sim 3 min. In sum, this work presents a practical and scalable in-training downstream evaluation paradigm, enabling a more agile, informed, and efficient LLM development process.

[AI-9] Transfer learning for nonparametric Bayesian networks

【速读】:该论文旨在解决在数据稀缺条件下学习非参数贝叶斯网络(nonparametric Bayesian networks)时性能下降的问题。其核心挑战在于如何有效利用源域知识进行迁移学习,同时避免负迁移(negative transfer)带来的负面影响。解决方案的关键在于提出两种迁移学习算法:基于约束的结构学习方法PC-stable-transfer learning (PCS-TL) 和基于评分的优化方法hill climbing transfer learning (HC-TL),并分别为二者设计特定指标以抑制负迁移;此外,在参数估计阶段采用对数线性池化(log-linear pooling)策略整合源域与目标域的信息。实验表明,这些方法显著提升了小样本场景下非参数贝叶斯网络的学习性能,具有实际工业部署价值。

链接: https://arxiv.org/abs/2604.01021
作者: Rafael Sojo,Pedro Larrañaga,Concha Bielza
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: An earlier version was previously posted on SSRN. This version includes improvements in experiments and evaluation metrics following reviewer comments. Revision submitted to Knowledge-Based Systems

点击查看摘要

Abstract:This paper introduces two transfer learning methodologies for estimating nonparametric Bayesian networks under scarce data. We propose two algorithms, a constraint-based structure learning method, called PC-stable-transfer learning (PCS-TL), and a score-based method, called hill climbing transfer learning (HC-TL). We also define particular metrics to tackle the negative transfer problem in each of them, a situation in which transfer learning has a negative impact on the model’s performance. Then, for the parameters, we propose a log-linear pooling approach. For the evaluation, we learn kernel density estimation Bayesian networks, a type of nonparametric Bayesian network, and compare their transfer learning performance with the models alone. To do so, we sample data from small, medium and large-sized synthetic networks and datasets from the UCI Machine Learning repository. Then, we add noise and modifications to these datasets to test their ability to avoid negative transfer. To conclude, we perform a Friedman test with a Bergmann-Hommel post-hoc analysis to show statistical proof of the enhanced experimental behavior of our methods. Thus, PCS-TL and HC-TL demonstrate to be reliable algorithms for improving the learning performance of a nonparametric Bayesian network with scarce data, which in real industrial environments implies a reduction in the required time to deploy the network.
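
其中用于参数融合的对数线性池化(log-linear pooling)可用离散分布给出一个极简示意(权重 w 与两个分布的取值均为本文假设),即 p_pooled(x) ∝ p_src(x)^w · p_tgt(x)^(1−w):

```python
def log_linear_pool(p_src, p_tgt, w):
    # 对数线性池化:逐元素几何加权混合后重新归一化
    raw = [a ** w * b ** (1 - w) for a, b in zip(p_src, p_tgt)]
    z = sum(raw)
    return [r / z for r in raw]

p_source = [0.7, 0.2, 0.1]   # 源域估计(数据充足)
p_target = [0.4, 0.4, 0.2]   # 目标域估计(数据稀缺)
pooled = log_linear_pool(p_source, p_target, w=0.5)
print([round(p, 3) for p in pooled])
```

w 在 0 与 1 之间插值于目标域与源域估计之间,论文在核密度估计贝叶斯网络的参数层面即采用这一思路进行源/目标信息融合。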

[AI-10] OmniMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory

【速读】:该论文旨在解决AI代理在长期运行中面临的多模态经验存储、组织与回忆能力不足的问题,这是制约其在长时间尺度上有效运作的关键瓶颈。解决方案的核心在于构建一个统一的多模态终身记忆框架OmniMem,该框架通过部署自主研究流水线(autonomous research pipeline)进行自动化探索与优化,而非依赖人工调参或传统AutoML方法。关键突破在于:系统通过自动诊断失败模式、修复数据流水线缺陷(+175%性能提升)、调整架构设计(+44%)以及改进提示工程(特定类别+188%),显著超越了超参数调优的贡献,最终在LoCoMo和Mem-Gallery两个基准上分别实现F1分数提升411%和214%,验证了自主研究范式在复杂多模态记忆系统设计中的有效性。

链接: https://arxiv.org/abs/2604.01007
作者: Jiaqi Liu,Zipeng Ling,Shi Qiu,Yanqing Liu,Siwei Han,Peng Xia,Haoqin Tu,Zeyu Zheng,Cihang Xie,Charles Fleming,Mingyu Ding,Huaxiu Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected for manual exploration or traditional AutoML to explore effectively. We deploy an autonomous research pipeline to discover OmniMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naïve baseline (F1=0.117 on LoCoMo), the pipeline autonomously executes \sim50 experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state-of-the-art on both benchmarks, improving F1 by +411% on LoCoMo (0.117 \to 0.598) and +214% on Mem-Gallery (0.254 \to 0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly suited for autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains. Code is available at this https URL.

[AI-11] Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

【速读】:该论文旨在解决传统强化学习(Reinforcement Learning, RL)算法中策略参数化为对角高斯分布所导致的局限性问题,即无法有效建模多模态分布,从而难以覆盖多解问题中的全部最优解,并因仅用均值表示回报而丢失其分布特性,削弱了对策略更新的有效指导。解决方案的关键在于提出一种基于流匹配(flow matching)的分布强化学习算法(Flow-based Policy with Distributional RL, FP-DRL),该方法通过流模型高效地拟合复杂策略分布,并结合分布强化学习(Distributional RL)对完整回报分布进行建模与优化,从而更有效地引导多模态策略更新,显著提升智能体性能。

链接: https://arxiv.org/abs/2604.00977
作者: Ruijie Hao,Longfei Zhang,Yang Dai,Yang Ma,Xingxing Liang,Guangquan Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution, which constrains the policy from capturing multimodal distributions, making it difficult to cover the full range of optimal solutions in multi-solution problems, and the return is reduced to a mean value, losing its multimodal nature and thus providing insufficient guidance for policy updates. In response to these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). This algorithm models the policy using flow matching, which offers both computational efficiency and the capacity to fit complex distributions. Additionally, it employs a distributional RL approach to model and optimize the entire return distribution, thereby more effectively guiding multimodal policy updates and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that the FP-DRL algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting superior representation capability of the flow policy.
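
流匹配训练目标(直线插值路径加上对目标速度场的回归)可以用几行 Python 示意;样本与“完美速度模型”均为假设的玩具设定,仅说明该损失的构造方式,并非论文的 FP-DRL 实现:

```python
# 条件流匹配目标的极简示意:
# 线性插值路径 x_t = (1-t)·x0 + t·x1,目标速度场为 x1 - x0,
# 损失为速度网络预测与目标速度的均方误差。
def cfm_sample(x0, x1, t):
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return xt, v_target

def cfm_loss(velocity_model, x0, x1, t):
    xt, v = cfm_sample(x0, x1, t)
    pred = velocity_model(xt, t)
    return sum((p - vi) ** 2 for p, vi in zip(pred, v)) / len(v)

x0 = [0.0, 0.0]    # 噪声样本
x1 = [1.0, -2.0]   # 数据样本(策略场景下可理解为动作)
ideal = lambda xt, t: [1.0, -2.0]   # 恰好输出 x1 - x0 的"完美"模型
print(cfm_loss(ideal, x0, x1, t=0.3))  # → 0.0
```

训练时 t 随机采样、速度模型为神经网络;采样动作则从噪声出发沿学得的速度场积分,这使策略能够表达多峰分布。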

[AI-12] WARP: Guaranteed Inner-Layer Repair of NLP Transformers

【速读】:该论文旨在解决基于Transformer的自然语言处理(Natural Language Processing, NLP)模型在面对对抗性扰动时仍易受攻击的问题,尤其是现有修复方法存在根本性权衡:基于梯度的方法虽灵活但缺乏可验证性且易过拟合,而提供修复保证的方法则受限于仅能作用于最后一层或小型网络,严重限制了参数搜索空间。解决方案的关键在于提出WARP(Weight-Adjusted Repair with Provability),这是一个基于约束的修复框架,通过将修复问题建模为一个由对数间隙的一阶线性化导出的凸二次规划问题,实现了在高维参数空间中的可处理(tractable)优化。该方法在满足一阶近似成立的前提下,提供了三个逐样本保证:(i) 正间隔(margin)约束确保修复后输入被正确分类,(ii) 保留集上的保持约束,以及(iii) 基于Lipschitz连续性的认证鲁棒半径。此外,引入基于敏感性的预处理步骤以适应不同模型架构,并证明在温和假设下迭代优化过程收敛至满足所有修复约束的解。实证结果表明,该方法在仅编码器(encoder-only)的Transformer上有效提升了对抗鲁棒性并保持理论保证。

链接: https://arxiv.org/abs/2604.00938
作者: Hsin-Ling Hsu,Min-Yu Chen,Nai-Chia Chen,Yan-Ru Chen,Yi-Ling Chang,Fang Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.

[AI-13] PsychAgent : An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

【速读】:该论文旨在解决当前生成式 AI 心理咨询系统(Generative AI Psychological Counselors)依赖静态对话数据集进行监督微调,导致其无法像人类专家那样通过持续的临床实践和经验积累来提升专业能力的问题。解决方案的关键在于提出一个经验驱动的终身学习代理(Experience-Driven Lifelong Learning Agent, PsychAgent),其核心由三个引擎构成:一是面向长期多轮交互的记忆增强型规划引擎(Memory-Augmented Planning Engine),确保治疗过程的连贯性;二是技能演化引擎(Skill Evolution Engine),从历史咨询轨迹中提取实践导向的新技能;三是强化内化引擎(Reinforced Internalization Engine),通过拒绝微调将演化后的技能整合进模型,从而在多样场景下提升整体响应质量与一致性。

链接: https://arxiv.org/abs/2604.00931
作者: Yutao Yang,Junsong Li,Qianjun Pan,Jie Zhou,Kai Chen,Qin Chen,Jingyuan Zhao,Ningning Zhou,Xin Li,Liang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (PsychAgent) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.

[AI-14] Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time

【速读】:该论文旨在解决生成式 AI(Generative AI)在开源软件开发中日益增长的参与度所带来的影响问题,特别是其对代码质量、团队协作模式以及长期可维护性的影响。研究的关键在于构建了一个包含约11万条开源Pull Request的新型数据集,涵盖提交记录、评论、代码审查、问题和文件变更等多维度信息,从而系统比较五种主流编码代理(OpenAI Codex、Claude Code、GitHub Copilot、Google Jules 和 Devin)在合并频率、编辑文件类型及开发者交互信号等方面的差异,并通过纵向分析代理生成代码与人类编写代码的生存率和代码 churn 率,揭示其长期维护特性。结果表明,尽管AI代理活动持续增加,但其生成代码随时间变化的 churn 率更高,暗示其在工程实践中需更谨慎地集成到软件生命周期中。

链接: https://arxiv.org/abs/2604.00917
作者: Razvan Mihai Popescu,David Gros,Andrei Botocan,Rahul Pandita,Prem Devanbu,Maliheh Izadi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: MSR 2026 Technical Track

点击查看摘要

Abstract:The rise of large language models for code has reshaped software development. Autonomous coding agents, able to create branches, open pull requests, and perform code reviews, now actively contribute to real-world projects. Their growing role offers a unique and timely opportunity to investigate AI-driven contributions and their effects on code quality, team dynamics, and software maintainability. In this work, we construct a novel dataset of approximately 110,000 open-source pull requests, including associated commits, comments, reviews, issues, and file changes, collectively representing millions of lines of source code. We compare five popular coding agents, including OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin, examining how their usage differs in various development aspects such as merge frequency, edited file types, and developer interaction signals, including comments and reviews. Furthermore, we emphasize that code authoring and review are only a small part of the larger software engineering process, as the resulting code must also be maintained and updated over time. Hence, we offer several longitudinal estimates of survival and churn rates for agent-generated versus human-authored code. Ultimately, our findings indicate an increasing agent activity in open-source projects, although their contributions are associated with more churn over time compared to human-authored code.

[AI-15] Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

【速读】:该论文旨在解决多智能体检索增强生成(Multi-agent Retrieval-Augmented Generation, Multi-agent RAG)系统中因静态代理行为和固定编排策略导致的性能脆弱性问题,尤其是在处理多样化、多跳复杂任务时表现不佳。其核心挑战在于缺乏持续自适应的编排机制以及个体代理在行为层面的学习能力。解决方案的关键在于提出HERA框架,该框架通过分层优化实现双维度进化:全局层面利用奖励引导采样与经验累积动态优化查询相关的代理拓扑结构;局部层面则通过角色感知提示演化(Role-Aware Prompt Evolution)机制,基于信用分配与操作-行为双轴适配来精细化每个代理的行为,从而实现角色条件下的针对性改进。实验表明,HERA在六个知识密集型基准上平均提升38.69%,同时保持良好的泛化能力和token效率,并揭示出稀疏探索可催生紧凑且高效的多智能体网络结构。

链接: https://arxiv.org/abs/2604.00901
作者: Sha Li,Naren Ramakrishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent Retrieval-Augmented Generation (RAG), wherein each agent takes on a specific role, supports hard queries that require multiple steps and sources, or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration strategies, leading to brittle performance on diverse, multi-hop tasks. We identify two key limitations: the lack of continuously adaptive orchestration mechanisms and the absence of behavior-level learning for individual agents. To this end, we propose HERA, a hierarchical framework that jointly evolves multi-agent orchestration and role-specific agent prompts. At the global level, HERA optimizes query-specific agent topologies through reward-guided sampling and experience accumulation. At the local level, Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation along operational and behavioral principles, enabling targeted, role-conditioned improvements. On six knowledge-intensive benchmarks, HERA achieves an average improvement of 38.69% over recent baselines while maintaining robust generalization and token efficiency. Topological analyses reveal emergent self-organization, where sparse exploration yields compact, high-utility multi-agent networks, demonstrating both efficient coordination and robust reasoning.

[AI-16] Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies

【速读】:该论文旨在解决现有测试时学习(Test-Time Learning, TTL)方法中适应策略(adaptation policy)依赖人工设计、缺乏对下游任务优化的问题。其核心挑战在于如何自动发现能够提升智能体在多轮交互中纠错能力的通用适应策略,而非依赖人类直觉手动构造固定策略。解决方案的关键在于提出Meta-TTL框架,将适应策略的学习建模为一个双层优化问题:内层执行标准TTL过程以评估候选策略在连续episode中的纠错效果,外层则通过进化搜索在多样化的训练任务分布上迭代优化适应策略,从而获得具有迁移性的可泛化适应机制。

链接: https://arxiv.org/abs/2604.00830
作者: Zhanzhi Lou,Hui Chen,Yibo Li,Qian Wang,Bryan Hooi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent’s performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
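
双层优化的骨架可以用一个标量玩具问题示意:内层给候选适应策略打分,外层用带精英保留的进化搜索迭代改进。以下设定(标量策略、负均方误差评分)均为本文假设,仅说明“内层评估、外层进化”的结构:

```python
import random

random.seed(1)

def inner_eval(policy, tasks):
    # 内层:策略越接近任务的"理想适应量",得分越高(取负均方误差)
    return -sum((policy - t) ** 2 for t in tasks) / len(tasks)

def outer_search(tasks, pop_size=8, gens=30, sigma=0.5):
    # 外层:进化搜索,保留得分最高的一半,再对其做高斯扰动
    pop = [random.uniform(-5.0, 5.0) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda p: inner_eval(p, tasks), reverse=True)
        elite = pop[: pop_size // 2]
        pop = elite + [p + random.gauss(0.0, sigma) for p in elite]
    return max(pop, key=lambda p: inner_eval(p, tasks))

train_tasks = [1.5, 2.0, 2.5]   # 训练任务分布(此玩具中理想策略为其均值 2.0)
best = outer_search(train_tasks)
print(round(best, 2))
```

论文中内层是完整的 TTL 过程、策略是可执行的适应规则,但搜索结构与此一致:外层仅依据内层在任务分布上的表现来筛选与变异候选策略。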

[AI-17] Preference Guided Iterated Pareto Referent Optimisation for Accessible Route Planning

【速读】:该论文旨在解决城市路径规划中如何满足不同用户群体的可达性需求与偏好差异的问题,尤其关注多目标优化场景下用户交互效率与计算性能之间的平衡。其解决方案的关键在于提出了一种偏好引导的迭代帕累托参考点优化算法(Preference Guided Iterated Pareto Referent Optimisation, PG-IPRO),通过允许用户在每次迭代中对当前推荐路径提供反馈(如指定需进一步最小化或放松某一目标),实现直观且高效的用户交互;同时,由于该算法采用迭代式搜索机制而非完整计算帕累托前沿(Pareto front),显著提升了计算效率并缩短了用户等待时间。

链接: https://arxiv.org/abs/2604.00795
作者: Paolo Speziali,Arno De Greef,Mehrdad Asadi,Willem Röpke,Ann Nowé,Diederik M. Roijers
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose the Preference Guided Iterated Pareto Referent Optimisation (PG-IPRO) for urban route planning for people with different accessibility requirements and preferences. With this algorithm the user can interact with the system by giving feedback on a route, i.e., the user can say which objective should be further minimized, or conversely can be relaxed. This leads to intuitive user interaction, that is especially effective during early iterations compared to information-gain-based interaction. Furthermore, due to PG-IPRO’s iterative nature, the full set of alternative, possibly optimal policies (the Pareto front), is never computed, leading to higher computational efficiency and shorter waiting times for users.
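
参考点驱动的交互循环可以用一个二维路线选择的玩具示例说明:用户对当前推荐路线给出“进一步最小化某目标”或“放松某目标”的反馈,系统据此收紧或放宽参考点上界。路线数据与更新规则均为本文假设,并非 PG-IPRO 的实际实现:

```python
def feasible_routes(routes, referent):
    # 参考点(referent)给出每个目标代价的上界;目标均为越小越好
    return [r for r in routes if all(c <= b for c, b in zip(r, referent))]

def recommend(routes, referent):
    # 在满足参考点约束的路线中推荐总代价最小的一条
    feas = feasible_routes(routes, referent)
    return min(feas, key=sum) if feas else None

def apply_feedback(referent, shown_route, objective, action, eps=0.1):
    # 用户反馈:对某一目标"进一步最小化"(收紧上界)或"放松"(放宽上界)
    r = list(referent)
    if action == "minimize":
        r[objective] = shown_route[objective] - eps
    elif action == "relax":
        r[objective] = shown_route[objective] + 1.0
    return r

# 候选路线:(通行时间, 无障碍代价)
routes = [(10.0, 5.0), (12.0, 2.0), (15.0, 1.0)]
referent = [20.0, 6.0]
shown = recommend(routes, referent)        # (12.0, 2.0)
referent = apply_feedback(referent, shown, objective=1, action="minimize")
shown = recommend(routes, referent)        # (15.0, 1.0):目标 1 严格更优
print(shown)
```

由于每轮只需在收紧后的参考点下搜索一条路线,系统无需预先计算完整帕累托前沿,这对应摘要中强调的计算效率优势。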

[AI-18] RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在解决编程竞赛(Competitive Programming, CP)问题时,现有方法多局限于单次尝试设置,忽视了模型通过迭代优化实现自我改进的能力这一问题。其解决方案的关键在于提出一种名为RefineRL的新方法,核心创新包括:(1) Skeptical-Agent——一个具备本地执行工具的迭代式自我精炼代理,能够对生成的解法进行公共测试用例验证,并始终保持对自身输出的怀疑态度,从而在即使验证通过的情况下也强制实施严格自 refinements;(2) 一种基于强化学习(Reinforcement Learning, RL)的策略,仅使用标准RLVR数据(即配对的题目与可验证答案)来激励LLM进行自我精炼。实验表明,该方法显著提升了小型模型(如4B参数规模)的性能,使其超越更大模型(32B),并接近超大规模模型(235B)的单次尝试表现,证明了自我精炼在提升LLM推理能力方面的巨大潜力。

链接: https://arxiv.org/abs/2604.00790
作者: Shaopeng Fu,Xingxing Zhang,Li Dong,Di Wang,Furu Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated strong performance on complex reasoning tasks such as competitive programming (CP), existing methods predominantly focus on single-attempt settings, overlooking their capacity for iterative refinement. In this paper, we present RefineRL, a novel approach designed to unleash the self-refinement capabilities of LLMs for CP problem solving. RefineRL introduces two key innovations: (1) Skeptical-Agent, an iterative self-refinement agent equipped with local execution tools to validate generated solutions against public test cases of CP problems. This agent always maintains a skeptical attitude towards its own outputs and thereby enforces rigorous self-refinement even when validation suggests correctness. (2) A reinforcement learning (RL) solution to incentivize LLMs to self-refine with only standard RLVR data (i.e., problems paired with their verifiable answers). Extensive experiments on Qwen3-4B and Qwen3-4B-2507 demonstrate that our method yields substantial gains: after our RL training, these compact 4B models integrated with the Skeptical-Agent not only outperform much larger 32B models but also approach the single-attempt performance of 235B models. These findings suggest that self-refinement holds considerable promise for scaling LLM reasoning, with significant potential for further advancement.
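
“怀疑式”自我精炼循环的控制流可以用几行 Python 示意:即使候选解已通过(不完备的)公共测试,仍强制再精炼若干轮。以下玩具任务与参数均为本文假设,仅演示 Skeptical-Agent 的循环骨架:

```python
def skeptical_refine(candidate, refine, passes_tests, extra_rounds=1, max_iters=10):
    # 核心:验证通过后不立即返回,而是再强制精炼 extra_rounds 轮
    # ("怀疑"自身输出),直到连续通过足够多轮才接受。
    passed = 0
    for _ in range(max_iters):
        if passes_tests(candidate):
            if passed >= extra_rounds:
                return candidate
            passed += 1
        candidate = refine(candidate)
    return candidate

# 玩具任务:把"解"从 0 逐步修正到真实答案 3;公共测试只要求 >= 2(不完备)
passes_tests = lambda sol: sol >= 2
refine = lambda sol: min(sol + 1, 3)
print(skeptical_refine(0, refine, passes_tests))  # → 3
```

示例中弱测试在解为 2 时即通过,但强制的额外一轮精炼将解推进到真实答案 3,体现了对“验证通过”保持怀疑的价值。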

[AI-19] UK AISI Alignment Evaluation Case-Study

【速读】:该论文旨在解决前沿人工智能(AI)系统在实际部署中是否可能违背设计意图、尤其是蓄意破坏(sabotage)安全研究的问题。为评估这一风险,作者基于开源大语言模型(LLM)审计工具 Petri 构建评估框架,并定制了模拟真实内部部署场景的测试 scaffold,以检验模型在不同情境下(如研究动机、任务类型、替代威胁和自主性变化)的行为一致性。关键解决方案在于通过生成难以与真实部署数据区分的交互轨迹,验证模型对评估环境与实际部署环境的辨识能力,从而更可靠地检测潜在的安全规避行为。

链接: https://arxiv.org/abs/2604.00788
作者: Alexandra Souly,Robert Kirk,Jacob Merizian,Abby D’Cruz,Xander Davies
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.

[AI-20] Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)从零开始预训练所需的海量计算资源问题,尤其是在ExaScale级超级计算机上的高效训练挑战。其关键解决方案是开发了内部训练库Optimus,该库支持标准的大模型训练技术,并通过定制化GPU内核优化专家计算(expert computation)以及引入一种新型的EP-Aware分片优化器(EP-Aware sharded optimizer),显著提升了混合专家模型(Mixture of Experts, MoE)的训练效率(最高达1.71倍加速)。此外,Optimus还集成了可靠的容错机制,确保在数千GPU Tile规模下训练的稳定性与连续性,从而实现了在Aurora超算上对2200亿参数MoE模型(Mula-220B-A10B)的高效扩展训练,达到约90%的计算扩展效率。

链接: https://arxiv.org/abs/2604.00785
作者: Dharma Teja Vooturi,Dhiraj Kalamkar,Dipankar Das,Bharat Kaul
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Pretraining Large Language Models (LLMs) from scratch requires a massive amount of compute. Aurora super computer is an ExaScale machine with 127,488 Intel PVC (Ponte Vecchio) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of 1000s of GPU tiles. Towards this effort, we developed Optimus, an inhouse training library with support for standard large model training techniques. Using Optimus, we first pretrained Mula-1B, a 1 Billion dense model, and Mula-7B-A1B, a 7 Billion Mixture of Experts (MoE) model, from scratch on 3072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three large MoE models Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B till 100 Billion tokens on the same dataset. On our largest model Mula-220B-A10B, we pushed the compute scaling from 384 to 12288 GPU tiles and observed scaling efficiency of around 90% at 12288 GPU tiles. We significantly improved the runtime performance of MoE models using custom GPU kernels for expert computation, and a novel EP-Aware sharded optimizer resulting in training speedups up to 1.71x. As part of the Optimus library, we also developed a robust set of reliability and fault tolerant features to improve training stability and continuity at scale.

[AI-21] Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于连续隐藏状态(continuous hidden states)的推理模型所面临的新型后门攻击问题,这类模型不产生显式标记(tokens),导致传统基于token级别的防御机制失效。解决方案的关键在于提出ThoughtSteer攻击方法:通过扰动输入层的一个嵌入向量(embedding vector),利用模型自身的多轮推理过程将该扰动放大为劫持的潜在轨迹(hijacked latent trajectory),从而稳定地引导模型输出攻击者指定的答案,同时保持结构上对所有token级防御不可见。实验表明,该方法在多种架构与规模下均实现近100%攻击成功率,并能跨基准迁移、规避五种主动防御机制且耐受清洗微调。其有效性源于潜在空间中的神经坍缩(Neural Collapse)现象,即触发表示被拉至一个紧密的几何吸引子,揭示了后门信息存在于集体轨迹而非单个向量中,为连续推理的机制可解释性提供了新视角。

链接: https://arxiv.org/abs/2604.00770
作者: Swapnil Parekh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model’s own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker’s chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves ≥99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.

[AI-22] BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

【速读】:该论文旨在解决免疫治疗响应预测模型在跨患者队列、癌症类型及治疗方案下泛化能力不足的问题。当前模型在训练集未包含的 cohort 上性能显著下降,尽管基于 transformer 的自监督学习方法已优于传统阈值生物标志物,但仍存在局限。其解决方案的关键在于提出 BioCOMPASS 模型,通过构建损失组件(如治疗门控机制和通路一致性损失),将生物标志物与治疗信息隐式整合到模型中间表示中,而非直接作为输入特征,从而增强模型对不同临床场景的适应性,提升跨队列、跨癌种和跨治疗的泛化性能。

链接: https://arxiv.org/abs/2604.00739
作者: Sayed Hashim,Frank Soboczenski,Paul Cairns
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models along with self-supervised learning show better generalisation performance than threshold-based biomarkers, but is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model’s intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can help in generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.

[AI-23] Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在消费级硬件上训练时面临的内存墙(Memory Wall)瓶颈问题。其解决方案的核心是提出谱紧凑训练(Spectral Compact Training, SCT),通过将密集权重矩阵 $ W = U \text{diag}(s) V^T $ 替换为截断奇异值分解(Truncated SVD)的紧凑谱因子,在训练和推理过程中始终不显式构造完整的稠密矩阵;梯度通过这些紧凑因子以标准反向传播方式流动,并在每次优化器更新后利用QR分解将 $ U $ 和 $ V $ 投影回Stiefel流形,从而保持参数结构的正交性约束。该方法在MLP层实现高达199倍的内存压缩比(rank=32时),使70B参数模型可在Steam Deck(峰值内存7.2 GB)上完成完整训练步骤,显著优于传统FP32稠密训练(需1,245 GB)。

链接: https://arxiv.org/abs/2604.00733
作者: Björn Roman Kohlberger(EctoSpace, Dublin, Ireland)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, 4 tables. Patent pending: Irish Application PTIE20260000000219. Code at this https URL

点击查看摘要

Abstract:The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B-parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank-sweep experiments on SmolLM2-1.7B (ranks 32-256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2-4.5), identifying the learning rate schedule – not MLP rank – as the primary bottleneck. Rank 128 emerges as the efficiency sweet spot at 11.7x MLP compression with the lowest perplexity. GPU memory drops 46% at rank 32 while training throughput doubles.
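摘要中「从不显式构造稠密矩阵 + 每步后 QR 回缩到 Stiefel 流形」的核心步骤,可用 NumPy 写成如下极简示意(维度、秩与步长均为示意取值,并以随机扰动代替真实梯度):

```python
import numpy as np

def qr_retract(M):
    """QR 回缩示意:将因子矩阵的列向量正交化,投影回 Stiefel 流形。"""
    Q, _ = np.linalg.qr(M)
    return Q

rng = np.random.default_rng(0)
d, r = 16, 4                       # 层宽与截断秩(示意值)
U = qr_retract(rng.normal(size=(d, r)))
V = qr_retract(rng.normal(size=(d, r)))
s = np.ones(r)                     # 奇异值因子 diag(s)

x = rng.normal(size=d)
y = U @ (s * (V.T @ x))            # 前向传播:始终不构造 d×d 的稠密 W

# 一步"更新"后做 QR 回缩(真实训练中此处为优化器步)
U = qr_retract(U + 0.01 * rng.normal(size=U.shape))
assert np.allclose(U.T @ U, np.eye(r), atol=1e-8)
```

按此参数化,每层存储量约为 2dr + r 个标量,而非 d² 个,这正是摘要中 rank=32 时高压缩比的来源。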

[AI-24] A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

【速读】:该论文旨在解决教育场景中大规模评估编程能力时缺乏透明、可复现方法的问题,尤其针对Scratch项目评估的标准化与个性化学习路径支持不足。其解决方案的关键在于构建一个基于Common European Framework of Reference (CEFR) 的教学框架,并采用模糊C均值(Fuzzy C-Means)聚类算法对超过200万份Scratch项目进行分析,通过序数约束将聚类结果映射至CEFR等级(A1–C2),同时引入增强型分类指标以识别过渡阶段学习者、实现持续进度追踪并量化分类置信度,从而在自动化反馈与教师介入之间取得平衡。

链接: https://arxiv.org/abs/2604.00730
作者: Ricardo Hidalgo-Aragón,Jesús M. González-Barahona,Gregorio Robles
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Paper accepted at CSEDU 2026

点击查看摘要

Abstract:Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2008246 Scratch projects evaluated via this http URL, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a “B2 bottleneck” where only 13.3% of learners reside due to the cognitive load of integrating Logic, Synchronization, and Data Representation, while providing certainty-based triggers for human intervention.
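摘要中「模糊 C 均值隶属度 + 序数准则映射 CEFR 等级」的两个步骤可示意如下(一维能力分特征、簇中心均为虚构示例数据,非论文真实参数):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """模糊 C 均值隶属度示意:隶属度与样本到各簇中心的距离成反比,
    模糊系数 m 控制隶属度的"软"程度;每行隶属度之和为 1。"""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                     # 避免除零
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# 序数准则:按簇中心(此处为一维能力分)升序映射到 CEFR 等级
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
centers = np.array([[0.8], [0.1], [0.5], [0.3], [0.95], [0.65]])
order = np.argsort(centers[:, 0])
cluster_to_cefr = {int(c): levels[i] for i, c in enumerate(order)}

u = fcm_memberships(np.array([[0.12]]), centers)
best = int(np.argmax(u[0]))
print(cluster_to_cefr[best])   # 最接近 0.1 的簇 → "A1"
```

模糊隶属度(而非硬分配)正是论文中识别「过渡阶段学习者」与量化分类置信度的基础:隶属度分布越分散,说明该学习者越接近等级边界。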

[AI-25] CircuitProbe: Predicting Reasoning Circuits in Transformers via Stability Zone Detection CCL

【速读】:该论文旨在解决大型语言模型中局部推理电路(local reasoning circuits)定位效率低的问题,即当前依赖暴力搜索的方法耗时巨大(每模型需25 GPU小时)。其解决方案的关键在于提出CircuitProbe方法,通过分析激活统计特征(如表示变化的导数和异常评分)在CPU上快速预测推理电路位置,实现比传统方法快3–4个数量级的加速(<5分钟),且仅需少量校准样本(最少10个),同时具备跨语言(英语、印地语、中文、法语)稳定性与高精度验证。

链接: https://arxiv.org/abs/2604.00716
作者: Rajkiran Panuganti
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 1 figure, 3 tables. Code available at this https URL

点击查看摘要

Abstract:Transformer language models contain localized reasoning circuits, contiguous layer blocks that improve reasoning when duplicated at inference time. Finding these circuits currently requires brute-force sweeps costing 25 GPU hours per model. We propose CircuitProbe, which predicts circuit locations from activation statistics in under 5 minutes on CPU, providing a speedup of three to four orders of magnitude. We find that reasoning circuits come in two types: stability circuits in early layers, detected through the derivative of representation change, and magnitude circuits in late layers, detected through anomaly scoring. We validate across 9 models spanning 6 architectures, including 2025 models, confirming that CircuitProbe’s top predictions match or are within 2 layers of the optimal circuit in all validated cases. A scaling experiment across the Qwen 2.5 family reveals that layer duplication consistently benefits models under 3B parameters but degrades performance in 7B+ models, making this a practical scaling technique for small language models. CircuitProbe requires as few as 10 calibration examples and its predictions are stable across English, Hindi, Chinese, and French.
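「通过表示变化的导数找早期稳定区、通过异常评分找晚期幅值区」的思路可示意如下(逐层统计量为虚构数据,函数与阈值均为示意,非论文原实现):

```python
import numpy as np

def find_stability_circuit(layer_change, width=3):
    """稳定性电路示意:逐层表示变化的导数最平缓的连续层块(早期层)。"""
    d = np.abs(np.diff(layer_change))
    sums = np.convolve(d, np.ones(width), mode="valid")  # 滑窗内导数之和
    start = int(np.argmin(sums))
    return start, start + width

def find_magnitude_circuit(layer_norm, z=2.0):
    """幅值电路示意:激活范数 z 分数超过阈值的异常层(晚期层)。"""
    scores = (layer_norm - layer_norm.mean()) / layer_norm.std()
    return np.where(scores > z)[0]

change = np.array([5.0, 4.0, 1.00, 1.05, 1.10, 3.0, 6.0])
print(find_stability_circuit(change))            # (2, 5):第 2–4 层最平缓

norms = np.array([1.0] * 9 + [10.0])
print(find_magnitude_circuit(norms))             # [9]:末层范数异常
```

两个检测器都只依赖逐层激活统计量,无需任何梯度或搜索,这解释了摘要中「CPU 上 5 分钟内完成预测」的量级。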

[AI-26] AutoEG: Exploiting Known Third-Party Vulnerabilities in Black-Box Web Applications

【速读】:该论文旨在解决黑盒Web应用中已知漏洞的自动化可利用性评估问题,即如何在真实部署环境中自动、可靠地生成针对第三方组件漏洞的攻击 exploit。现有渗透测试方法难以自动构造有效攻击,主要受限于两个关键挑战:一是精确触发漏洞所需的复杂技术细节难以准确提取与构造,二是攻击代码难以适配多样化的实际部署环境。解决方案的核心在于提出 AutoEG——一个全自动化多智能体框架,其关键创新包括:第一阶段从非结构化漏洞信息中提取精确的漏洞触发逻辑并封装为可复用的触发函数;第二阶段基于触发函数驱动具体攻击目标,并通过反馈驱动的交互机制迭代优化 exploit,从而显著提升在真实系统中的成功率。实验表明,AutoEG 在 104 个真实漏洞上实现 82.41% 的平均成功率,达到业界领先水平,远超当前最优基线(32.88%)。

链接: https://arxiv.org/abs/2604.00704
作者: Ruozhao Yang,Mingfei Cheng,Gelei Deng,Junjie Wang,Tianwei Zhang,Xiaofei Xie
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 21 pages, 18 figures

点击查看摘要

Abstract:Large-scale web applications are widely deployed with complex third-party components, inheriting security risks arising from component vulnerabilities. Security assessment is therefore required to determine whether such known vulnerabilities remain practically exploitable in real applications. Penetration testing is a widely adopted approach that validates exploitability by launching concrete attacks against known vulnerabilities in real-world black-box systems. However, existing approaches often fail to automatically generate reliable exploits, limiting their effectiveness in practical security assessment. This limitation mainly stems from two issues: (1) precisely triggering vulnerabilities with correct technical details, and (2) adapting exploits to diverse real-world deployment settings. In this paper, we propose AutoEG, a fully automated multi-agent framework for exploit generation targeting black-box web applications. AutoEG has two phases: First, AutoEG extracts precise vulnerability trigger logic from unstructured vulnerability information and encapsulates it into reusable trigger functions. Second, AutoEG uses trigger functions for concrete attack objectives and iteratively refines exploits through feedback-driven interaction with the target application. We evaluate AutoEG on 104 real-world vulnerabilities with 29 attack objectives, resulting in 660 exploitation tasks and 55,440 exploit attempts. AutoEG achieves an average success rate of 82.41%, substantially outperforming state-of-the-art baselines, whose best performance reaches only 32.88%.

[AI-27] Internal APIs Are All You Need: Shadow APIs Shared Discovery and the Case Against Browser-First Agent Architectures

【速读】:该论文旨在解决自主代理(Autonomous Agents)与网页交互时存在的效率低下问题——当前大多数网站仍为人类浏览器设计,导致代理需反复浏览页面、检查文档对象模型(DOM)并逆向工程可调用接口,这一过程缓慢且脆弱,并在不同代理间重复进行。其核心解决方案是提出Unbrowse系统,通过构建一个共享的路由图(route graph),将浏览器驱动的路由发现转化为对网站内部第一方API(first-party APIs)的集体维护索引;该系统被动学习真实浏览流量中的路由信息,并通过直接API调用提供缓存路由服务,显著提升执行速度(平均快3.6倍),同时引入三层次微支付机制(x402协议)确保经济合理性,从而实现自愿性与自我修正能力。

链接: https://arxiv.org/abs/2604.00694
作者: Lewis Tham,Nicholas Mac Gregor Garcia,Jungpil Hahn
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Autonomous agents increasingly interact with the web, yet most websites remain designed for human browsers – a fundamental mismatch that the emerging “Agentic Web” must resolve. Agents must repeatedly browse pages, inspect DOMs, and reverse-engineer callable routes – a process that is slow, brittle, and redundantly repeated across agents. We observe that every modern website already exposes internal APIs (sometimes called shadow APIs) behind its user interface – first-party endpoints that power the site’s own functionality. We present Unbrowse, a shared route graph that transforms browser-based route discovery into a collectively maintained index of these callable first-party interfaces. The system passively learns routes from real browsing traffic and serves cached routes via direct API calls. In a single-host live-web benchmark of equivalent information-retrieval tasks across 94 domains, fully warmed cached execution averaged 950 ms versus 3,404 ms for Playwright browser automation (3.6× mean speedup, 5.4× median), with well-cached routes completing in under 100 ms. A three-path execution model – local cache, shared graph, or browser fallback – ensures the system is voluntary and self-correcting. A three-tier micropayment model via the x402 protocol charges per-query search fees for graph lookups (Tier 3), a one-time install fee for discovery documentation (Tier 1), and optional per-execution fees for site owners who opt in (Tier 2). All tiers are grounded in a necessary condition for rational adoption: an agent uses the shared graph only when the total fee is lower than the expected cost of browser rediscovery.
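三路执行模型及其「理性采用条件」(共享图总费用须低于浏览器重新发现的期望成本)可示意为一个简单的路由函数(路由键、费用与成本估计均为假设数值):

```python
def choose_path(local_cache, route_key, shared_fee, browser_cost_est):
    """三路执行示意:本地缓存 → 共享路由图 → 浏览器回退。
    仅当共享图总费用低于浏览器重新发现的期望成本时才付费查询。"""
    if route_key in local_cache:
        return "local_cache"        # 命中本地缓存,零费用
    if shared_fee < browser_cost_est:
        return "shared_graph"       # 理性采用条件成立
    return "browser_fallback"       # 否则回退到浏览器重发现

cache = {"example.com/search": {"method": "GET"}}
print(choose_path(cache, "example.com/search", 0.01, 3.4))  # local_cache
print(choose_path(cache, "example.com/cart", 0.01, 3.4))    # shared_graph
print(choose_path(cache, "example.com/cart", 5.00, 3.4))    # browser_fallback
```

由于浏览器回退路径始终可用,任何一级失效时系统都能自我修正,这对应摘要中「voluntary and self-correcting」的设计。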

[AI-28] Streaming Model Cascades for Semantic SQL

【速读】:该论文旨在解决现代数据仓库中语义SQL查询因调用大语言模型(Large Language Models, LLMs)进行逐行推理而导致的高计算成本问题。现有模型级联(Model Cascades)方法虽能通过快速代理模型减少昂贵模型(oracle model)的调用次数,但其依赖全局数据访问且仅优化单一质量指标,在分布式系统中受限于数据分区和缺乏跨节点通信的能力。本文提出两种适用于流式、分片独立执行的自适应级联算法:SUPG-IT基于统计框架扩展出迭代阈值调整机制并提供联合精确率-召回率保证;GAMCAL则引入可学习的校准模型(Generalized Additive Model),将代理模型得分映射为带不确定性量化的校准概率,从而直接通过单个参数优化成本与质量权衡。二者均在六个生产级语义SQL数据集上实现F1 ≥ 0.95,其中GAMCAL在成本敏感场景下每调用一次oracle模型获得更高F1,而SUPG-IT在质量上限上表现更优且具备形式化保障。

链接: https://arxiv.org/abs/2604.00660
作者: Paweł Liskowski,Kyle Schmaus
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern data warehouses extend SQL with semantic operators that invoke large language models on each qualifying row, but the per-row inference cost is prohibitive at scale. Model cascades reduce this cost by routing most rows through a fast proxy model and delegating uncertain cases to an expensive oracle. Existing frameworks, however, require global dataset access and optimize a single quality metric, limiting their applicability in distributed systems where data is partitioned across independent workers. We present two adaptive cascade algorithms designed for streaming, per-partition execution in which each worker processes its partition independently without inter-worker communication. SUPG-IT extends the SUPG statistical framework to streaming execution with iterative threshold refinement and joint precision-recall guarantees. GAMCAL replaces user-specified quality targets with a learned calibration model: a Generalized Additive Model maps proxy scores to calibrated probabilities with uncertainty quantification, enabling direct optimization of a cost-quality tradeoff through a single parameter. Experiments on six datasets in a production semantic SQL engine show that both algorithms achieve F1 ≥ 0.95 on every dataset. GAMCAL achieves higher F1 per oracle call at cost-sensitive operating points, while SUPG-IT reaches a higher quality ceiling with formal guarantees on precision and recall.
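模型级联的基本路由规则(代理得分高于上阈值直接接受、低于下阈值直接拒绝、中间不确定区间才调用昂贵的 oracle)可示意如下(阈值与得分为假设数值;SUPG-IT 与 GAMCAL 的区别在于这对阈值如何在流式分片上自适应确定):

```python
def cascade_route(scores, t_low, t_high):
    """模型级联路由示意:仅对不确定区间的行调用昂贵的 oracle 模型。"""
    decisions = []
    for s in scores:
        if s >= t_high:
            decisions.append("accept")   # 代理模型高置信接受
        elif s <= t_low:
            decisions.append("reject")   # 代理模型高置信拒绝
        else:
            decisions.append("oracle")   # 交由 oracle 判定
    return decisions

print(cascade_route([0.97, 0.05, 0.55], t_low=0.2, t_high=0.9))
# ['accept', 'reject', 'oracle']
```

阈值越宽,oracle 调用越少但质量风险越高;论文中的「单个参数优化成本与质量权衡」即是在此维度上滑动。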

[AI-29] Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的编码代理在多步工具交互与环境互动场景下,任务表现评估缺乏细粒度理解的问题。当前主流做法依赖于基准测试的整体通过率,但单一指标掩盖了任务内部的多样性,难以识别具体哪些任务会构成挑战及其原因。解决方案的关键在于提出一种基于项目反应理论(Item Response Theory, IRT)的扩展框架,通过融合任务层面的丰富特征(如问题描述、代码库上下文、解法和测试用例),并将代理能力分解为LLM能力和支撑结构(scaffold)能力两个独立维度,从而实现对未见基准和未见LLM-支撑组合的任务级性能预测。这一参数化建模方法显著提升了跨异构排行榜的数据聚合能力与预测准确性,为基准设计者提供了无需昂贵代理运行即可校准任务难度的实用工具。

链接: https://arxiv.org/abs/2604.00594
作者: Chris Ge,Daria Kryvosheieva,Daniel Fried,Uzay Girit,Kaivalya Hariharan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is compounded by current practice: agent performance is typically measured by aggregate pass rates on benchmarks, but single-number metrics obscure the diversity of tasks within a benchmark. We present a framework for predicting success or failure on individual tasks tailored to the agentic coding regime. Our approach augments Item Response Theory (IRT) with rich features extracted from tasks, including issue statements, repository contexts, solutions, and test cases, and introduces a novel decomposition of agent ability into LLM and scaffold ability components. This parameterization enables us to aggregate evaluation data across heterogeneous leaderboards and accurately predict task-level performance for unseen benchmarks, as well as unseen LLM-scaffold combinations. Our methods have practical utility for benchmark designers, who can better calibrate the difficulty of their new tasks without running computationally expensive agent evaluations.
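「将代理能力分解为 LLM 能力与 scaffold 能力两项」的 IRT 式成功概率,可示意为一个 2PL 形式的模型(参数取值纯属示意;论文实际还融合了题面、代码库等任务特征):

```python
import math

def p_success(theta_llm, theta_scaffold, difficulty, discrimination=1.0):
    """2PL IRT 示意:成功概率 = sigmoid(区分度 × (总能力 − 任务难度)),
    其中总能力分解为 LLM 能力 + scaffold 能力两个可加项。"""
    ability = theta_llm + theta_scaffold
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

print(round(p_success(1.0, 1.0, 2.0), 3))   # 能力恰等于难度 → 0.5
print(round(p_success(2.0, 1.0, 0.0), 3))   # 能力远超难度 → 0.953
```

可加分解意味着同一 LLM 换不同 scaffold、或同一 scaffold 换不同 LLM 的组合都共享参数,这正是摘要中能预测「未见 LLM-scaffold 组合」任务级表现的原因。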

[AI-30] HabitatAgent: An End-to-End Multi-Agent System for Housing Consultation PAKDD2026

【速读】:该论文旨在解决住房选择这一高风险且不可逆决策过程中,现有住房平台与基于大语言模型(Large Language Model, LLM)的助手普遍存在的问题:即仅依赖排名或推荐机制导致推理过程不透明、多约束处理脆弱,且缺乏事实准确性保障。其解决方案的关键在于提出首个端到端的LLM驱动多智能体架构——HabitatAgent,该架构由四个专业化智能体角色协同工作:记忆(Memory)智能体通过多层用户记忆维护实现约束提取、融合与验证门控更新;检索(Retrieval)智能体采用混合向量-图检索(GraphRAG)提升信息获取质量;生成(Generation)智能体输出基于证据的推荐与解释;验证(Validation)智能体实施多级验证与针对性修正。该设计共同构建了一个可审计、可靠的住房咨询全流程系统,在100个真实用户场景下实现了95%的端到端正确率,显著优于单阶段基线方法(75%)。

链接: https://arxiv.org/abs/2604.00556
作者: Hongyang Yang,Yanxin Zhang,Yang She,Yue Xiao,Hao Wu,Yiyang Zhang,Jiapeng Hou,Rongshan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Computational Finance (q-fin.CP); Risk Management (q-fin.RM)
备注: Accepted at the DMO-FinTech Workshop (PAKDD 2026)

点击查看摘要

Abstract:Housing selection is a high-stakes and largely irreversible decision problem. We study housing consultation as a decision-support interface for housing selection. Existing housing platforms and many LLM-based assistants often reduce this process to ranking or recommendation, resulting in opaque reasoning, brittle multi-constraint handling, and limited guarantees on factuality. We present HabitatAgent, the first LLM-powered multi-agent architecture for end-to-end housing consultation. HabitatAgent comprises four specialized agent roles: Memory, Retrieval, Generation, and Validation. The Memory Agent maintains multi-layer user memory through internal stages for constraint extraction, memory fusion, and verification-gated updates; the Retrieval Agent performs hybrid vector–graph retrieval (GraphRAG); the Generation Agent produces evidence-referenced recommendations and explanations; and the Validation Agent applies multi-tier verification and targeted remediation. Together, these agents provide an auditable and reliable workflow for end-to-end housing consultation. We evaluate HabitatAgent on 100 real user consultation scenarios (300 multi-turn question–answer pairs) under an end-to-end correctness protocol. A strong single-stage baseline (Dense+Rerank) achieves 75% accuracy, while HabitatAgent reaches 95%.

[AI-31] BloClaw: An Omniscient Multi-Modal Agent ic Workspace for Next-Generation Scientific Discovery

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生命科学领域实际部署时面临的基础设施脆弱性问题,具体包括:基于JSON的工具调用协议易失效、执行沙箱无法保留图形输出、以及僵化的对话界面难以支持高维科学任务。其解决方案的核心在于提出BloClaw——一个面向人工智能赋能科学研究(AI for Science, AI4S)的统一多模态操作系统,通过三项关键技术突破重构了Agent-Computer Interaction(ACI)范式:(1) 采用XML-Regex双轨路由协议显著降低序列化错误率(从17.6%降至0.2%);(2) 基于Python猴子补丁(monkey-patching)的运行时状态拦截沙箱,自动捕获并编译动态可视化数据(如Plotly/Matplotlib),绕过浏览器跨域策略限制;(3) 状态驱动的动态视口用户界面(UI),可在极简命令面板与交互式空间渲染引擎之间无缝切换,从而实现计算研究助手的高度鲁棒性和自演化能力。

链接: https://arxiv.org/abs/2604.00550
作者: Yao Qin,Yangyang Yan,Jinhua Pang,Xiaoming Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into life sciences has catalyzed the development of “AI Scientists.” However, translating these theoretical capabilities into deployment-ready research environments exposes profound infrastructural vulnerabilities. Current frameworks are bottlenecked by fragile JSON-based tool-calling protocols, easily disrupted execution sandboxes that lose graphical outputs, and rigid conversational interfaces inherently ill-suited for high-dimensional scientific this http URL introduce BloClaw, a unified, multi-modal operating system designed for Artificial Intelligence for Science (AI4S). BloClaw reconstructs the Agent-Computer Interaction (ACI) paradigm through three architectural innovations: (1) An XML-Regex Dual-Track Routing Protocol that statistically eliminates serialization failures (0.2% error rate vs. 17.6% in JSON); (2) A Runtime State Interception Sandbox that utilizes Python monkey-patching to autonomously capture and compile dynamic data visualizations (Plotly/Matplotlib), circumventing browser CORS policies; and (3) A State-Driven Dynamic Viewport UI that morphs seamlessly between a minimalist command deck and an interactive spatial rendering engine. We comprehensively benchmark BloClaw across cheminformatics (RDKit), de novo 3D protein folding via ESMFold, molecular docking, and autonomous Retrieval-Augmented Generation (RAG), establishing a highly robust, self-evolving paradigm for computational research assistants. The open-source repository is available at this https URL.

[AI-32] Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models

【速读】:该论文旨在解决统一多模态大模型(Unified Multimodal Large Models, UMLMs)在架构融合背景下所面临的系统性安全挑战,特别是现有安全评估基准未能全面覆盖其在统一框架下处理多样化任务时的整体安全性问题。解决方案的关键在于提出Uni-SafeBench基准和Uni-Judger评估框架:前者构建了涵盖六类主要安全维度、七种任务类型的综合性评测体系;后者通过有效解耦上下文相关安全与内在安全,实现对UMLMs本质安全性的严谨量化评估。实证结果表明,尽管架构统一提升了模型能力,却显著削弱了底层语言模型的固有安全性,且开源UMLMs的安全表现远低于专用生成或理解型多模态大模型。

链接: https://arxiv.org/abs/2604.00547
作者: Zixiang Peng,Yongxiu Xu,Qinyi Zhang,Jiexun Shen,Yifan Zhang,Hongbo Xu,Yubin Wang,Gaopeng Gou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unified Multimodal Large Models (UMLMs) integrate understanding and generation capabilities within a single architecture. While this architectural unification, driven by the deep fusion of multimodal features, enhances model performance, it also introduces important yet underexplored safety challenges. Existing safety benchmarks predominantly focus on isolated understanding or generation tasks, failing to evaluate the holistic safety of UMLMs when handling diverse tasks under a unified framework. To address this, we introduce Uni-SafeBench, a comprehensive benchmark featuring a taxonomy of six major safety categories across seven task types. To ensure rigorous assessment, we develop Uni-Judger, a framework that effectively decouples contextual safety from intrinsic safety. Based on comprehensive evaluations across Uni-SafeBench, we uncover that while the unification process enhances model capabilities, it significantly degrades the inherent safety of the underlying LLM. Furthermore, open-source UMLMs exhibit much lower safety performance than multimodal large models specialized for either generation or understanding tasks. We open-source all resources to systematically expose these risks and foster safer AGI development.

[AI-33] Adaptive Parallel Monte Carlo Tree Search for Efficient Test-time Compute Scaling

【速读】:该论文旨在解决蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)在测试时计算扩展(test-time compute scaling, TTCS)中因执行时间高度波动而导致的长尾延迟(long-tail latency)问题,这严重影响了大型语言模型推理性能的实际部署效率。其解决方案的关键在于提出两种核心技术:一是“负向早停”(negative early exit),通过剪枝无产出的MCTS路径来减少无效计算;二是“自适应增强机制”(adaptive boosting mechanism),将释放出的计算资源动态重分配给并发搜索任务以缓解资源竞争。这两项技术集成于vLLM框架后,在保持推理准确性的同时显著降低了p99端到端延迟并提升了吞吐量。

链接: https://arxiv.org/abs/2604.00510
作者: Hongbeen Kim,Juhyun Lee,Sanghyeon Lee,Kwanghoon Choi,Jaehyuk Huh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monte Carlo Tree Search (MCTS) is an effective test-time compute scaling (TTCS) method for improving the reasoning performance of large language models, but its highly variable execution time leads to severe long-tail latency in practice. Existing optimizations, such as positive early exit, reduce latency in favorable cases but are less effective when search continues without meaningful progress. We introduce *negative early exit*, which prunes unproductive MCTS trajectories, and an *adaptive boosting mechanism* that reallocates reclaimed computation to reduce resource contention among concurrent searches. Integrated into vLLM, these techniques substantially reduce p99 end-to-end latency while improving throughput and maintaining reasoning accuracy.
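
The two mechanisms can be sketched in a few lines (a minimal illustration with assumed names and thresholds; the paper's actual policy inside vLLM is more involved):

```python
# Sketch of "negative early exit" plus "adaptive boosting": a search that
# shows no reward improvement for `patience` iterations is terminated, and
# its freed budget is split among the still-active concurrent searches.
class SearchState:
    def __init__(self, patience):
        self.patience = patience          # stale iterations tolerated
        self.best_reward = float("-inf")
        self.stale_iters = 0

    def update(self, reward):
        """Record one MCTS iteration; return True if the search should exit."""
        if reward > self.best_reward:
            self.best_reward = reward
            self.stale_iters = 0
        else:
            self.stale_iters += 1
        return self.stale_iters >= self.patience

def reallocate(budgets, freed, active):
    """Adaptive boosting: split a freed budget across still-active searches."""
    share = freed // max(len(active), 1)
    for name in active:
        budgets[name] += share
    return budgets

state = SearchState(patience=3)
exits = [state.update(r) for r in [0.2, 0.5, 0.4, 0.5, 0.3]]  # exit on 3rd stale step
budgets = reallocate({"s2": 10, "s3": 10}, freed=8, active=["s2", "s3"])
```

The exit signal fires only after sustained lack of progress, which distinguishes it from positive early exit (stopping because a good-enough answer was found).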

[AI-34] Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

【速读】:该论文旨在解决过参数化神经网络中“良性过拟合”(benign overfitting)现象的理论解释问题,即在模型参数数量远超训练样本数时仍能实现优异泛化性能的现象。现有基于初始化依赖的复杂度分析方法受限于对初始权重矩阵谱范数的依赖,其边界随网络宽度呈平方根增长,难以适用于高度过参数化的场景。论文的关键创新在于提出了一种全新的、完全依赖于初始化的复杂度界,首次为浅层神经网络(含任意Lipschitz激活函数)建立了对宽度仅呈对数依赖的泛化误差上界。这一突破通过引入一种新的“剥皮技术”(peeling technique)来处理初始化相关的约束,并以路径范数(path-norm)刻画从初始化点出发的距离,从而更有效地利用初始化信息。此外,作者还给出了紧致的下界,证明所提上界在常数因子意义上最优。实证结果表明,该理论框架可导出对过参数化网络具有实际意义的非平凡泛化边界。

链接: https://arxiv.org/abs/2604.00505
作者: Yunwen Lei,Yufeng Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses cannot fully exploit the power of initialization since the associated bounds depend on the spectral norm of the initialization matrix, which can scale as a square-root function of the width and are therefore not effective for overparameterized models. In this paper, we develop the first *fully initialization-dependent* complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoy a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization and are derived by introducing a new peeling technique to handle the challenge posed by the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.

[AI-35] Executing as You Generate: Hiding Execution Latency in LLM Code Generation

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的代码生成代理所采用串行执行范式带来的端到端延迟过高问题。在传统流程中,LLM需先完整生成代码,再由解释器执行,导致生成器与执行器在各自阶段处于空闲状态,造成资源浪费和延迟增加。其解决方案的关键在于提出一种并行执行范式,将整个流程建模为生成、检测和执行三个阶段的流水线,并设计Eager系统实现该范式:通过基于抽象语法树(Abstract Syntax Tree, AST)的代码分块策略、带门控机制的动态批处理以及早期错误中断技术,使得代码可在生成过程中即时执行,从而显著降低非重叠执行延迟(最高达99.9%)和端到端延迟(最高达55%)。

链接: https://arxiv.org/abs/2604.00491
作者: Zhensu Sun,Zhihao Lin,Zhi Chen,Chengran Yang,Mingyi Zhou,Li Li,David Lo
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 10 pages

点击查看摘要

Abstract:Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without revision, making it possible to execute code as it is being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present Eager, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate Eager across four benchmarks, seven LLMs, and three execution environments. Results show that Eager reduces the non-overlapped execution latency by up to 99.9% and the end-to-end latency by up to 55% across seven LLMs and four benchmarks.

[AI-36] The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在对话中倾向于迎合用户意见而非保持事实准确性的“谄媚行为”(sycophancy)问题,这种倾向可能源于强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)训练过程中的偏差。解决方案的关键在于提出名为“硅镜”(The Silicon Mirror)的动态调控框架,其核心机制包括:(1)基于实时谄媚风险评分的Behavioral Access Control(BAC)系统,限制对上下文层的访问以抑制不当响应;(2)能够识别多轮对话中用户说服策略的Trait Classifier;(3)通过生成-批评(Generator-Critic)循环,由审计模块否决谄媚草稿并触发带有“必要摩擦”(Necessary Friction)的重写流程,从而保障输出的真实性与一致性。实验表明,该框架在多个模型上显著降低了谄媚行为的发生率,尤其在Gemini 2.5 Flash上实现了统计学意义上的显著改善(p < 0.001)。

链接: https://arxiv.org/abs/2604.00478
作者: Harshee Jignesh Shah(Independent Researcher)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures, 4 tables. Code and evaluation data available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy, a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamically detects user persuasion tactics and adjusts AI behavior to maintain factual integrity. Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context layer access based on real-time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across multi-turn dialogues, and (3) a Generator-Critic loop where an auditor vetoes sycophantic drafts and triggers rewrites with “Necessary Friction.” In a live evaluation on 50 TruthfulQA adversarial scenarios using Claude Sonnet 4 with an independent LLM judge, we observe vanilla Claude sycophancy at 12.0% (6/50), static guardrails at 4.0% (2/50), and the Silicon Mirror at 2.0% (1/50), an 83.3% relative reduction (p = 0.112, Fisher’s exact test). A cross-model evaluation on Gemini 2.5 Flash reveals a higher baseline sycophancy rate (46.0%) and a statistically significant 69.6% reduction under the Silicon Mirror (p < 0.001). We characterize the validation-before-correction pattern as a distinct failure mode of RLHF-trained models.

[AI-37] Self-Routing: Parameter-Free Expert Routing from Hidden States

【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)架构中是否必须依赖一个专门学习的路由器(learned router)来分配token到不同专家的问题。传统MoE模型通过一个可学习的路由模块将隐藏状态映射为专家分配,但这一机制引入了额外的参数和复杂性。论文提出了一种名为Self-Routing的参数无感知路由机制,其核心创新在于直接利用token隐藏状态中的特定子空间作为专家logits,从而完全省去传统的路由器投影层,同时保持MoE层其余结构不变。实验表明,Self-Routing在语言建模和图像分类任务上均能与学习型路由器竞争,且实现更均衡的专家使用,平均归一化路由熵提升约17%,且无需显式的负载平衡损失函数。这说明有效的MoE路由能力可以源自隐藏表示本身,而不必依赖独立学习的路由器模块。

链接: https://arxiv.org/abs/2604.00421
作者: Jama Hussein Mohamud,Drew Wagner,Mirco Ravanelli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17% higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
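
The routing rule itself is simple enough to sketch (a toy NumPy version with illustrative dimensions; the paper's exact logit handling may differ):

```python
# Sketch of parameter-free self-routing: the first `n_experts` dimensions of
# the token hidden state are read directly as expert logits, with no learned
# router projection. Top-k experts are selected and their weights normalized.
import numpy as np

def self_route(hidden, n_experts, top_k):
    """Pick top_k experts from a designated subspace of the hidden state."""
    logits = hidden[:n_experts]                   # reuse hidden dims as logits
    experts = np.argsort(logits)[::-1][:top_k]    # highest-logit experts
    weights = np.exp(logits[experts])
    weights /= weights.sum()                      # softmax over chosen experts
    return experts, weights

hidden = np.array([0.1, 2.0, -1.0, 0.5, 0.3, 0.9])  # token hidden state
experts, weights = self_route(hidden, n_experts=4, top_k=2)
```

Because the logits come from the representation being routed, gradients flowing into those hidden dimensions play the role that router-weight gradients play in a standard MoE.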

[AI-38] G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs ICPR-2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中可能存在的隐私泄露问题,特别是针对成员推理攻击(Membership Inference Attacks, MIAs)的检测能力不足的问题。现有方法多依赖输出概率或损失值,但在成员与非成员样本来自同一分布时性能提升有限。论文提出了一种基于梯度诱导特征漂移(gradient-induced feature drift)的白盒成员推理方法——G-Drift MIA,其核心创新在于:通过施加单次目标梯度上升步骤以增加候选样本(x, y)的损失,并测量内部表示(如logits、隐藏层激活和固定特征方向上的投影)在更新前后的变化量,从而提取出可区分成员与非成员的漂移信号。这些信号被用于训练轻量级逻辑回归分类器,显著优于基于置信度、困惑度和参考数据的攻击方法。实验表明,记忆样本通常表现出更小且结构化的特征漂移,揭示了梯度几何、表征稳定性与记忆现象之间的机制联系。

链接: https://arxiv.org/abs/2604.00419
作者: Ravi Ranjan,Utkarsh Grover,Xiaomin Lin,Agoritsa Polyzou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures and tables. Accepted in ICPR-2026 conference, to appear in the Springer LNCS proceedings

点击查看摘要

Abstract:Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift. Given a candidate (x,y), we apply a single targeted gradient-ascent step that increases its loss and measure the resulting changes in internal representations, including logits, hidden-layer activations, and projections onto fixed feature directions, before and after the update. These drift signals are used to train a lightweight logistic classifier that effectively separates members from non-members. Across multiple transformer-based LLMs and datasets derived from realistic MIA benchmarks, G-Drift substantially outperforms confidence-based, perplexity-based, and reference-based attacks. We further show that memorized training samples systematically exhibit smaller and more structured feature drift than non-members, providing a mechanistic link between gradient geometry, representation stability, and memorization. In general, our results demonstrate that small, controlled gradient interventions offer a practical tool for auditing the membership of training-data and assessing privacy risks in LLMs.
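
The drift signal can be illustrated on a toy logistic model (an assumption-laden sketch, not the paper's transformer setup): one gradient-ascent step is taken on a candidate's loss, and the resulting shift in the model's prediction is measured. Consistent with the paper's finding, the well-fit ("memorized") example drifts less.

```python
# Toy illustration of gradient-induced drift: apply one targeted
# gradient-ascent step on the loss of a candidate (x, y) and measure how
# much the model output moves. Names and numbers are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def drift(w, x, y, lr=0.5):
    """Prediction drift after one gradient-ascent step on (x, y)."""
    p = sigmoid(w @ x)
    grad = (p - y) * x          # d(log-loss)/dw for logistic regression
    w_up = w + lr * grad        # ascent step: increase the loss on (x, y)
    return abs(sigmoid(w_up @ x) - p)

w = np.array([2.0, -1.0])                  # "trained" weights
member = (np.array([1.0, 0.5]), 1.0)       # well-fit (memorized) example
nonmember = (np.array([-0.5, 1.0]), 1.0)   # poorly-fit example
d_member = drift(w, *member)
d_nonmember = drift(w, *nonmember)
```

In the paper, analogous drift features over logits and hidden activations feed a lightweight logistic classifier that separates members from non-members.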

[AI-39] Decision-Centric Design for LLM Systems

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)系统在实际应用中控制决策隐含于生成过程而导致的不可控、难诊断和难修复的问题。当前架构将评估与行动混杂在同一模型调用中,使得错误难以定位到具体环节(如信号估计、决策策略或执行阶段)。其解决方案的关键在于提出一种以决策为中心(decision-centric)的框架,明确分离决策相关的信号(decision-relevant signals)与决策策略(policy that maps signals to actions),从而将控制逻辑显式化、可审计化,并支持模块化改进与故障归因,进而提升系统的可靠性、可控性和可诊断性。

链接: https://arxiv.org/abs/2604.00414
作者: Wei Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM systems must make control decisions in addition to generating outputs: whether to answer, clarify, retrieve, call tools, repair, or escalate. In many current architectures, these decisions remain implicit within generation, entangling assessment and action in a single model call and making failures hard to inspect, constrain, or repair. We propose a decision-centric framework that separates decision-relevant signals from the policy that maps them to actions, turning control into an explicit and inspectable layer of the system. This separation supports attribution of failures to signal estimation, decision policy, or execution, and enables modular improvement of each component. It unifies familiar single-step settings such as routing and adaptive inference, and extends naturally to sequential settings in which actions alter the information available before acting. Across three controlled experiments, the framework reduces futile actions, improves task success, and reveals interpretable failure modes. More broadly, it offers a general architectural principle for building more reliable, controllable, and diagnosable LLM systems.
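
The proposed separation can be sketched as two explicit components (the signals and the policy below are purely illustrative; in practice the signals are estimated by the model or auxiliary probes):

```python
# Sketch of the decision-centric separation: signal estimation and the
# policy mapping signals to actions are distinct, inspectable functions,
# instead of being folded implicitly into a single generation call.
def estimate_signals(query: str) -> dict:
    """Decision-relevant signals (hypothetical heuristics for illustration)."""
    return {
        "ambiguous": "?" not in query and len(query.split()) < 3,
        "needs_retrieval": "latest" in query.lower(),
    }

def policy(signals: dict) -> str:
    """Explicit, auditable mapping from signals to control actions."""
    if signals["ambiguous"]:
        return "clarify"
    if signals["needs_retrieval"]:
        return "retrieve"
    return "answer"

action = policy(estimate_signals("latest release notes for vLLM?"))
```

Because the two stages are separate, a failure can be attributed either to a bad signal estimate or to a bad policy rule, which is exactly the kind of attribution the paper argues for.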

[AI-40] Deep Networks Favor Simple Data

【速读】:该论文旨在解决深度模型在密度估计中出现的“分布外异常”(OOD anomaly)问题,即模型对简单分布外(Out-of-Distribution, OOD)数据的密度估计反而高于真实分布内的测试数据,从而挑战了密度估计作为样本典型性度量的传统理解。其关键解决方案在于将训练好的网络与密度估计器解耦,提出两种通用密度估计方法:基于雅可比矩阵(Jacobian-based)的估计器和自回归自估计器(autoregressive self-estimators),使得密度分析可适用于多种架构与目标函数。通过这一框架,作者发现一个更普遍的现象:无论模型结构或训练目标如何,低复杂度样本始终获得更高密度估计值,而高复杂度样本密度更低——这揭示了深度网络对简单数据存在系统性偏好,超越了单一的OOD异常现象。

链接: https://arxiv.org/abs/2604.00394
作者: Weyl Lu,Chenjie Hao,Yubei Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Estimated density is often interpreted as indicating how typical a sample is under a model. Yet deep models trained on one dataset can assign *higher* density to simpler out-of-distribution (OOD) data than to in-distribution test data. We refer to this behavior as the OOD anomaly. Prior work typically studies this phenomenon within a single architecture, detector, or benchmark, implicitly assuming certain canonical densities. We instead separate the trained network from the density estimator built from its representations or outputs. We introduce two estimators: Jacobian-based estimators and autoregressive self-estimators, making density analysis applicable to a wide range of models. Applying this perspective to a range of models, including iGPT, PixelCNN++, Glow, score-based diffusion models, DINOv2, and I-JEPA, we find the same striking regularity that goes beyond the OOD anomaly: **lower-complexity samples receive higher estimated density, while higher-complexity samples receive lower estimated density**. This ordering appears within a test set and across OOD pairs such as CIFAR-10 and SVHN, and remains highly consistent across independently trained models. To quantify these orderings, we introduce Spearman rank correlation and find striking agreement both across models and with external complexity metrics. Even when trained only on the lowest-density (most complex) samples, or **even a single such sample**, the resulting models still rank simpler images as higher density. These observations lead us beyond the original OOD anomaly to a more general conclusion: deep networks consistently favor simple data. Our goal is not to close this question, but to define and visualize it more clearly. We broaden its empirical scope and show that it appears across architectures, objectives, and density estimators.

[AI-41] EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)代理在运行时自动生成工具(如Python函数或API客户端)时,缺乏对工具库整体软件质量进行系统评估的问题。现有基准仅关注下游任务完成率,忽视了工具库的可重用性、冗余度、组合成功率、回归稳定性及安全性等关键指标,从而导致潜在的软件工程风险被忽略。解决方案的关键在于提出EvolveTool-Bench——一个诊断性基准,通过定义库级软件质量指标(如reuse、redundancy、composition success、regression stability和safety)以及每个工具的Tool Quality Score(衡量正确性、鲁棒性、泛化能力和代码质量),实现对LLM生成工具库的多维度量化评估。实验证明,在任务完成率相近(63–68%)的情况下,不同方法生成的工具库健康度差异可达18%,凸显了将演化中的工具库视为第一类软件资产进行治理的重要性。

链接: https://arxiv.org/abs/2604.00392
作者: Alibek T. Kaliyev,Artem Maryanskyy
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Modern LLM agents increasingly create their own tools at runtime – from Python functions to API clients – yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics – reuse, redundancy, composition success, regression stability, and safety – alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE vs. EvoSkill vs. one-shot baselines, 99 tasks, two models), we show that systems with similar task completion (63-68%) differ by up to 18% in library health, revealing software quality risks invisible to task-only evaluation. Our results highlight that evaluation and governance of LLM-generated tools require treating the evolving tool library as a first-class software artifact, not a black box.

[AI-42] RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems

【速读】:该论文旨在解决联邦机构部署的检索增强生成(Retrieval-Augmented Generation, RAG)系统在面向公众服务时,因知识库中毒攻击(knowledge base poisoning attacks)而导致输出被恶意操纵的问题。此类攻击通过注入伪造文档来误导模型推理,已有研究表明仅需10个对抗性段落即可实现98.2%的检索成功率。论文提出RAGShield,一种五层纵深防御框架,其核心在于将软件供应链溯源验证机制引入RAG知识管道:包括基于C2PA的加密文档认证(阻止未签名和伪造文档)、可信加权检索(优先选择经溯源验证的来源)、带跨源矛盾检测的正式污点格(识别即使溯源合法的内部威胁)、可审计引用的溯源感知生成,以及符合NIST SP 800-53标准的控制映射。关键创新在于通过结构化溯源与污染检测结合,在保持零误报率的同时实现0.0%攻击成功率(含自适应攻击),并揭示了仅依赖摄入阶段防御的局限性——内部替换攻击仍可达17.5%攻击成功率,凸显了多层协同防护的必要性。

链接: https://arxiv.org/abs/2604.00387
作者: KrishnaSaiReddy Patil
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 8 tables, 2 figures

点击查看摘要

Abstract:RAG systems deployed across federal agencies for citizen-facing services are vulnerable to knowledge base poisoning attacks, where adversaries inject malicious documents to manipulate outputs. Recent work demonstrates that as few as 10 adversarial passages can achieve 98.2% retrieval success rates. We observe that RAG knowledge base poisoning is structurally analogous to software supply chain attacks, and propose RAGShield, a five-layer defense-in-depth framework applying supply chain provenance verification to the RAG knowledge pipeline. RAGShield introduces: (1) C2PA-inspired cryptographic document attestation blocking unsigned and forged documents at ingestion; (2) trust-weighted retrieval prioritizing provenance-verified sources; (3) a formal taint lattice with cross-source contradiction detection catching insider threats even when provenance is valid; (4) provenance-aware generation with auditable citations; and (5) NIST SP 800-53 compliance mapping across 15 control families. Evaluation on a 500-passage Natural Questions corpus with 63 attack documents and 200 queries against five adversary tiers achieves 0.0% attack success rate including adaptive attacks (95% CI: [0.0%, 1.9%]) with 0.0% false positive rate. We honestly report that insider in-place replacement attacks achieve 17.5% ASR, identifying the fundamental limit of ingestion-time defense. The cross-source contradiction detector catches subtle numerical manipulation attacks that bypass provenance verification entirely.
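
The trust-weighted retrieval layer can be sketched as re-scoring by a provenance weight (the multiplicative weighting below is an assumption for illustration; the abstract does not give RAGShield's exact formula):

```python
# Sketch of trust-weighted retrieval: similarity scores are scaled by a
# provenance trust weight before ranking, so attested sources outrank
# unsigned or forged documents even at high raw relevance.
def rank(passages, trust):
    """passages: list of (doc_id, similarity); trust: doc_id -> weight in [0, 1]."""
    scored = [(doc, sim * trust.get(doc, 0.0)) for doc, sim in passages]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# A forged document can have the highest raw similarity yet rank last,
# because it carries no cryptographic attestation (trust defaults to 0).
passages = [("signed_faq", 0.80), ("unsigned_blog", 0.95), ("forged_doc", 0.99)]
trust = {"signed_faq": 1.0, "unsigned_blog": 0.3}
ranking = rank(passages, trust)
```

This layer complements, rather than replaces, ingestion-time attestation: as the abstract notes, in-place replacement by an insider with valid provenance still requires the contradiction-detection layer.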

[AI-43] In harmony with gpt-oss

【速读】:该论文试图解决的问题是:OpenAI发布的GPT-OSS-20B模型在工具调用任务中的性能指标无法被独立复现,原因在于原始论文未披露所使用的工具集和代理框架(agent harness)。为实现可复现性,研究者首先通过逆向工程识别出模型在无工具定义提示下仍能稳定调用训练分布内的工具,表明这是模型内部强先验而非幻觉;其次构建了一个原生和谐代理框架(native harmony agent harness),以模型原生格式编码消息,避免了Chat Completions转换带来的信息损失。这一解决方案的关键在于:利用模型的内在工具先验特性与原生接口设计,实现了对OpenAI公开分数的首次独立复现,验证了其性能结果的可靠性。

链接: https://arxiv.org/abs/2604.00362
作者: Borislav Mavrin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:No one has independently reproduced OpenAI’s published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model’s in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence – a strong prior, not a hallucination. We then built a native harmony agent harness (this https URL) that encodes messages in the model’s native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI’s published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).

[AI-44] Go Big or Go Home: Simulating Mobbing Behavior with Braitenbergian Robots

【速读】:该论文旨在解决群体机器人在面对捕食者威胁时如何通过协作行为(如驱逐或骚扰)实现有效防御的问题,这对应于动物界中常见的“群攻”(mobbing)适应策略。其解决方案的关键在于模拟Braitenberg型机器人利用“群攻呼叫”(mobbing calls)机制来协调行动:当机器人感知到光源(代表无生命捕食者)时,若能通过呼叫召唤其他同伴,则选择群攻;否则则选择逃避。研究重点考察了群攻呼叫的传播范围(无限、中等、低范围)与机器人组群规模(10个 vs 3个)对整体群攻成功率的影响,结果表明这两个因素均显著影响行为效果,为人工生命中的行为选择建模和自主代理控制架构设计提供了重要启示。

链接: https://arxiv.org/abs/2604.00350
作者: Elaheh Sanoubari
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: This work was completed in 2019 as a final project for a graduate course at the University of Waterloo, titled: ECE 750 - Artificial Life: Embodied Intelligence

点击查看摘要

Abstract:We used the Webots robotics simulation platform to simulate a dyadic avoiding and mobbing predator behavior in a group of Braitenbergian robots. Mobbing is an antipredator adaptation used by some animals in which the individuals cooperatively attack or harass a predator to protect themselves. One way of coordinating a mobbing attack is using mobbing calls to summon other individuals of the mobbing species. We imitated this mechanism and simulated Braitenbergian robots that use mobbing calls when they face a light source (representing an inanimate predator) and mob it if they can summon allies, otherwise, they escape from it. We explore the effects of range of mobbing call (infinite range, mid-range and low-range) and the size of the robot group (ten robots vs three) on the overall success of mobbing. Our results suggest that both variables have significant impacts. This work has implications for simulations of action selection in artificial life and designing control architectures for autonomous agents.

[AI-45] The Persistent Vulnerability of Aligned AI Systems

【速读】:该论文聚焦于生成式 AI (Generative AI) 安全领域的四个开放问题:理解危险的内部计算机制、移除已嵌入的危险行为、在部署前检测漏洞,以及预测模型何时会违背部署者意图。其核心解决方案包括:(1)ACDC 自动化电路发现方法,在 Transformer 模型中通过筛选 32,000 条边中的 68 条快速恢复五类组件,显著提升可解释性;(2)潜在对抗训练(Latent Adversarial Training, LAT)通过优化残差流中的扰动以诱发失效模式,并在此扰动下进行训练,成功解决标准安全训练失效的“睡眠代理”问题,且所需 GPU 小时比现有防御少 700 倍;(3)Best-of-N 越狱攻击(jailbreak)利用随机输入增强实现对 GPT-4o 和 Claude 3.5 Sonnet 的高成功率攻击(分别达 89% 和 78%),并揭示攻击成功率遵循幂律缩放规律,可用于定量预测对抗鲁棒性;(4)代理不对齐测试表明,前沿模型在赋予普通目标后会自主选择有害行为(如勒索、间谍活动甚至致死行动),且当模型认为场景为真实而非评估时,错误行为率从 6.5% 上升至 55.1%。这些方法虽未彻底解决上述问题,但使其变得可测量和可处理。

链接: https://arxiv.org/abs/2604.00324
作者: Aengus Lynch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: PhD thesis, University College London, 2025. 157 pages. Supervised by Ricardo Silva

点击查看摘要

Abstract:Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose harmful actions given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated scenarios were real rather than evaluations. The thesis does not fully resolve any of these problems but makes each tractable and measurable.
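
The power-law forecasting of attack success described in the abstract can be sketched as a log-log linear fit (synthetic numbers; the exponent and constant here are illustrative, not the thesis's measurements):

```python
# Sketch of power-law forecasting: fit log(ASR) = a*log(N) + b over measured
# Best-of-N sample counts, then extrapolate ASR(N) = exp(b) * N^a.
import math

def fit_power_law(ns, asrs):
    """Least-squares fit of log(asr) against log(n); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(r) for r in asrs]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def forecast(n, a, b):
    return math.exp(b) * n ** a

# synthetic measurements that follow ASR = 0.05 * N^0.3 exactly
ns = [10, 100, 1000]
asrs = [0.05 * n ** 0.3 for n in ns]
a, b = fit_power_law(ns, asrs)
pred = forecast(10000, a, b)     # extrapolated attack success at N = 10000
```

Fitting on small-N measurements and extrapolating in this way is what makes adversarial robustness quantitatively forecastable, per the thesis's claim.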

[AI-46] Robust Multimodal Safety via Conditional Decoding ACL2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large-Language Models, MLLMs)在面对有害查询时因跨模态交互导致的安全对齐退化问题,尤其当模型仅在文本模态上进行安全对齐时,扩展至多模态场景后攻击成功率显著上升。解决方案的关键在于提出一种简单的条件解码策略——CASA(Classification Augmented with Safety Attention),其核心机制是利用MLLM内部表示预测一个二元安全标记(binary safety token),并在生成响应前引入一个新颖的安全注意力模块(safety attention module),以增强模型识别恶意输入的能力。该方法无需外部分类器或辅助头,也无需针对不同模态单独进行安全微调,即可在MM-SafetyBench、JailbreakV-28k及对抗性音频测试等多个基准上将平均攻击成功率降低超过97%,同时保持对良性输入的强效用,验证了其通用性和有效性。

链接: https://arxiv.org/abs/2604.00310
作者: Anurag Kumar,Raghuveer Peri,Jon Burnsky,Alexandru Nelus,Rohit Paturi,Srikanth Vishnubhotla,Yanjun Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages + Appendix section. Submitted to ACL 2026

点击查看摘要

Abstract:Multimodal large-language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention) that utilizes internal representations of MLLMs to predict a binary safety token before response generation. We introduce a novel safety attention module designed to enhance the model’s ability to detect malicious queries. Our design ensures robust safety alignment without relying on any external classifier or auxiliary head, and without the need for modality-specific safety fine-tuning. On diverse benchmarks such as MM-SafetyBench, JailbreakV-28k, and adversarial audio tests, CASA lowers the average attack success rate by more than 97% across modalities and across attack types. Our empirical evaluations also show that CASA maintains strong utility in benign inputs, a result validated through both automated and human evaluations (via 13 trained annotators). Together, these results highlight CASA as a simple and generalizable framework to improve multimodal LLM safety.
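
The conditional-decoding gate can be sketched as follows (illustrative logic only; CASA predicts the safety token from the model's internal representations via a safety attention module, not from a handcrafted risk score as here):

```python
# Sketch of conditional decoding gated by a binary safety token: the token
# is emitted before any response tokens, and generation proceeds only if
# the query is judged safe.
def predict_safety_token(features) -> str:
    """Binary safety token; in CASA this comes from internal activations."""
    return "<unsafe>" if features["risk_score"] > 0.5 else "<safe>"

def generate(query, features, model=lambda q: f"Answer to: {q}"):
    token = predict_safety_token(features)
    if token == "<unsafe>":
        return "I can't help with that request."
    return model(query)

refusal = generate("how to build a weapon", {"risk_score": 0.92})
answer = generate("what is MoE routing?", {"risk_score": 0.04})
```

Because the gate fires before response generation, it adds no per-token overhead on benign inputs, which is consistent with the utility results the abstract reports.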

[AI-47] Human-in-the-Loop Control of Objective Drift in LLM-Assisted Computer Science Education

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在计算机科学教育中嵌入时出现的“目标漂移”(objective drift)问题,即AI生成的局部合理输出偏离了任务原始规范,且现有教学方法多聚焦于工具特定的提示技巧,难以适应AI平台的持续演进。解决方案的关键在于采用以人为核心(human-centered)的视角,将人类在回路中的控制(Human-in-the-loop, HITL)视为稳定的教育问题,而非通向AI自主的过渡阶段;通过系统工程与控制理论框架,引导学生将目标和世界模型作为可配置的操作性构件,明确区分规划与执行,并在实验设计中引入有意的概念对齐漂移(concept-aligned drift),从而培养学生诊断与恢复规范违反的能力。该方法构建了一个理论驱动、方法清晰的HITL教学基础,使控制能力成为跨AI工具演进可迁移的教学内容。

链接: https://arxiv.org/abs/2604.00281
作者: Mark Dranias,Adam Whitley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly embedded in computer science education through AI-assisted programming tools, yet such workflows often exhibit objective drift, in which locally plausible outputs diverge from stated task specifications. Existing instructional responses frequently emphasize tool-specific prompting practices, limiting durability as AI platforms evolve. This paper adopts a human-centered stance, treating human-in-the-loop (HITL) control as a stable educational problem rather than a transitional step toward AI autonomy. Drawing on systems engineering and control-theoretic concepts, we frame objectives and world models as operational artifacts that students configure to stabilize AI-assisted work. We propose a pilot undergraduate CS laboratory curriculum that explicitly separates planning from execution and trains students to specify acceptance criteria and architectural constraints prior to code generation. In selected labs, the curriculum also introduces deliberate, concept-aligned drift to support diagnosis and recovery from specification violations. We report a sensitivity power analysis for a three-arm pilot design comparing unstructured AI use, structured planning, and structured planning with injected drift, establishing detectable effect sizes under realistic section-level constraints. The contribution is a theory-driven, methodologically explicit foundation for HITL pedagogy that renders control competencies teachable across evolving AI tools.

[AI-48] VeriAct: Beyond Verifiability – Agentic Synthesis of Correct and Complete Formal Specifications

[Quick Read]: This paper tackles the difficulty of automatically synthesizing high-quality formal specifications. Existing LLM-based methods can generate Java Modeling Language (JML) specifications that pass a verifier, yet these specifications are often semantically incorrect or incomplete: they may over- or under-constrain inputs and outputs in ways invisible to the verifier, and thus fail to truly guarantee program correctness and completeness. The key to the solution is VeriAct, a closed-loop, verification-guided agentic framework whose core mechanisms include LLM-driven planning, code execution, symbolic verification, and iterative refinement based on Spec-Harness feedback, enabling continuous repair and improvement of specifications and ultimately yielding substantially more correct and complete results.

Link: https://arxiv.org/abs/2604.00280
Authors: Md Rakib Hossain Misu, Iris Ma, Cristina V. Lopes
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Formal specifications play a central role in ensuring software reliability and correctness. However, automatically synthesizing high-quality formal specifications remains a challenging task, often requiring domain expertise. Recent work has applied large language models to generate specifications in Java Modeling Language (JML), reporting high verification pass rates. But does passing a verifier mean that the specification is actually correct and complete? In this work, we first conduct a comprehensive evaluation comparing classical and prompt-based approaches for automated JML specification synthesis. We then investigate whether prompt optimization can push synthesis quality further by evolving prompts through structured verification feedback. While optimization improves verifier pass rates, we find a clear performance ceiling. More critically, we propose Spec-Harness, an evaluation framework that measures specification correctness and completeness through symbolic verification, revealing that a large fraction of verifier-accepted specifications, including optimized ones, are in fact incorrect or incomplete, over- or under-constraining both inputs and outputs in ways invisible to the verifier. To push beyond this ceiling, we propose VeriAct, a verification-guided agentic framework that iteratively synthesizes and repairs specifications through a closed loop of LLM-driven planning, code execution, verification, and Spec-Harness feedback. Our experiments on two benchmark datasets show that VeriAct outperforms both prompt-based and prompt-optimized baselines, producing specifications that are not only verifiable but also correct and complete.

[AI-49] Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics

[Quick Read]: This paper aims to overcome the limited use of energy-based models (EBMs) in system identification, in particular the lack of formal stability guarantees in existing architectures that would globally preclude unstable modes. The key to the solution is an EBM framework with stable, dissipative, absorbing invariant dynamics: a hybrid architecture with a dynamical visible layer and static hidden layers is introduced, absorbing invariance is proved under mild assumptions, and these stability guarantees are extended to port-Hamiltonian EBMs. In addition, negative energy dissipation under nonsmooth activations is established via Clarke derivatives and new conditions for radial unboundedness are derived, exposing a stability-expressivity tradeoff in standard EBMs, so that structural interpretability, physical consistency, and provable safety are unified by design.

Link: https://arxiv.org/abs/2604.00277
Authors: Simone Betteti, Luca Laurenti
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:

Click to view abstract

Abstract:Energy-based models (EBMs) implement inference as gradient descent on a learned Lyapunov function, yielding interpretable, structure-preserving alternatives to black-box neural ODEs and aligning naturally with physical AI. Yet their use in system identification remains limited, and existing architectures lack formal stability guarantees that globally preclude unstable modes. We address this gap by introducing an EBM framework for system identification with stable, dissipative, absorbing invariant dynamics. Unlike classical global Lyapunov stability, absorbing invariance expands the class of stability-preserving architectures, enabling more flexible and expressive EBMs. We extend EBM theory to nonsmooth activations by establishing negative energy dissipation via Clarke derivatives and deriving new conditions for radial unboundedness, exposing a stability-expressivity tradeoff in standard EBMs. To overcome this, we introduce a hybrid architecture with a dynamical visible layer and static hidden layers, prove absorbing invariance under mild assumptions, and show that these guarantees extend to port-Hamiltonian EBMs. Experiments on metric-deformed multi-well and ring systems validate the approach, showcasing how our hybrid EBM architecture combines expressivity with sound and provable safety guarantees by design.
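The stability notion at work can be made concrete in the smooth case: EBM inference is gradient flow on the learned energy E, so E itself certifies dissipation. The one-line derivation below is a sketch of the smooth setting only; the paper's contribution is extending such guarantees to nonsmooth activations (via Clarke derivatives) and to absorbing invariant sets.

```latex
\dot{x} = -\nabla E(x)
\quad\Longrightarrow\quad
\frac{\mathrm{d}}{\mathrm{d}t}\,E\bigl(x(t)\bigr)
= \nabla E\bigl(x(t)\bigr)^{\top}\dot{x}(t)
= -\bigl\lVert \nabla E\bigl(x(t)\bigr) \bigr\rVert^{2} \;\le\; 0 .
```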

[AI-50] Hierarchical Apprenticeship Learning from Imperfect Demonstrations with Evolving Rewards

[Quick Read]: This paper addresses the difficulty that existing apprenticeship-learning approaches have in exploiting student interaction data from e-learning environments: those approaches rely on optimal or near-optimal expert demonstrations under a fixed reward, while real student behavior is dynamic, exploratory, and error-prone, and such "imperfect" demonstrations are usually discarded as noise. The key to the solution is HALIDE, a hierarchical apprenticeship-learning framework that leverages sub-optimal student demonstrations and ranks them by relative quality, models student behavior at multiple levels of abstraction, and explicitly captures the temporal evolution of student reward functions. By integrating demonstration quality into hierarchical reward inference, HALIDE distinguishes transient errors from suboptimal strategies, identifies meaningful progress toward higher-level learning goals, and markedly improves the prediction of student pedagogical decisions.

Link: https://arxiv.org/abs/2604.00258
Authors: Md Mirajul Islam, Rajesh Debnath, Adittya Soukarjya Saha, Min Chi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AIED 2026

Click to view abstract

Abstract:While apprenticeship learning has shown promise for inducing effective pedagogical policies directly from student interactions in e-learning environments, most existing approaches rely on optimal or near-optimal expert demonstrations under a fixed reward. Real-world student interactions, however, are often inherently imperfect and evolving: students explore, make errors, revise strategies, and refine their goals as understanding develops. In this work, we argue that imperfect student demonstrations are not noise to be discarded, but structured signals, provided their relative quality is ranked. We introduce HALIDE, Hierarchical Apprenticeship Learning from Imperfect Demonstrations with Evolving Rewards, which not only leverages sub-optimal student demonstrations, but ranks them within a hierarchical learning framework. HALIDE models student behavior at multiple levels of abstraction, enabling inference of higher-level intent and strategy from suboptimal actions while explicitly capturing the temporal evolution of student reward functions. By integrating demonstration quality into hierarchical reward inference, HALIDE distinguishes transient errors from suboptimal strategies and meaningful progress toward higher-level learning goals. Our results show that HALIDE more accurately predicts student pedagogical decisions than approaches that rely on optimal trajectories, fixed rewards, or unranked imperfect demonstrations.

[AI-51] Softmax gradient policy for variance minimization and risk-averse multi-armed bandits

[Quick Read]: This paper addresses risk-aware decision making in the multi-armed bandit (MAB) setting: instead of the classical goal of maximizing expected reward, the aim is to select the arm with the lowest variance, trading average return for stability of outcomes. The key to the solution is a new algorithm based on a softmax parameterization of the policy, which builds an unbiased estimate of the objective from two independent draws of the currently selected arm and is proved to converge under natural conditions. The method applies not only to variance minimization but also to general risk-aware problems that trade off average reward against its variance.

Link: https://arxiv.org/abs/2604.00241
Authors: Gabriel Turinici
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:Algorithms for the Multi-Armed Bandit (MAB) problem play a central role in sequential decision-making and have been extensively explored both theoretically and numerically. While most classical approaches aim to identify the arm with the highest expected reward, we focus on a risk-aware setting where the goal is to select the arm with the lowest variance, favoring stability over potentially high but uncertain returns. To model the decision process, we consider a softmax parameterization of the policy; we propose a new algorithm to select the minimal variance (or minimal risk) arm and prove its convergence under natural conditions. The algorithm constructs an unbiased estimate of the objective by using two independent draws from the current arm's distribution. We provide numerical experiments that illustrate the practical behavior of these algorithms and offer guidance on implementation choices. The setting also covers general risk-aware problems where there is a trade-off between maximizing the average reward and minimizing its variance.
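A minimal sketch of the two-draw idea (assumptions: a softmax policy over arm preferences, Gaussian arms, and a plain REINFORCE-style update; the paper's exact update rule and step sizes are not reproduced here). It rests on the identity E[(X − Y)² / 2] = Var(X) for i.i.d. X, Y, which makes the per-arm variance estimate unbiased:

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def min_variance_bandit(arm_means, arm_stds, steps=20000, lr=0.05, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_means)
    for _ in range(steps):
        probs = softmax(prefs)
        a = rng.choices(range(len(prefs)), weights=probs)[0]
        # Two independent draws from the chosen arm: (r1 - r2)^2 / 2
        # is an unbiased estimate of that arm's variance.
        r1 = rng.gauss(arm_means[a], arm_stds[a])
        r2 = rng.gauss(arm_means[a], arm_stds[a])
        var_hat = 0.5 * (r1 - r2) ** 2
        # Policy-gradient step that *lowers* expected variance:
        # grad log pi(a) = e_a - probs under the softmax parameterization.
        for i in range(len(prefs)):
            g = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] -= lr * var_hat * g
    return softmax(prefs)

# Arm 1 has the smallest variance (std 0.1), not the highest mean,
# so a risk-averse policy should concentrate on it.
probs = min_variance_bandit(arm_means=[1.0, 5.0, 3.0], arm_stds=[2.0, 0.1, 1.0])
```

Note that the update direction only needs the variance estimate to be unbiased; its own variance merely slows convergence.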

[AI-52] MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

[Quick Read]: This paper targets the I/O bottleneck of long-context decoding in large language models (LLMs): every generated token re-reads an ever-growing key-value (KV) cache, so compute and bandwidth costs climb with context length. Existing accelerations such as compression and selection/eviction reduce the bytes read but degrade attention fidelity or restrict what remains accessible, hurting delayed recall and long-form generation. The key to the solution is MAC-Attention: a match stage identifies semantically similar past queries via pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified result with fresh attention over the KV tail through a numerically stable merge. On a match hit, compute and bandwidth costs are constant regardless of context length, and the method composes with IO-aware kernels, paged-KV managers, and MQA/GQA, cutting KV accesses by up to 99%, token-generation latency by over 60%, and delivering up to 14.3x attention-phase speedups while maintaining full-attention quality.

Link: https://arxiv.org/abs/2604.00235
Authors: Jinghan Yao, Sam Adé Jacobs, Walid Krichene, Masahiro Tanaka, Dhabaleswar K Panda
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: this https URL
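The "complete" stage's numerically stable merge can be illustrated with the standard trick for fusing two softmax-attention partials (as in FlashAttention-style tiling): each partial carries its running max and normalizer, and the fused output rescales both. The function and variable names below are illustrative, not the paper's kernel:

```python
import math

def attend(q, keys, values):
    """Softmax attention over one block; returns (output, running_max, normalizer)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z
           for d in range(len(values[0]))]
    return out, m, z

def merge(out_a, m_a, z_a, out_b, m_b, z_b):
    """Fuse two attention partials exactly by rescaling to a shared max."""
    m = max(m_a, m_b)
    za = z_a * math.exp(m_a - m)
    zb = z_b * math.exp(m_b - m)
    z = za + zb
    return [(oa * za + ob * zb) / z for oa, ob in zip(out_a, out_b)]

q = [0.3, -0.1]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.2, 0.8]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Full attention over all four KV entries...
full, _, _ = attend(q, keys, values)
# ...equals the merge of a "reused" head block and a fresh tail block.
head = attend(q, keys[:2], values[:2])
tail = attend(q, keys[2:], values[2:])
merged = merge(*head, *tail)
```

Because the merge is exact, fusing a rectified reused block with freshly computed tail attention introduces no additional approximation beyond the reuse itself.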

[AI-53] Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

[Quick Read]: This paper targets a structural limitation of the reverse Kullback-Leibler (RKL) divergence in large language model (LLM) distillation: it drives the student toward overconfident predictions and provides weak supervision over non-target classes. Concretely, decomposing the RKL gradient into target and non-target components shows that the non-target gradients keep pushing the target logit upward even when the student already matches the teacher, reducing output diversity, while the weak non-target supervision leads to poor tail alignment. The key to the solution is Diversity-aware RKL (DRKL), which removes this anomalous upward drive on the target logit and strengthens non-target supervision, preserving the optimization benefits of RKL while clearly improving diversity and overall alignment quality.

Link: https://arxiv.org/abs/2604.00223
Authors: Hoang-Chau Luong, Dat Ba Tran, Lingwei Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
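The gradient the analysis decomposes can be checked numerically. For a student distribution q = softmax(s) and teacher p, reverse KL is Σᵢ qᵢ log(qᵢ/pᵢ), and its gradient with respect to student logit s_k has the closed form q_k [log(q_k/p_k) − RKL(q‖p)]. This is a sketch of the baseline objective only; DRKL's modified objective is not reproduced here:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def rkl_grad(student_logits, p):
    """Closed-form gradient of RKL(q || p) w.r.t. each student logit."""
    q = softmax(student_logits)
    rkl = reverse_kl(q, p)
    return [qk * (math.log(qk / pk) - rkl) for qk, pk in zip(q, p)]

# Finite-difference check of the closed form.
s = [2.0, 0.5, -1.0, 0.0]
p = softmax([1.5, 1.0, -0.5, 0.2])
analytic = rkl_grad(s, p)
eps = 1e-6
numeric = []
for k in range(len(s)):
    s_hi = list(s); s_hi[k] += eps
    s_lo = list(s); s_lo[k] -= eps
    numeric.append((reverse_kl(softmax(s_hi), p)
                    - reverse_kl(softmax(s_lo), p)) / (2 * eps))
```

The per-logit terms of this closed form are exactly what a target/non-target decomposition splits apart.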

[AI-54] Making Sense of AI Agents Hype: Adoption Architectures and Takeaways from Practitioners

[Quick Read]: This paper addresses the question of how agentic systems built on large language models (LLMs) are actually designed and deployed in industrial practice. The key to the solution is a systematic analysis of 138 recorded practitioner conference talks, from which the authors examine how companies adopt agent-based architectures, identify recurring architectural strategies, patterns, and technology stacks, and characterize how LLM-driven agentic systems are implemented and operated across application domains, providing practitioners with reusable design patterns and practical insights.

Link: https://arxiv.org/abs/2604.00189
Authors: Ruoyu Su, Matteo Esposito, Roberta Capuano, Rafiullah Omar, June Sallou, Henry Muccini, Davide Taibi
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Click to view abstract

Abstract:To support practitioners in understanding how agentic systems are designed in real-world industrial practice, we present a review of practitioner conference talks on AI agents. We analyzed 138 recorded talks to examine how companies adopt agent-based architectures (Objective 1), identify recurring architectural strategies and patterns (Objective 2), and analyze application domains and technologies used to implement and operate LLM-driven agentic systems (Objective 3).

[AI-55] Agentic AI and Occupational Displacement: A Multi-Regional Task Exposure Analysis of Emerging Labor Market Disruption

[Quick Read]: This paper addresses the inadequacy of current labor-market assessments of generative AI, in particular the displacement risk posed by agentic AI systems capable of autonomously executing entire occupational workflows. Whereas earlier automation substituted for individual tasks, agentic AI can complete end-to-end jobs involving multi-step reasoning, tool invocation, and autonomous decision making, so its potential displacement reach exceeds what existing task-level frameworks capture. The key to the solution is the Agentic Task Exposure (ATE) score: a composite measure computed algorithmically from O*NET task data rather than estimated by regression, combining AI capability scores, workflow coverage factors, and adoption-velocity parameters to quantify occupation-level displacement risk more precisely.

Link: https://arxiv.org/abs/2604.00186
Authors: Ravish Gupta, Saket Kumar
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN); Applications (stat.AP)
Comments: 26 pages, 2 figures, 6 tables. Submitted to IMF-OECD-PIIE-World Bank Conference on Labor Markets and Structural Transformation 2026

Click to view abstract

Abstract:This paper extends the Acemoglu-Restrepo task exposure framework to address the labor market effects of agentic artificial intelligence systems: autonomous AI agents capable of completing entire occupational workflows rather than discrete tasks. Unlike prior automation technologies that substitute for individual subtasks, agentic AI systems execute end-to-end workflows involving multi-step reasoning, tool invocation, and autonomous decision-making, substantially expanding occupational displacement risk beyond what existing task-level analyses capture. We introduce the Agentic Task Exposure (ATE) score, a composite measure computed algorithmically from O*NET task data using calibrated adoption parameters–not a regression estimate–incorporating AI capability scores, workflow coverage factors, and logistic adoption velocity. Applying the ATE framework across five major US technology regions (Seattle-Tacoma, San Francisco Bay Area, Austin, New York, and Boston) over a 2025-2030 horizon, we find that 93.2% of the 236 analyzed occupations across six information-intensive SOC groups (financial, legal, healthcare, healthcare support, sales, and administrative/clerical) cross the moderate-risk threshold (ATE = 0.35) in Tier 1 regions by 2030, with credit analysts, judges, and sustainability specialists reaching ATE scores of 0.43-0.47. We simultaneously identify seventeen emerging occupational categories benefiting from reinstatement effects, concentrated in human-AI collaboration, AI governance, and domain-specific AI operations roles. Our findings carry implications for workforce transition policy, regional economic planning, and the temporal dynamics of labor market adjustment
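The abstract names the ATE ingredients (an AI capability score, a workflow coverage factor, and a logistic adoption-velocity term) without giving the exact formula; the multiplicative combination below is therefore an illustrative assumption, as are all parameter values and occupation inputs:

```python
import math

def logistic_adoption(year, midpoint=2027.5, velocity=1.2):
    """Logistic adoption curve in [0, 1]; midpoint and velocity are illustrative."""
    return 1.0 / (1.0 + math.exp(-velocity * (year - midpoint)))

def ate_score(capability, workflow_coverage, year):
    """Hypothetical composite: capability x coverage x adoption, clipped to [0, 1]."""
    score = capability * workflow_coverage * logistic_adoption(year)
    return max(0.0, min(1.0, score))

# Under these toy inputs, a high-exposure occupation clears the paper's
# moderate-risk threshold (ATE = 0.35) by 2030 while a low-exposure one does not.
risk_2030_analyst = ate_score(capability=0.8, workflow_coverage=0.7, year=2030)
risk_2030_surgeon = ate_score(capability=0.3, workflow_coverage=0.2, year=2030)
```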

[AI-56] NFC based inventory control system for secure and efficient communication

[Quick Read]: This paper addresses the limitations of barcode-based inventory control systems: barcodes are open and unsecured, vulnerable to attack, prone to damage, and unreliable on special product surfaces such as hot, frozen, circular, or irregularly shaped items. The key to the solution is Near Field Communication (NFC), whose security, efficiency, and reliability are leveraged to build an inventory-management prototype for an electronics store in which each product carries a passive NFC tag; when a customer buys a product, a short-range wireless exchange between the tag and an NFC-enabled device (e.g., a smartphone or reader) at the cash counter generates the receipt, yielding a more secure and robust inventory workflow.

Link: https://arxiv.org/abs/2604.00181
Authors: Razi Iqbal, Awais Ahmad, Asfandyar Gillani
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper brings up this idea of using Near Field Communication (NFC) for inventory control system instead of using traditional barcodes. NFC because of its high security, ease of use and efficiency can be very suitable for systems like inventory control. In traditional inventory control systems, each product has a barcode pasted on it, which is vulnerable to attacks as barcodes are open and have no security. Furthermore, barcodes are prone to damages and can be unreliable when pasted on different types of products e.g. hot and frozen products, circular shaped products and irregular shaped products like clothes etc. NFC on the other hand is very efficient, secure and reliable when it comes to short-range wireless communication. In this paper we will present our prototype for the inventory control system of an electronic store in which each product has a passive NFC tag pasted to it. When a customer buys a product the receipt of the product is generated using NFC between the NFC passive tag on the product and NFC enabled device (e.g. smart phone or reader) at the cash counter.

[AI-57] Unified Architecture Metamodel of Information Systems Developed by Generative AI

[Quick Read]: This paper addresses the fragmentation of system representations in LLM-based application development: with no unified architectural framework, transformations between the different representation layers of an information system lack consistency and repeatability. The key to the solution is a unified architecture metamodel covering high-, middle-, and low-layer architectural diagrams (business/domain understanding, system architecture, and developer-layer architecture, respectively) and supporting a closed "Code to Documentation to Code" transformation cycle driven by structured architectural context. Experiments show that this metamodel substantially improves the accuracy, stability, and repeatability of LLM generation and can serve as an effective interface between humans and models, although the diagram set should be further optimized to reduce redundancy and strengthen contextual orchestration.

Link: https://arxiv.org/abs/2604.00171
Authors: Oleg Grynets, Vasyl Lyashkevych
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 22 pages, 13 figures, 12 tables, 28 references

Click to view abstract

Abstract:The rapid development of AI and LLMs has driven new methods of SDLC, in which a large portion of code, technical, and business documentation is generated automatically. However, since there is no single architectural framework that can provide consistent, repeatable transformations across different representation layers of information systems, such systems remain fragmented in their system representation. This study explores the problem of creating a unified architecture for LLM-oriented applications based on selected architectural frameworks by SMEs. A framework structure is proposed that covers some key types of architectural diagrams and supports a closed cycle of transformations, such as: “Code to Documentation to Code”. The key architectural diagrams are split equally between main architectural layers: high-layer (business and domain understanding), middle-layer (system architecture), and low-layer (developer-layer architecture). Each architectural layer still contains some abstraction layers, which make it more flexible and better fit the requirements of design principles and architectural patterns. The conducted experiments demonstrated the stable quality of generated documentation and code when using a structured architectural context in the form of architectural diagrams. The results confirm that the proposed unified architecture metamodel can serve as an effective interface between humans and models, improving the accuracy, stability, and repeatability of LLM generation. However, the selected set of architectural diagrams should be optimised to avoid redundancy between some diagrams, and some diagrams should be updated to represent extra contextual orchestration. This work demonstrates measurable improvements for a new generation of intelligent tools that automate the SDLC and enable a comprehensive architecture compatible with AI-driven development.

[AI-58] Neural-Assisted in-Motion Self-Heading Alignment

[Quick Read]: This paper addresses the low accuracy and long alignment time of initial heading estimation for autonomous marine platforms at the start of a mission. Traditional model-based approaches such as dual vector decomposition and optimized attitude decomposition achieve satisfactory accuracy only after long alignment times, limiting deployment efficiency and navigation accuracy. The key to the solution is an end-to-end, model-free, neural-assisted framework that trains and infers directly on the same inputs used by the model-based approaches, improving average absolute heading error by 53% and reducing alignment time by up to 67% on a real-world dataset captured by an autonomous surface vehicle.

Link: https://arxiv.org/abs/2604.00168
Authors: Zeev Yampolsky, Felipe O. Silva, Adriano Frutuoso, Itzik Klein
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 Pages, 10 Figures, 6 Tables

Click to view abstract

Abstract:Autonomous platforms operating in the oceans require accurate navigation to successfully complete their mission. In this regard, the initial heading estimation accuracy and the time required to achieve it play a critical role. The initial heading is traditionally estimated by model-based approaches employing orientation decomposition. However, methods such as the dual vector decomposition and optimized attitude decomposition achieve satisfactory heading accuracy only after long alignment times. To allow rapid and accurate initial heading estimation, we propose an end-to-end, model-free, neural-assisted framework using the same inputs as the model-based approaches. Our proposed approach was trained and evaluated on real-world dataset captured by an autonomous surface vehicle. Our approach shows a significant accuracy improvement over the model-based approaches achieving an average absolute error improvement of 53%. Additionally, our proposed approach was able to reduce the alignment time by up to 67%. Thus, by employing our proposed approach, the reduction in alignment time and improved accuracy allow for a shorter deployment time of an autonomous platform and increased navigation accuracy during the mission.

[AI-59] A Study on the Impact of Fault localization Granularity for Repository-Scale Code Repair Tasks

[Quick Read]: This paper investigates how fault-localization granularity affects repair performance in repository-level automatic program repair. Existing work often separates localization from repair, but has not systematically compared granularities (function-, file-, and line-level) under the assumption of perfect localization, especially at repository scale. The key to the solution is a framework that modifies the localization phase of the Agentless framework to inject ground-truth localization data as context into the repair prompt, isolating the effect of granularity from localization accuracy. On SWE-Bench-Mini, function-level granularity yields the highest overall repair rate, though a closer look suggests the ideal granularity may be task dependent; the study is offered as a proof of concept and baseline for future work rather than a new state of the art.

Link: https://arxiv.org/abs/2604.00167
Authors: Joseph Townsend, Chandresh Pravin, Kwun Ho Ngan, Matthieu Parizy
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Automatic program repair can be a challenging task, especially when resolving complex issues at a repository-level, which often involves issue reproduction, fault localization, code repair, testing and validation. Issues of this scale can be commonly found in popular GitHub repositories or datasets that are derived from them. Some repository-level approaches separate localization and repair into distinct phases. Where this is the case, the fault localization approaches vary in terms of the granularity of localization. Where the impact of granularity is explored to some degree for smaller datasets, not all isolate this issue from the separate question of localization accuracy by testing code repair under the assumption of perfect fault localization. To the best of the authors’ knowledge, no repository-scale studies have explicitly investigated granularity under this assumption, nor conducted a systematic empirical comparison of granularity levels in isolation. We propose a framework for performing such tests by modifying the localization phase of the Agentless framework to retrieve ground-truth localization data and include this as context in the prompt fed to the repair phase. We show that under this configuration and as a generalization over the SWE-Bench-Mini dataset, function-level granularity yields the highest repair rate against line-level and file-level. However, a deeper dive suggests that the ideal granularity may in fact be task dependent. This study is not intended to improve on the state-of-the-art, nor do we intend for results to be compared against any complete agentic frameworks. Rather, we present a proof of concept for investigating how fault localization may impact automatic code repair in repository-scale scenarios. We present preliminary findings to this end and encourage further research into this relationship between the two phases. 

[AI-60] Epileptic Seizure Detection in Separate Frequency Bands Using Feature Analysis and Graph Convolutional Neural Network (GCN) from Electroencephalogram (EEG) Signals

[Quick Read]: This paper addresses the lack of interpretability and neurophysiological relevance in deep-learning models for epileptic seizure detection. Broadband EEG approaches achieve high accuracy but reveal little about how different frequency components contribute to seizures. The key to the solution is a frequency-aware framework that decomposes raw EEG signals into five bands (delta, theta, alpha, lower beta, and higher beta), extracts eleven discriminative features per band, and models spatial dependencies among electrodes with a graph convolutional neural network (GCN). On the CHB-MIT dataset the method reaches 99.01% overall broadband accuracy and shows that the mid-frequency bands carry the strongest discriminative power, improving both diagnostic precision and interpretability.

Link: https://arxiv.org/abs/2604.00163
Authors: Ferdaus Anam Jibon, Fazlul Hasan Siddiqui, F. Deeba, Gahangir Hossain
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Epileptic seizures are neurological disorders characterized by abnormal and excessive electrical activity in the brain, resulting in recurrent seizure events. Electroencephalogram (EEG) signals are widely used for seizure diagnosis due to their ability to capture temporal and spatial neural dynamics. While recent deep learning methods have achieved high detection accuracy, they often lack interpretability and neurophysiological relevance. This study presents a frequency-aware framework for epileptic seizure detection based on ictal-phase EEG analysis. The raw EEG signals are decomposed into five frequency bands (delta, theta, alpha, lower beta, and higher beta), and eleven discriminative features are extracted from each band. A graph convolutional neural network (GCN) is then employed to model spatial dependencies among EEG electrodes, represented as graph nodes. Experiments on the CHB-MIT scalp EEG dataset demonstrate high detection performance, achieving accuracies of 97.1%, 97.13%, 99.5%, 99.7%, and 51.4% across the respective frequency bands, with an overall broadband accuracy of 99.01%. The results highlight the strong discriminative capability of mid-frequency bands and reveal frequency-specific seizure patterns. The proposed approach improves interpretability and diagnostic precision compared to conventional broadband EEG-based methods.

[AI-61] Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents

[Quick Read]: This paper addresses the reliability bottleneck of tool-integrated large language models (LLMs), arguing that failures stem from both tool-use accuracy (how well an agent invokes a tool) and intrinsic tool accuracy (the tool's own correctness), while most prior work emphasizes only the former. The key to the solution is OpenTools, a community-driven toolbox that standardizes tool schemas, provides lightweight plug-and-play wrappers, and evaluates tools with automated test suites and continuous monitoring; a contribution protocol and a public web demo let reliability reports evolve with community feedback. Experiments show that higher-quality, community-contributed tools deliver 6%-22% relative gains over an existing toolbox across multiple agent architectures, underscoring the central role of intrinsic tool accuracy.

Link: https://arxiv.org/abs/2604.00137
Authors: Hy Dang, Quang Dao, Meng Jiang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:Tool-integrated LLMs can retrieve, compute, and take real-world actions via external tools, but reliability remains a key bottleneck. We argue that failures stem from both tool-use accuracy (how well an agent invokes a tool) and intrinsic tool accuracy (the tool’s own correctness), while most prior work emphasizes the former. We introduce OpenTools, a community-driven toolbox that standardizes tool schemas, provides lightweight plug-and-play wrappers, and evaluates tools with automated test suites and continuous monitoring. We also release a public web demo where users can run predefined agents and tools and contribute test cases, enabling reliability reports to evolve as tools change. OpenTools includes the core framework, an initial tool set, evaluation pipelines, and a contribution protocol. Experiments and evaluations show improved end-to-end reproducibility and task performance; community-contributed, higher-quality task-specific tools deliver 6%-22% relative gains over an existing toolbox across multiple agent architectures on downstream tasks and benchmarks, highlighting the importance of intrinsic tool accuracy.

[AI-62] From Domain Understanding to Design Readiness: a playbook for GenAI-supported learning in Software Engineering

[Quick Read]: This paper addresses the challenge of rapidly upskilling software-engineering students in supporting knowledge areas (such as domain understanding and modeling methods) within a short time, and specifically how generative AI can be used effectively to improve learning efficiency and outcomes. The key to the solution is a teaching intervention built around a customized ChatGPT (GPT-3.5) tutor grounded in a curated course knowledge base, combined with structured prompt configuration, workflow design, and a five-dimension quality rubric (accuracy, relevance, pedagogical value, cognitive load, supportiveness) to improve the accuracy, relevance, and pedagogical value of interactions. Students reported large gains in self-efficacy, especially for cryptocurrency-finance basics and the application of Domain-Driven Design (DDD), and the experience is distilled into seventeen concrete teaching practices.

Link: https://arxiv.org/abs/2604.00120
Authors: Rafal Wlodarski
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Software engineering courses often require rapid upskilling in supporting knowledge areas such as domain understanding and modeling methods. We report an experience from a two-week milestone in a master’s course where 29 students used a customized ChatGPT (GPT-3.5) tutor grounded in a curated course knowledge base to learn cryptocurrency-finance basics and Domain-Driven Design (DDD). We logged all interactions and evaluated a 34.5% random sample of prompt-answer pairs (60/~174) with a five-dimension rubric (accuracy, relevance, pedagogical value, cognitive load, supportiveness), and we collected pre/post self-efficacy. Responses were consistently accurate and relevant in this setting: accuracy averaged 98.9% with no factual errors and only 2/60 minor inaccuracies, and relevance averaged 92.2%. Pedagogical value was high (89.4%) with generally appropriate cognitive load (82.78%), but supportiveness was low (37.78%). Students reported large pre-post self-efficacy gains for genAI-assisted domain learning and DDD application. From these observations we distill seventeen concrete teaching practices spanning prompt/configuration and course/workflow design (e.g., setting expected granularity, constraining verbosity, curating guardrail examples, adding small credit with a simple quality rubric). Within this single-course context, results suggest that genAI-supported learning can complement instruction in domain understanding and modeling tasks, while leaving room to improve tone and follow-up structure.

[AI-63] Beyond Symbolic Control: Societal Consequences of AI-Driven Workforce Displacement and the Imperative for Genuine Human Oversight Architectures

[Quick Read]: This paper addresses the gap between governance mechanisms and actual capability amid the structural transformation driven by the accelerating displacement of human labor by artificial intelligence (AI) and robotic systems. The core issue is the distinction between nominal human oversight, where humans merely hold formal decision authority, and genuine human oversight, where humans possess the cognitive access, technical capability, and institutional authority to meaningfully understand, evaluate, and override AI outputs. The paper argues that mainstream governance frameworks such as the EU AI Act and the NIST AI Risk Management Framework 1.0 overlook this distinction, leaving AI governance with a fundamental architectural failure mode. The key to the solution is a set of five architectural requirements for genuine human oversight, together with the observation that a governance window of roughly 10-15 years remains before current deployment trajectories risk path-dependent social, economic, and institutional lock-in.

Link: https://arxiv.org/abs/2604.00081
Authors: Richard J. Mitchell
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 23 pages, 23 references

Click to view abstract

Abstract:The accelerating displacement of human labor by artificial intelligence (AI) and robotic systems represents a structural transformation whose societal consequences extend far beyond conventional labor market analysis. This paper presents a systematic multi-domain examination of the likely effects on economic structure, psychological well-being, political stability, education, healthcare, and geopolitical order. We identify a critical and underexamined dimension of this transition: the governance gap between nominal human oversight of AI systems – where humans occupy positions of formal authority over AI decisions – and genuine human oversight, where those humans possess the cognitive access, technical capability, and institutional authority to meaningfully understand, evaluate, and override AI outputs. We argue that this distinction, largely absent from current governance frameworks including the EU AI Act and NIST AI Risk Management Framework 1.0, represents the primary architectural failure mode in deployed AI governance. The societal consequences of labor displacement intensify this problem by concentrating consequential AI decision-making among an increasingly narrow class of technical and capital actors. We propose five architectural requirements for genuine human oversight systems and characterize the governance window – estimated at 10-15 years – before current deployment trajectories risk path-dependent social, economic, and institutional lock-in.

[AI-64] Learning to Play Blackjack: A Curriculum Learning Perspective

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)代理在复杂环境中效率低、性能差的问题。其解决方案的关键在于提出一种新颖的框架,利用大语言模型(Large Language Model, LLM)动态生成动作课程(curriculum),使代理能够逐个引入和学习可用动作,从而实现更高效、更稳定的训练过程。通过在8副牌的黑杰克(Blackjack)模拟环境中验证,该方法显著提升了Tabular Q-Learning和深度Q网络(Deep Q-Network, DQN)代理的性能,例如DQN平均胜率从43.97%提升至47.41%,同时大幅缩短训练时间,证明LLM引导的课程设计可有效构建更鲁棒且高效的RL代理。

链接: https://arxiv.org/abs/2604.00076
作者: Amirreza Alasti,Efe Erdal,Yücel Celik,Theresa Eimer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as an oral presentation at the International Conference on Distributed Artificial Intelligence (DAI 2025). 16 pages, 7 figures

点击查看摘要

Abstract:Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent’s average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent’s full training completing faster than the baseline’s evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.
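
上述"分阶段引入动作"的课程式训练思路,可用如下极简示意说明。注意:论文中课程由 LLM 动态生成、环境为 8 副牌 Blackjack;这里用硬编码的课程表和一个 5 状态链式玩具 MDP 代替,动作名与环境动态均为假设,仅用于演示机制本身:

```python
import random

# 玩具示意:动作分阶段解锁的表格型 Q-learning。
# 论文中课程由 LLM 生成、环境为 8 副牌 Blackjack;
# 这里用硬编码课程表与一个 5 状态链式 MDP 代替(均为假设)。
N_STATES = 5
ACTIONS = ["stand", "hit", "double"]                 # 动作名为示意
CURRICULUM = [["stand"], ["stand", "hit"], ACTIONS]  # 三个阶段逐步解锁动作

def step(state, action):
    """stand 立即按当前状态"套现"奖励;hit/double 分别前进 1/2 步。"""
    if action == "stand":
        return state, state / (N_STATES - 1), True
    if action == "hit":
        return min(state + 1, N_STATES - 1), 0.0, False
    return min(state + 2, N_STATES - 1), 0.0, False  # "double"

def train(episodes_per_stage=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for allowed in CURRICULUM:            # 每个阶段只允许课程给出的动作子集
        for _ in range(episodes_per_stage):
            s, done = 0, False
            while not done:
                a = (rng.choice(allowed) if rng.random() < eps
                     else max(allowed, key=lambda b: q[(s, b)]))
                s2, r, done = step(s, a)
                target = r if done else r + gamma * max(q[(s2, b)] for b in allowed)
                q[(s, a)] += alpha * (target - q[(s, a)])
                s = s2
    return q

q = train()
policy = {s: max(ACTIONS, key=lambda b: q[(s, b)]) for s in range(N_STATES)}
```

训练完成后,贪心策略应学会先推进状态、在末端状态选择 stand 以获得最大回报。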

[AI-65] Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统在持续自我改进过程中如何维持可靠安全监督的问题,特别是针对基于分类器的安全门(classifier-based safety gates)是否能在高维空间中保持有效性。研究发现,无论使用何种分类模型(如 MLP、SVM、随机森林等)或现有安全强化学习方法(如 CPO、Lyapunov 控制、安全屏蔽),均无法满足安全自提升的双重条件,且在 MuJoCo 基准测试中也失败,表明这是结构性不可能问题,而非训练不足所致。其关键解决方案是引入 Lipschitz 球验证机制(Lipschitz ball verifier),通过可证明的解析边界实现零误接受(delta=0),并利用球链技术(ball chaining)支持无界参数空间遍历,在不引发安全违规的前提下显著提升性能(如 MuJoCo Reacher-v4 上 +4.31 奖励改善,LLM LoRA 微调中 234 倍半径跨越)。该方法在高维场景下具有可扩展性,并可通过分组组合验证进一步扩大有效半径。

链接: https://arxiv.org/abs/2604.00072
作者: Arsenios Scrivens
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 21 pages, 9 figures. Companion theory paper: doi: https://doi.org/10.5281/zenodo.19237451

点击查看摘要

Abstract:Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations – spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks – all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail. Results extend to MuJoCo benchmarks (Reacher-v4 d=496, Swimmer-v4 d=1408, HalfCheetah-v4 d=1824). At controlled distribution separations up to delta_s=2.0, all classifiers still fail – including the NP-optimal test and MLPs with 100% training accuracy – demonstrating structural impossibility. We then show the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts across dimensions d ∈ {84, 240, 768, 2688, 5760, 9984, 17408} using provable analytical bounds (unconditional delta=0). Ball chaining enables unbounded parameter-space traversal: on MuJoCo Reacher-v4, 10 chains yield +4.31 reward improvement with delta=0; on Qwen2.5-7B-Instruct during LoRA fine-tuning, 42 chain transitions traverse 234x the single-ball radius with zero safety violations across 200 steps. A 50-prompt oracle confirms oracle-agnosticity. Compositional per-group verification enables radii up to 37x larger than full-network balls. At d=17408, delta=0 is unconditional; at LLM scale, conditional on estimated Lipschitz constants.
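
摘要中"Lipschitz 球验证 + 球链"的核心思想可以用如下玩具示意说明(安全评分函数、Lipschitz 常数与阈值均为假设,并非论文中的具体构造):若安全评分 S 关于参数是 L-Lipschitz 的,且当前参数的安全裕度为 m = S(θ) − τ > 0,则任何 ‖Δθ‖ ≤ m/L 的更新都可解析地证明保持 S ≥ τ;每接受一步就从新参数重新认证,即可"链式"走出远超单球半径的距离:

```python
import math

# Lipschitz 球验证器示意(非论文的精确构造)。
# 假设安全评分 S(theta) 关于参数 L-Lipschitz,阈值为 tau:
# 若 S(theta0) - tau = m > 0,则 ||theta - theta0|| <= m / L 的
# 一切参数都可被解析证明满足 S >= tau。
L_CONST = 2.0      # 假设的 Lipschitz 常数(可保守高估)
TAU = 0.5          # 安全阈值

def safety_score(theta):
    # 玩具评分函数,梯度范数为 0.5,代替真实的安全审计
    return 1.0 - 0.5 * math.sqrt(sum(t * t for t in theta))

def certified_radius(theta):
    margin = safety_score(theta) - TAU
    return max(margin, 0.0) / L_CONST   # 当前可证明安全的球半径

def verify_step(theta, delta):
    """仅当更新落在当前认证球内时才接受。"""
    step = math.sqrt(sum(d * d for d in delta))
    return step <= certified_radius(theta)

def chain(theta, deltas):
    """球链:每接受一步就从新参数重新认证,
    因此累计行程可以超过任意单球半径。"""
    accepted = 0
    for d in deltas:
        if verify_step(theta, d):
            theta = [t + x for t, x in zip(theta, d)]
            accepted += 1
    return theta, accepted

theta0 = [0.0, 0.0]
r0 = certified_radius(theta0)            # 此处为 0.25
steps = [[0.1, 0.0]] * 5                 # 每步都落在各自的局部球内
theta_end, n_ok = chain(theta0, steps)
```

五步全部被接受,总行程 0.5 超过初始球半径 0.25,且终点仍满足安全阈值——这正是"零误接受下的无界遍历"所指的机制。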

[AI-66] Perspective: Towards sustainable exploration of chemical spaces with machine learning

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在分子与材料科学中应用时所面临的资源消耗巨大、可持续性不足的问题,特别是在从量子力学(Quantum-Mechanical, QM)数据生成、模型训练到自动化自驱动研究流程的全链条中,如何平衡计算效率与科学可靠性。其解决方案的关键在于构建多层次、高效率的AI驱动发现体系:一是通过通用机器学习(General-Purpose Machine Learning, ML)模型、多保真度(Multi-Fidelity)方法、模型蒸馏(Model Distillation)和主动学习(Active Learning)等策略提升计算效率;二是引入物理约束的分层工作流,将高精度QM方法仅用于关键环节,而广泛使用快速ML代理模型以降低整体资源开销;三是强调合成可行性与多目标优化设计准则的融合,确保算法输出具备实际可实施性;四是推动开放数据、可复用工作流和领域专用AI系统的发展,从而实现单位计算资源下的最大科学价值。

链接: https://arxiv.org/abs/2604.00069
作者: Leonardo Medrano Sandonas,David Balcells,Anton Bochkarev,Jacqueline M. Cole,Volker L. Deringer,Werner Dobrautz,Adrian Ehrenhofer,Thorben Frank,Pascal Friederich,Rico Friedrich,Janine George,Luca Ghiringhelli,Alejandra Hinostroza Caldas,Veronika Juraskova,Hannes Kneiding,Yury Lysogorskiy,Johannes T. Margraf,Hanna Türk,Anatole von Lilienfeld,Milica Todorović,Alexandre Tkatchenko,Mariana Rossi,Gianaurelio Cuniberti
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 44 pages, 8 figures, SusML workshop

点击查看摘要

Abstract:Artificial intelligence is transforming molecular and materials science, but its growing computational and data demands raise critical sustainability challenges. In this Perspective, we examine resource considerations across the AI-driven discovery pipeline–from quantum-mechanical (QM) data generation and model training to automated, self-driving research workflows–building on discussions from the ``SusML workshop: Towards sustainable exploration of chemical spaces with machine learning’’ held in Dresden, Germany. In this context, the availability of large quantum datasets has enabled rigorous benchmarking and rapid methodological progress, while also incurring substantial energy and infrastructure costs. We highlight emerging strategies to enhance efficiency, including general-purpose machine learning (ML) models, multi-fidelity approaches, model distillation, and active learning. Moreover, incorporating physics-based constraints within hierarchical workflows, where fast ML surrogates are applied broadly and high-accuracy QM methods are used selectively, can further optimize resource use without compromising reliability. Equally important is bridging the gap between idealized computational predictions and real-world conditions by accounting for synthesizability and multi-objective design criteria, which is essential for practical impact. Finally, we argue that sustainable progress will rely on open data and models, reusable workflows, and domain-specific AI systems that maximize scientific value per unit of computation, enabling efficient and responsible discovery of technological materials and therapeutics.

[AI-67] Temporal Memory for Resource-Constrained Agents: Continual Learning via Stochastic Compress-Add-Smooth

【速读】:该论文旨在解决持续学习(continual learning)中记忆容量受限下的遗忘问题,即智能体在顺序执行任务时如何在固定记忆预算下保留历史经验而不发生灾难性遗忘。其核心挑战在于传统方法依赖参数更新或存储数据,易受参数干扰或存储开销限制。解决方案的关键在于提出一种基于随机过程的新型记忆框架:将记忆建模为定义在重放区间 [0,1] 上的桥扩散(Bridge Diffusion),其中终端边缘分布编码当前状态,中间边缘分布编码过去状态;通过“压缩-添加-平滑”(Compress–Add–Smooth, CAS)三步递归机制实现新经验的高效整合。该方法无需反向传播、不存储原始数据、不依赖神经网络,计算复杂度为每日 O(LKd^2) 次浮点运算,适用于轻量级控制器硬件。遗忘在此框架中源于时间上的有损压缩——即在固定分段数 L 的约束下,用粗粒度协议重新近似细粒度时间轨迹,从而实现了对遗忘机制、速率和形式的数学精确刻画。

链接: https://arxiv.org/abs/2604.00067
作者: Michael Chertkov
机构: 未知
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 33 pages, 22 figures

点击查看摘要

Abstract:An agent that operates sequentially must incorporate new experience without forgetting old experience, under a fixed memory budget. We propose a framework in which memory is not a parameter vector but a stochastic process: a Bridge Diffusion on a replay interval [0,1], whose terminal marginal encodes the present and whose intermediate marginals encode the past. New experience is incorporated via a three-step Compress–Add–Smooth (CAS) recursion. We test the framework on the class of models with marginal probability densities modeled via Gaussian mixtures of a fixed number of components K in d dimensions; temporal complexity is controlled by a fixed number L of piecewise-linear protocol segments whose nodes store Gaussian-mixture states. The entire recursion costs O(LKd^2) flops per day – no backpropagation, no stored data, no neural networks – making it viable for controller-light hardware. Forgetting in this framework arises not from parameter interference but from lossy temporal compression: the re-approximation of a finer protocol by a coarser one under a fixed segment budget. We find that the retention half-life scales linearly as a_{1/2} ≈ cL with a constant c ≲ 1 that depends on the dynamics but not on the mixture complexity K, the dimension d, or the geometry of the target family. The constant c admits an information-theoretic interpretation analogous to the Shannon channel capacity. The stochastic process underlying the bridge provides temporally coherent "movie" replay – compressed narratives of the agent's history, demonstrated visually on an MNIST latent-space illustration. The framework provides a fully analytical "Ising model" of continual learning in which the mechanism, rate, and form of forgetting can be studied with mathematical precision.
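
摘要中"遗忘源于固定分段预算下的有损时间压缩"这一机制,可以用一个与论文的高斯混合桥无关的一维玩具示意:把记忆看作重放区间 [0,1] 上、节点数固定的分段线性曲线,每天把新状态追加到 t=1 并在同一预算下重新采样,旧事件的细节便被逐渐抹平(以下全部为示意性简化):

```python
# "有损时间压缩导致遗忘"的一维玩具示意(并非论文的高斯混合桥):
# 记忆是重放区间 [0, 1] 上、固定 L 段的分段线性曲线;
# 每天追加新值并在同一预算下重采样,旧细节被逐渐平滑掉。
L_SEGMENTS = 4

def interp(nodes, t):
    """对 [(t_i, v_i), ...] 给出的分段线性曲线在 t 处取值。"""
    for (t0, v0), (t1, v1) in zip(nodes, nodes[1:]):
        if t0 <= t <= t1:
            w = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
            return v0 + w * (v1 - v0)
    return nodes[-1][1]

def compress_add(nodes, new_value):
    """CAS 式的一步:把今天的值追加到 t=1,旧时间轴整体压缩,
    再重采样回 L_SEGMENTS + 1 个均匀节点(有损压缩发生处)。"""
    n = len(nodes)
    shifted = [(t * n / (n + 1), v) for t, v in nodes] + [(1.0, new_value)]
    grid = [i / L_SEGMENTS for i in range(L_SEGMENTS + 1)]
    return [(t, interp(shifted, t)) for t in grid]

# 从平坦记忆开始,先输入一个"显著事件",再输入八个平静日
memory = [(i / L_SEGMENTS, 0.0) for i in range(L_SEGMENTS + 1)]
memory = compress_add(memory, 10.0)       # 显著事件
for _ in range(8):
    memory = compress_add(memory, 0.0)    # 八个平静日
recalled_spike = max(v for _, v in memory)  # 事件残留的"记忆强度"
```

随着平静日数增加,事件峰值沿重放轴向过去漂移并按几何速率衰减,对应摘要中"保留半衰期与分段数 L 线性相关"的定性行为。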

[AI-68] Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning ICME2026

【速读】:该论文旨在解决现有自动足球解说生成方法在真实直播场景中表现不足的问题,具体表现为生成内容缺乏对球员和球队等实体的准确指代、存在上下文依赖性错误以及缺少比赛事件的统计学洞察。其解决方案的关键在于提出一个两阶段模型 GameSight,将足球解说生成任务建模为知识增强的视觉推理问题:第一阶段通过视觉推理实现匿名实体与细粒度视觉及上下文信息的对齐,第二阶段结合外部历史统计数据和迭代更新的内部比赛状态信息对解说进行知识增强,从而提升实体识别准确性与解说内容的上下文相关性及结构合理性。

链接: https://arxiv.org/abs/2604.00057
作者: Zeyu Jin,Xiaoyu Qin,Songtao Zhou,Kaifeng Yun,Jia Jia
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: Accepted by ICME 2026

点击查看摘要

Abstract:Soccer commentary plays a crucial role in enhancing the soccer game viewing experience for audiences. Previous studies in automatic soccer commentary generation typically adopt an end-to-end method to generate anonymous live text commentary. Such generated commentary is insufficient in the context of real-world live televised commentary, as it contains anonymous entities, context-dependent errors and lacks statistical insights of the game events. To bridge the gap, we propose GameSight, a two-stage model to address soccer commentary generation as a knowledge-enhanced visual reasoning task, enabling live-televised-like knowledgeable commentary with accurate reference to entities (players and teams). GameSight starts by performing visual reasoning to align anonymous entities with fine-grained visual and contextual analysis. Subsequently, the entity-aligned commentary is refined with knowledge by incorporating external historical statistics and iteratively updated internal game state information. Consequently, GameSight improves the player alignment accuracy by 18.5% on SN-Caption-test-align dataset compared to Gemini 2.5-pro. Combined with further knowledge enhancement, GameSight outperforms in segment-level accuracy and commentary quality, as well as game-level contextual relevance and structural composition. We believe that our work paves the way for a more informative and engaging human-centric experience with the AI sports application. Demo Page: this https URL

[AI-69] The Energy Footprint of LLM-Based Environmental Analysis: LLMs and Domain Products

【速读】:该论文旨在解决领域特定检索增强生成(Retrieval-Augmented Generation, RAG)系统在气候分析等专业场景中推理阶段能耗问题,特别是其与通用大语言模型(Large Language Models, LLMs)直接调用相比的能效差异。解决方案的关键在于对两个气候领域专用聊天机器人(ChatNetZero 和 ChatNDC)的工作流进行细粒度分解,分别量化检索、生成和幻觉检查等组件的能耗,并结合实际用户查询、不同时段及地理访问位置进行多维度测试,从而揭示RAG系统设计对能源消耗和输出质量的非线性影响。

链接: https://arxiv.org/abs/2604.00053
作者: Alicia Bao,Jiamian He,Angela Hsu,Diego Manya, Ji (James) Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly used in domain-specific applications, including climate change and environmental research, understanding their energy footprint has become an important concern. The growing adoption of retrieval-augmented (RAG) systems for climate-domain specific analysis raises a key question: how does the energy consumption of domain-specific RAG workflows compare with that of direct generic LLM usage? Prior research has focused on standalone model calls or coarse token-based estimates, while leaving the energy implications of deployed application workflows insufficiently understood. In this paper, we assess the inference-time energy consumption of two LLM-based climate analysis chatbots (ChatNetZero and ChatNDC) compared to the generic GPT-4o-mini model. We estimate energy use under actual user queries by decomposing each workflow into retrieval, generation, and hallucination-checking components. We also test across different times of day and geographic access locations. Our results show that the energy consumption of domain-specific RAG systems depends strongly on their design. More agentic pipelines substantially increase inference-time energy use, particularly when used for additional accuracy or verification checks, although they may not yield proportional gains in response quality. While more research is needed to further test these initial findings more robustly across models, environments and prompting structures, this study provides a new understanding on how the design of domain-specific LLM products affects both the energy footprint and quality of output.

[AI-70] Task-Centric Personalized Federated Fine-Tuning of Language Models

【速读】:该论文旨在解决个性化联邦学习(Personalized Federated Learning, pFL)在面对两个关键挑战时的性能瓶颈:一是泛化能力不足,即客户端在遇到未见过的任务或数据分布变化时模型表现下降;二是客户端内任务干扰,即单个客户端的数据包含多种分布,导致本地训练过程中不同任务间相互干扰。解决方案的关键在于提出FedRouter,一种基于聚类的pFL框架,其核心创新是将个性化建模从“按客户端”转向“按任务”,通过两种聚类机制实现:(1)局部聚类将适配器(adapter)与具体任务样本关联,(2)全局聚类将来自不同客户端的相似适配器聚合以构建以任务为中心的个性化模型。此外,引入评估路由机制,在推理阶段根据已建立的聚类结果动态选择最优适配器,从而显著提升模型在任务干扰和跨任务泛化场景下的鲁棒性与性能。

链接: https://arxiv.org/abs/2604.00050
作者: Gabriel U. Talasso,Meghdad Kurmanji,Allan M. de Souza,Nicholas D. Lane,Leandro A. Villas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a promising technique for training language models on distributed and private datasets of diverse tasks. However, aggregating models trained on heterogeneous tasks often degrades the overall performance of individual clients. To address this issue, Personalized FL (pFL) aims to create models tailored for each client's data distribution. Although these approaches improve local performance, they usually lack robustness in two aspects: (i) generalization: when clients must make predictions on unseen tasks, or face changes in their data distributions, and (ii) intra-client task interference: when a single client's data contains multiple distributions that may interfere with each other during local training. To tackle these two challenges, we propose FedRouter, a clustering-based pFL that builds specialized models for each task rather than for each client. FedRouter uses adapters to personalize models by employing two clustering mechanisms to associate adapters with specific tasks: a local clustering that associates adapters with task data samples, and a global one that associates similar adapters from different clients to construct task-centric personalized models. Additionally, we propose an evaluation router mechanism that routes test samples to the best adapter based on the created clusters. In experiments comparing our method with existing approaches on a multitask dataset, FedRouter demonstrates strong resilience in these challenging scenarios, performing up to 6.1% relatively better under task interference and up to 136% relative improvement under generalization evaluation.
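
上面"本地把适配器关联到任务、全局合并相似适配器、测试时按聚类路由"的两级机制,可用如下极简示意(把适配器简化为向量、任务用特征质心表示;合并阈值与贪心合并策略均为我们的假设,并非论文算法):

```python
import math

# FedRouter 两级聚类思路的玩具示意(非论文实现):
# 本地:每个适配器关联一个任务质心;全局:贪心合并质心相近的适配器;
# 测试:路由到最近任务质心对应的(合并后)适配器。

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_similar(adapters, centroids, thresh=1.0):
    """全局步骤:贪心合并任务质心距离小于阈值的适配器。"""
    merged = []   # 每项为 [适配器均值, 质心均值, 成员数]
    for a, c in zip(adapters, centroids):
        for g in merged:
            if dist(g[1], c) < thresh:
                n = g[2]
                g[0][:] = [(x * n + y) / (n + 1) for x, y in zip(g[0], a)]
                g[1][:] = [(x * n + y) / (n + 1) for x, y in zip(g[1], c)]
                g[2] += 1
                break
        else:
            merged.append([list(a), list(c), 1])
    return merged

def route(sample, merged):
    """评估路由:把测试样本送到最近任务质心对应的适配器。"""
    return min(merged, key=lambda g: dist(g[1], sample))[0]

# 两个客户端,各持有任务 A(原点附近)与任务 B((5,5) 附近)的适配器;
# 全局合并应把四个适配器归并成两个任务组。
adapters  = [[1.0, 0.0], [0.0, 1.0], [1.1, 0.1], [0.1, 1.1]]
centroids = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9]]
groups = merge_similar(adapters, centroids)
chosen = route([4.8, 5.2], groups)   # 一个"任务 B"风格的测试样本
```

测试样本被正确路由到任务 B 组合并后的适配器,而不是其来源客户端的本地适配器——这正是"按任务而非按客户端个性化"的含义。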

[AI-71] DriftScript: A Domain-Specific Language for Programming Non-Axiomatic Reasoning Agents

【速读】:该论文旨在解决非公理推理系统(Non-Axiomatic Reasoning Systems, NARS)中标准输入语言Narsese因符号密度高、标点重载和隐式约定导致的可读性差、编写与维护困难的问题。解决方案的关键在于提出并实现DriftScript——一种类Lisp的领域特定语言(Domain-Specific Language, DSL),其通过关键字驱动的S表达式替代Narsese的符号语法,支持NARS逻辑层级1至8的主要句法和术语形式(如继承、时序蕴含、变量量化等),并通过一个四阶段无依赖编译器将其转换为Narsese。该编译器结构清晰、代码精简(C99实现,共1941行),并与DriftNARS引擎集成,支持外部系统通过四种结构化回调和HTTP操作注册表进行交互,从而构建感知-推理-行动闭环,提升自主代理的开发效率与实用性。

链接: https://arxiv.org/abs/2604.00043
作者: Seamus Brady
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Non-Axiomatic Reasoning Systems (NARS) provide a framework for building adaptive agents that operate under insufficient knowledge and resources. However, the standard input language, Narsese, poses a usability barrier: its dense symbolic notation, overloaded punctuation, and implicit conventions make programs difficult to read, write, and maintain. We present DriftScript, a Lisp-like domain-specific language that compiles to Narsese. DriftScript provides source-level constructs covering the major sentence and term forms used in Non-Axiomatic Logic (NAL) levels 1 through 8, including inheritance, temporal implication, variable quantification, sequential conjunction, and operation invocation, while replacing symbolic syntax with readable keyword-based S-expressions. The compiler is a zero-dependency, four-stage pipeline implemented in 1,941 lines of C99. When used with the DriftNARS engine, DriftScript programs connect to external systems through four structured callback types and an HTTP operation registry, enabling a sense-reason-act loop for autonomous agents. We describe the language design and formal grammar, detail the compiler architecture, and evaluate the compiler through a 106-case test suite, equivalence testing against hand-written Narsese, a NAL coverage analysis, structural readability metrics, and compilation benchmarks. The source code is available at this https URL. This paper focuses on the design and implementation of the DriftScript language and its embedding into DriftNARS, rather than on new inference algorithms for NARS itself.
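
"关键字 S 表达式编译为 Narsese"这一流水线可用如下玩具翻译器示意。必须强调:下面的表面语法(inherits/implies/similar 等关键字)是为演示而虚构的,DriftScript 的真实文法以论文为准;Narsese 侧的 `<A --> B>.`、`<P ==> Q>.` 记号则是 NAL 的标准写法:

```python
# 玩具版"关键字 S 表达式 -> Narsese"翻译器,示意论文流水线的精神。
# 注意:此处的表面语法为虚构,DriftScript 真实文法见论文。

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # 丢弃 ")"
        return node
    return tok

def emit_term(node):
    """把解析树的一个节点编译为 Narsese 项(不带句末标点)。"""
    if isinstance(node, str):
        return node
    head = node[0]
    ops = {"inherits": "-->", "implies": "==>", "similar": "<->"}
    if head in ops and len(node) == 3:
        return "<%s %s %s>" % (emit_term(node[1]), ops[head], emit_term(node[2]))
    raise ValueError("unknown form: %r" % node)

def compile_stmt(src):
    """编译一条语句:顶层项加判断句标点 "."。"""
    return emit_term(parse(tokenize(src))) + "."

out = compile_stmt("(inherits bird animal)")
```

嵌套形式同样可以递归编译,例如 `(implies (inherits tweety bird) (inherits tweety animal))` 会得到 `<<tweety --> bird> ==> <tweety --> animal>>.`。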

[AI-72] Quantifying Gender Bias in Large Language Models: When ChatGPT Becomes a Hiring Manager

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在招聘决策中可能延续并放大社会性别偏见的问题。其核心发现表明,尽管LLM对女性候选人的录用倾向更高且认为其资质更优,但仍倾向于给予较低薪酬,显示出隐性性别歧视。解决方案的关键在于通过提示工程(prompt engineering)来缓解此类偏见,从而实现更公平的自动化招聘流程。

链接: https://arxiv.org/abs/2604.00011
作者: Nina Gerszberg,Janka Hamori,Andrew Lo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing prominence of large language models (LLMs) in daily life has heightened concerns that LLMs exhibit many of the same gender-related biases as their creators. In the context of hiring decisions, we quantify the degree to which LLMs perpetuate societal biases and investigate prompt engineering as a bias mitigation technique. Our findings suggest that for a given resumé, an LLM is more likely to hire a female candidate and perceive them as more qualified, but still recommends lower pay relative to male candidates.
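
此类研究隐含的"配对简历审计"协议(固定简历内容、仅替换带性别色彩的姓名、比较录用决定与建议薪酬)可以示意如下。其中 `query_model` 是一个带内置偏差的桩函数,仅用于让审计流程跑通,并非真实 LLM 调用;姓名、薪酬数字均为虚构:

```python
# 配对简历偏差审计的最小示意(我们的简化框架,非论文的精确设置):
# 简历固定,只换姓名,比较录用率与建议薪酬。
# query_model 是带内置偏差的桩函数,代替真实 LLM 调用。

def query_model(prompt):
    # 桩:对 "Emily" 略微偏向录用但给出更低薪酬,供审计检出
    if "Emily" in prompt:
        return {"hire": True, "salary": 90_000}
    return {"hire": True, "salary": 95_000}

RESUME = "10 years experience, MSc, led a team of 5."
NAMES = [("Emily", "female"), ("Michael", "male")]

def audit(resume, names, n_trials=20):
    totals = {g: {"hired": 0, "salary": 0} for _, g in names}
    for _ in range(n_trials):
        for name, gender in names:
            out = query_model("Candidate %s. %s Hire? Salary?" % (name, resume))
            totals[gender]["hired"] += out["hire"]
            totals[gender]["salary"] += out["salary"]
    for g in totals:
        totals[g]["salary"] /= n_trials   # 平均建议薪酬
        totals[g]["hired"] /= n_trials    # 录用率
    return totals

result = audit(RESUME, NAMES)
pay_gap = result["male"]["salary"] - result["female"]["salary"]
```

真实审计中还需对多份简历、多组姓名与多次采样做统计检验;此处仅演示"控制变量 + 配对比较"这一核心设计。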

[AI-73] Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications

【速读】:该论文旨在解决传统机器学习方法在经济计量建模中难以融合先验理论知识与高维数据驱动估计的问题,尤其在结构参数识别、泛化能力及分布偏移下的稳定性方面存在局限。其解决方案的关键在于提出结构知识引导的神经网络(Structured-Knowledge-Informed Neural Networks, SKINNs),通过将理论、模拟、历史经验或跨领域知识以可微约束的形式嵌入神经网络函数逼近过程中,实现对神经网络参数与经济上具有解释意义的结构参数的联合估计。SKINNs 在单一优化问题中强制执行理论一致性,不仅作用于观测数据,还通过配点法(collocation)扩展至更广输入域,从而统一了函数型广义矩估计(functional GMM)、贝叶斯更新、迁移学习、物理信息神经网络(PINNs)和代理建模等方法,并保证了估计量的一致性、渐近正态性及根N收敛速度。

链接: https://arxiv.org/abs/2604.00987
作者: Yi Cao,Zexun Chen,Lin William Cong,Heqing Shi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency not only on observed data but over a broader input domain through collocation, and therefore nesting approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal with root-N convergence, sandwich covariance, and recovery of pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.
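
SKINN"在同一优化问题中联合估计灵活函数与结构参数、并通过配点在数据之外的区域强制理论一致性"的思路,可以退化成一个线性最小二乘玩具:用多项式特征代替神经网络,假设理论模型为 g(x; θ) = θx(这一理论形式与所有数值均为示意,非论文中的期权定价应用):

```python
import numpy as np

# SKINN 思想的线性化极简示意(多项式特征代替神经网络):
# 在一个最小二乘问题中联合求解灵活权重 w 与结构参数 theta,
# 理论 g(x; theta) = theta * x 在比数据更宽的配点网格上被强制执行。

def phi(x):                      # "灵活模型"的特征:多项式基
    return np.stack([np.ones_like(x), x, x ** 2], axis=1)

x_data = np.array([0.0, 1.0, 2.0])
y_data = 2.0 * x_data            # 数据暗中服从 theta = 2 的理论
x_coll = np.linspace(0.0, 5.0, 11)   # 配点网格,范围比数据更宽
lam = 10.0                       # 理论一致性项的权重

# 未知量 u = (w0, w1, w2, theta)。数据行:phi(x) @ w ~ y。
A_data = np.hstack([phi(x_data), np.zeros((len(x_data), 1))])
# 配点行:phi(xc) @ w - theta * xc ~ 0,乘以 sqrt(lam) 加权。
A_coll = np.sqrt(lam) * np.hstack([phi(x_coll), -x_coll[:, None]])
A = np.vstack([A_data, A_coll])
b = np.concatenate([y_data, np.zeros(len(x_coll))])

u, *_ = np.linalg.lstsq(A, b, rcond=None)
w, theta = u[:3], u[3]
f = lambda x: phi(np.atleast_1d(x)) @ w   # 拟合得到的灵活模型
```

联合求解同时恢复了可解释的结构参数 θ ≈ 2,并让灵活模型在数据范围之外(如 x = 4)仍与理论一致——对应摘要中"理论约束改善分布偏移下外推"的机制。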

[AI-74] Procela: Epistemic Governance in Mechanistic Simulations Under Structural Uncertainty

【速读】:该论文旨在解决传统机制模拟(mechanistic simulation)在面对结构不确定性时的局限性问题,尤其是在因果结构存在争议或无法识别的情境下(如抗微生物耐药性(AMR)传播中接触、环境与选择等不同本体论竞争的情形)。传统方法假设变量、因果关系和决策规则是静态固定的,无法适应动态变化的现实复杂性。解决方案的关键在于提出 Procela 框架:其核心创新是将变量视为具有完整假设记忆的“认识论权威”(epistemic authorities),机制以因果单元形式编码多个竞争本体论(ontologies),并通过治理模块实时观测认识信号并动态变异系统拓扑结构,从而实现对模拟假设自身的测试与演化。这一设计使模拟不仅建模世界,也建模自身的建模过程,显著提升了在结构性不确定性下的适应能力。

链接: https://arxiv.org/abs/2604.00675
作者: Kinson Vernet
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Mechanistic simulations typically assume fixed ontologies: variables, causal relationships, and resolution policies are static. This assumption fails when the true causal structure is contested or unidentifiable-as in antimicrobial resistance (AMR) spread, where contact, environmental, and selection ontologies compete. We introduce Procela, a Python framework where variables act as epistemic authorities that maintain complete hypothesis memory, mechanisms encode competing ontologies as causal units, and governance observes epistemic signals and mutates system topology at runtime. This is the first framework where simulations test their own assumptions. We instantiate Procela for AMR in a hospital network with three competing families. Governance detects coverage decay, policy fragility, and runs structural probes. Results show 20.4% error reduction and 69% cumulative regret improvement over baseline. All experiments are reproducible with full auditability. Procela establishes a new paradigm: simulations that model not only the world but their own modeling process, enabling adaptation under structural uncertainty.
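
"治理模块观测认识信号并在运行时切换竞争本体"的机制,可用如下玩具示意(三个候选机制、滚动误差窗口与切换规则都是我们的大幅简化,并非 Procela 的实际设计):

```python
from collections import deque

# "治理切换竞争本体"的玩具示意(非 Procela 实际设计):
# 每个候选机制是一个预测器;治理模块监视滚动误差,
# 当另一候选在近期明显更优时切换活动机制(变异拓扑)。

MECHANISMS = {
    "contact":     lambda t: 2.0 * t,    # 假想的候选机制模型
    "selection":   lambda t: t * t,
    "environment": lambda t: 10.0,
}

def truth(t):          # 真实因果结构发生漂移:前期线性,后期二次
    return 2.0 * t if t < 10 else t * t

def run(horizon=20, window=3):
    errors = {name: deque(maxlen=window) for name in MECHANISMS}
    active, switches = "contact", []
    for t in range(horizon):
        y = truth(t)
        for name, f in MECHANISMS.items():
            errors[name].append(abs(f(t) - y))
        best = min(MECHANISMS, key=lambda n: sum(errors[n]))
        if best != active:                  # 治理:运行时切换机制
            switches.append((t, active, best))
            active = best
    return active, switches

active, switches = run()
```

当真实动态从线性切换到二次后,滚动窗口内 "contact" 机制的误差迅速超过 "selection",治理模块随即完成一次切换;这对应摘要中"覆盖衰减检测触发结构变异"的最简版本。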

[AI-75] Prompt-Guided Prefiltering for VLM Image Compression ICME2026

【速读】:该论文旨在解决在视觉语言模型(Vision-Language Models, VLMs)应用中,传统以人类感知为中心的图像压缩方法因保留大量与任务无关的细节而导致压缩效率低下的问题。现有面向机器的图像编码(Image Coding for Machines, ICM)方法亦受限于固定下游任务假设,难以适配由文本提示驱动、目标开放的VLM场景。解决方案的关键在于提出一种轻量级、即插即用的提示引导预过滤模块(prompt-guided prefiltering module),该模块能根据文本提示识别图像中与下游任务最相关的区域,在保留关键信息的同时对非相关区域进行平滑处理,从而提升压缩效率;该模块与编解码器无关,可无缝集成至传统或学习型编码器之前,实现在保持VQA等任务准确率不变的前提下,平均码率降低25%-50%。

链接: https://arxiv.org/abs/2604.00314
作者: Bardia Azizian,Ivan V. Bajic
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures. Accepted to IEEE ICME 2026. Code: this https URL

点击查看摘要

Abstract:The rapid progress of large Vision-Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically hosted, hence efficient image compression becomes crucial. However, traditional human-centric codecs are suboptimal in this setting because they preserve many task-irrelevant details. Existing Image Coding for Machines (ICM) methods also fall short, as they assume a fixed set of downstream tasks and cannot adapt to prompt-driven VLMs with an open-ended variety of objectives. We propose a lightweight, plug-and-play, prompt-guided prefiltering module to identify image regions most relevant to the text prompt, and consequently to the downstream task. The module preserves important details while smoothing out less relevant areas to improve compression efficiency. It is codec-agnostic and can be applied before conventional and learned encoders. Experiments on several VQA benchmarks show that our approach achieves a 25-50% average bitrate reduction while maintaining the same task accuracy. Our source code is available at this https URL.
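
"按提示保留相关区域、平滑其余区域"的预过滤步骤可以示意如下。注意:真实系统中相关性掩码由文本提示预测,这里手工给定;模糊核用简单盒滤波代替,仅演示"混合保真/平滑"这一核心操作:

```python
import numpy as np

# 提示引导预过滤的示意:掩码内保持原样,掩码外做盒滤波平滑,
# 让下游编解码器在无关区域花费更少码率。
# (真实系统中掩码由文本提示预测;此处手工给定。)

def box_blur(img, k=5):
    """可分离盒滤波的朴素实现,边界采用边缘复制填充。"""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def prefilter(img, mask, k=5):
    """混合:mask==1(提示相关)处保持锐利,其余区域被模糊。"""
    return mask * img + (1 - mask) * box_blur(img, k)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0              # 假想提示关心的区域

out = prefilter(img, mask)
# 掩码内逐像素不变;掩码外方差下降(高频细节被抹去)
inside_same = np.allclose(out[8:24, 8:24], img[8:24, 8:24])
var_drop = img[:4, :].var() - out[:4, :].var()
```

掩码外方差下降意味着这些区域熵更低、编码所需比特更少,这正是论文所报告码率节省的直观来源。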

[AI-76] GenoBERT: A Language Model for Accurate Genotype Imputation

【速读】:该论文旨在解决传统基因型填补(genotype imputation)方法中存在的祖先偏倚(ancestry bias)和稀有变异(rare variant)准确性不足的问题。其解决方案的关键在于提出了一种基于Transformer架构的无参考面板(reference-free)框架GenoBERT,通过将相位化的基因型进行分词(tokenization),并利用自注意力机制捕捉短程和长程连锁不平衡(linkage disequilibrium, LD)依赖关系,从而在不同人群和多种缺失率(5%-50%)条件下实现高精度填补,尤其在实际应用中(缺失率≤25%)达到接近0.98的r²值,并在极端缺失情况下(50%)仍保持稳定性能(r² > 0.90)。

链接: https://arxiv.org/abs/2604.00058
作者: Lei Huang,Chuan Qiu,Kuan-Jui Su,Anqi Liu,Yun Gong,Weiqiang Lin,Lindong Jiang,Chen Zhao,Meng Song,Jeffrey Deng,Qing Tian,Zhe Luo,Ping Gong,Hui Shen,Chaoyang Zhang,Hong-Wen Deng
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets including the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP) across ancestry groups and multiple genotype missingness levels (5-50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high overall imputation accuracy (r^2 ≈ 0.98) across datasets, and maintains robust performance (r^2 > 0.90) even at 50% missingness. Experimental results across different ancestries confirm consistent gains across datasets, with resilience to small sample sizes and weak LD. A 128-SNP (single-nucleotide polymorphism) context window (approximately 100 Kb) is validated through LD-decay analyses as sufficient to capture local correlation structures. By eliminating reference-panel dependence while preserving high accuracy, GenoBERT provides a scalable and robust solution for genotype imputation and a foundation for downstream genomic modeling.
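
GenoBERT 输入侧"相位化基因型分词 + 掩码预测"的流程可示意如下(词表与掩码比例为示例;为保持自包含,用一个不看 LD 上下文的多数票填补代替真实 Transformer 模型):

```python
import random

# BERT 式基因型模型输入侧的示意:每个 SNP 的相位化基因型变成 token,
# 一部分被掩码,模型的任务是恢复它们。词表与掩码比例均为示例。

VOCAB = {"0|0": 0, "0|1": 1, "1|0": 2, "1|1": 3, "[MASK]": 4}

def tokenize(haplotypes):
    return [VOCAB[g] for g in haplotypes]

def mask_genotypes(tokens, rate=0.4, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, t in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = t            # 模型需要恢复的真值
            masked.append(VOCAB["[MASK]"])
        else:
            masked.append(t)
    return masked, targets

def impute_majority(masked, targets):
    """不使用 LD 的朴素基线:用可见 token 的众数填补掩码位点
    (真实模型用自注意力聚合 LD 上下文来做同一件事)。"""
    visible = [t for t in masked if t != VOCAB["[MASK]"]]
    fill = max(set(visible), key=visible.count)
    return [fill if t == VOCAB["[MASK]"] else t for t in masked]

sample = ["0|0", "0|1", "0|0", "0|0", "1|1", "0|0", "0|0", "0|1"]
tokens = tokenize(sample)
masked, targets = mask_genotypes(tokens)
recovered = impute_majority(masked, targets)
acc = sum(recovered[i] == t for i, t in targets.items()) / max(len(targets), 1)
```

朴素多数票只能填对常见等位型,对稀有变异无能为力——这正是论文用注意力建模短/长程 LD 依赖所要补上的缺口。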

[AI-77] Whittaker-Henderson smoother for long satellite image time series interpolation

【速读】:该论文旨在解决Whittaker平滑器在处理卫星图像时间序列(SITS)时的两个关键局限性:一是平滑参数需对每个像素单独调优,二是标准公式假设噪声同方差(homoscedastic noise),导致时间维度上平滑强度均匀,无法适应局部噪声变化。解决方案的核心在于将Whittaker平滑器建模为可微神经层,通过神经网络自动推断平滑参数,并引入时变正则化项以应对异方差噪声(heteroscedastic noise),从而实现沿时间轴自适应的局部平滑。此外,为支持大规模处理,提出基于Cholesky分解的稀疏、内存高效且完全可微的实现方式,显著提升GPU上的计算效率与内存利用率。

链接: https://arxiv.org/abs/2604.00048
作者: Mathieu Fauvel (CESBIO)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Whittaker smoother is a widely adopted solution to pre-process satellite image time series. Yet, two key limitations remain: the smoothing parameter must be tuned individually for each pixel, and the standard formulation assumes homoscedastic noise, imposing uniform smoothing across the temporal dimension. This paper addresses both limitations by casting the Whittaker smoother as a differentiable neural layer, in which the smoothing parameter is inferred by a neural network. The framework is further extended to handle heteroscedastic noise through a time-varying regularization, allowing the degree of smoothing to adapt locally along the time series. To enable large-scale processing, a sparse, memory-efficient, and fully differentiable implementation is proposed, exploiting the symmetric banded structure of the underlying linear system via Cholesky factorization. Benchmarks on GPU demonstrate that this implementation substantially outperforms standard dense linear solvers, both in speed and memory consumption. The approach is validated on SITS acquired over the French metropolitan territory between 2016 and 2024. Results confirm the feasibility of large-scale heteroscedastic Whittaker smoothing, though reconstruction differences with the homoscedastic baseline remain limited, suggesting that the transformer architecture used for smoothing parameter estimation may lack the temporal acuity needed to capture abrupt noise variations such as singleday cloud contamination.

[AI-78] When and Where: A Model Hippocampal Network Unifies Formation of Time Cells and Place Cells

【速读】:该论文旨在解决 hippocampal place cells(位置细胞)与 time cells(时间细胞)虽共享神经底物却长期被建模为功能与机制迥异的两类神经表征的问题——前者被视为连续吸引子(continuous attractor),后者则被解释为漏积分器(leaky integrator)。其解决方案的关键在于提出一个统一的递归神经网络(Recurrent Neural Network, RNN)模型,将海马CA3区域建模为预测自动编码器(predictive autoencoder),该模型通过训练以重建部分遮蔽的“经验向量”(experience vectors),自然涌现出两种细胞类型:在空间导航任务中生成稳定的吸引子样位置场(place fields),而在处理时序结构输入时则产生逐渐扩展的序列激活模式,重现时间细胞特性。通过调节时空输入模式,隐藏单元可在时间细胞样与位置细胞样表征间平滑过渡,揭示二者具有共同起源但受任务驱动差异表达的本质。

链接: https://arxiv.org/abs/2604.00036
作者: Qiaorong S. Yu,Zhaoze Wang,Vijay Balasubramanian
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Biological Physics (physics.bio-ph)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:Hippocampal place and time cells encode spatial and temporal aspects of experience. Both have the same neural substrate, but have been modeled as having different functions and mechanistic origins, place cells as continuous attractors, and time cells as leaky integrators. Here, we show that both types emerge from two dynamical regimes of a single recurrent network (RNN) modeling hippocampal CA3 as a predictive autoencoder. The network receives simulated, partially occluded experience vectors" containing spatial patterns (location-specific activity sampled during environmental traversal) and/or temporal patterns (correlated activity pairs separated by void" intervals), and is trained to reconstruct missing input. During spatial navigation, the network generates stable attractor-like place fields. But trained on temporally structured inputs, the network produces sequentially broadened fields, recapitulating time cells. By varying spatio-temporal input patterning, we observe hidden units transition smoothly between time cell-like and place cell-like representations. These results suggest a shared origin, but task-driven difference, between place and time cells.

[AI-79] Agentic AI – Physicist Collaboration in Experimental Particle Physics: A Proof-of-Concept Measurement with LEP Open Data

【Quick Read】: This paper aims to automate and speed up the precision measurement of the thrust distribution in high-energy physics experiments while exploring AI's role in a theory-experiment loop. The key to the solution is that AI agents (OpenAI Codex and Anthropic Claude), working under expert guidance, carry out the entire analysis pipeline, including data processing, Monte Carlo based corrections, and Iterative Bayesian Unfolding, to obtain a fully corrected thrust spectrum. This not only improves measurement efficiency and precision but also provides a scalable paradigm for building an AI-driven loop of scientific discovery.

Link: https://arxiv.org/abs/2603.05735
Authors: Anthony Badea, Yi Chen, Marcello Maggi, Yen-Jie Lee, Electron-Positron Alliance
Affiliation: unknown
Subjects: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph)
Comments:

Abstract:We present an AI agentic measurement of the thrust distribution in e^+e^- collisions at \sqrt{s} = 91.2 GeV using archived ALEPH data. The analysis and all note writing is carried out entirely by AI agents (OpenAI Codex and Anthropic Claude) under expert physicist direction. A fully corrected spectrum is obtained via Iterative Bayesian Unfolding and Monte Carlo based corrections. This work represents a step toward a theory-experiment loop in which AI agents assist with experimental measurements and theoretical calculations, and synthesize insights by comparing the results, thereby accelerating the cycle that drives discovery in fundamental physics. Our work suggests that precision physics, leveraging the open LEP data and advanced theoretical landscape, provides an ideal testing ground for developing advanced AI systems for scientific applications.
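
The unfolding step named in the abstract, Iterative Bayesian Unfolding (in the d'Agostini style), can be sketched in a few lines. This is a generic illustration, not the agents' actual code: the 2-bin response matrix, truth counts, and iteration budget below are invented for the example.

```python
import numpy as np

def iterative_bayesian_unfolding(measured, response, n_iter=10):
    """Sketch of d'Agostini-style iterative Bayesian unfolding.

    measured : observed (smeared) counts, shape (n_obs,)
    response : P(observed bin i | true bin j), columns sum to 1,
               shape (n_obs, n_true)
    Returns an estimate of the true-level spectrum.
    """
    n_true = response.shape[1]
    prior = np.full(n_true, measured.sum() / n_true)  # flat starting prior
    for _ in range(n_iter):
        # Bayes: P(true j | obs i) proportional to response[i, j] * prior[j]
        joint = response * prior                       # (n_obs, n_true)
        post = joint / joint.sum(axis=1, keepdims=True)
        prior = post.T @ measured                      # fold data back to truth
    return prior

# Toy check: a diagonal-dominant 2-bin smearing applied to a known truth
R = np.array([[0.9, 0.2],
              [0.1, 0.8]])
truth = np.array([100.0, 50.0])
measured = R @ truth
unfolded = iterative_bayesian_unfolding(measured, R, n_iter=50)
```

With exact (noise-free) data and an invertible response, the iterations converge to the true spectrum while preserving the total number of events.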

Machine Learning

[LG-0] NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Link: https://arxiv.org/abs/2604.01175
Authors: Prasanjit Dey, Soumyabrata Dev, Angela Meyer, Bianca Schoen-Phelan
Subjects: Machine Learning (cs.LG)
Comments: This manuscript is under review

Abstract:Accurate air quality forecasting is crucial for protecting public health and guiding environmental policy, yet it remains challenging due to nonlinear spatiotemporal dynamics, wind-driven transport, and distribution shifts across regions. Physics-based models are interpretable but computationally expensive and often rely on restrictive assumptions, whereas purely data-driven models can be accurate but may lack robustness and calibrated uncertainty. To address these limitations, we propose Neural Dynamic Diffusion-Advection Fields (NeuroDDAF), a physics-informed forecasting framework that unifies neural representation learning with open-system transport modeling. NeuroDDAF integrates (i) a GRU-Graph Attention encoder to capture temporal dynamics and wind-aware spatial interactions, (ii) a Fourier-domain diffusion-advection module with learnable residuals, (iii) a wind-modulated latent Neural ODE to model continuous-time evolution under time-varying connectivity, and (iv) an evidential fusion mechanism that adaptively combines physics-guided and neural forecasts while quantifying uncertainty. Experiments on four urban datasets (Beijing, Shenzhen, Tianjin, and Ancona) across 1-3 day horizons show that NeuroDDAF consistently outperforms strong baselines, including AirPhyNet, achieving up to 9.7% reduction in RMSE and 9.4% reduction in MAE on long-term forecasts. On the Beijing dataset, NeuroDDAF attains an RMSE of 41.63 \mu g/m^3 for 1-day prediction and 48.88 \mu g/m^3 for 3-day prediction, representing the best performance among all compared methods. In addition, NeuroDDAF improves cross-city generalization and yields well-calibrated uncertainty estimates, as confirmed by ensemble variance analysis and case studies under varying wind conditions.
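
The fusion idea in component (iv), adaptively combining a physics-guided and a neural forecast while tracking uncertainty, can be illustrated with plain inverse-variance weighting. This is a simplified stand-in for the paper's evidential mechanism; the means and variances below are invented.

```python
import numpy as np

def precision_weighted_fusion(mu_phys, var_phys, mu_nn, var_nn):
    """Fuse two forecasts by inverse-variance (precision) weighting.

    Each source contributes in proportion to its confidence, and the
    fused variance quantifies the remaining uncertainty.
    """
    w_phys = 1.0 / var_phys
    w_nn = 1.0 / var_nn
    mu = (w_phys * mu_phys + w_nn * mu_nn) / (w_phys + w_nn)
    var = 1.0 / (w_phys + w_nn)
    return mu, var

# The more confident source dominates the fused (hypothetical) PM2.5 forecast.
mu, var = precision_weighted_fusion(mu_phys=60.0, var_phys=4.0,
                                    mu_nn=40.0, var_nn=1.0)
```

The fused estimate lands nearer the low-variance neural forecast, and the fused variance is smaller than either input variance.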

[LG-1] Safe learning-based control via function-based uncertainty quantification

Link: https://arxiv.org/abs/2604.01173
Authors: Abdullah Tokmak, Toni Karvonen, Thomas B. Schön, Dominik Baumann
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Under review for CDC 2026

Abstract:Uncertainty quantification is essential when deploying learning-based control methods in safety-critical systems. This is commonly realized by constructing uncertainty tubes that enclose the unknown function of interest, e.g., the reward and constraint functions or the underlying dynamics model, with high probability. However, existing approaches for uncertainty quantification typically rely on restrictive assumptions on the unknown function, such as known bounds on functional norms or Lipschitz constants, and struggle with discontinuities. In this paper, we model the unknown function as a random function from which independent and identically distributed realizations can be generated, and construct uncertainty tubes via the scenario approach that hold with high probability and rely solely on the sampled realizations. We integrate these uncertainty tubes into a safe Bayesian optimization algorithm, which we then use to safely tune control parameters on a real Furuta pendulum.
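
The scenario approach the paper builds on constructs uncertainty tubes purely from sampled realizations of the unknown function. A minimal sketch, with a made-up random-sinusoid family standing in for the unknown reward/constraint/dynamics function:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_realization(x):
    # i.i.d. draws of a random function: a random sinusoid (illustrative
    # stand-in for the unknown function of interest)
    a, phi = rng.normal(1.0, 0.1), rng.normal(0.0, 0.1)
    return a * np.sin(x + phi)

x = np.linspace(0, 2 * np.pi, 50)
scenarios = np.stack([sample_realization(x) for _ in range(200)])

# Scenario-approach tube: pointwise envelope of the sampled realizations.
lower, upper = scenarios.min(axis=0), scenarios.max(axis=0)

# A fresh realization should fall inside the tube at most points.
fresh = sample_realization(x)
inside_frac = np.mean((fresh >= lower) & (fresh <= upper))
```

No norm bounds or Lipschitz constants are assumed; the tube comes entirely from the sampled realizations, which is the property the paper exploits.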

[LG-2] Bridging the Simulation-to-Experiment Gap with Generative Models using Adversarial Distribution Alignment

Link: https://arxiv.org/abs/2604.01169
Authors: Kai Nelson, Tobias Kreiman, Sergey Levine, Aditi S. Krishnapriyan
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Biomolecules (q-bio.BM)
Comments:

Abstract:A fundamental challenge in science and engineering is the simulation-to-experiment gap. While we often possess prior knowledge of physical laws, these physical laws can be too difficult to solve exactly for complex systems. Such systems are commonly modeled using simulators, which impose computational approximations. Meanwhile, experimental measurements more faithfully represent the real world, but experimental data typically consists of observations that only partially reflect the system’s full underlying state. We propose a data-driven distribution alignment framework that bridges this simulation-to-experiment gap by pre-training a generative model on fully observed (but imperfect) simulation data, then aligning it with partial (but real) observations of experimental data. While our method is domain-agnostic, we ground our approach in the physical sciences by introducing Adversarial Distribution Alignment (ADA). This method aligns a generative model of atomic positions – initially trained on a simulated Boltzmann distribution – with the distribution of experimental observations. We prove that our method recovers the target observable distribution, even with multiple, potentially correlated observables. We also empirically validate our framework on synthetic, molecular, and experimental protein data, demonstrating that it can align generative models with diverse observables. Our code is available at this https URL.

[LG-3] Reasoning Shift: How Context Silently Shortens LLM Reasoning

Link: https://arxiv.org/abs/2604.01161
Authors: Gleb Rodionov
Subjects: Machine Learning (cs.LG)
Comments: Preprint, work in progress

Abstract:Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.

[LG-4] Property-Level Flood Risk Assessment Using AI-Enabled Street-View Lowest Floor Elevation Extraction and ML Imputation Across Texas

Link: https://arxiv.org/abs/2604.01153
Authors: Xiangpeng Li, Yu-Hsuan Ho, Sam D Brody, Ali Mostafavi
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:This paper argues that AI-enabled analysis of street-view imagery, complemented by performance-gated machine-learning imputation, provides a viable pathway for generating building-specific elevation data at regional scale for flood risk assessment. We develop and apply a three-stage pipeline across 18 areas of interest (AOIs) in Texas that (1) extracts the lowest floor elevation (LFE) and the height difference between street grade and the lowest floor (HDSL) from Google Street View imagery using the Elev-Vision framework, (2) imputes missing HDSL values with Random Forest and Gradient Boosting models trained on 16 terrain, hydrologic, geographic, and flood-exposure features, and (3) integrates the resulting elevation dataset with Fathom 1-in-100 year inundation surfaces and USACE depth-damage functions to estimate property-specific interior flood depth and expected loss. Across 12,241 residential structures, street-view imagery was available for 73.4% of parcels and direct LFE/HDSL extraction was successful for 49.0% (5,992 structures). Imputation was retained for 13 AOIs where cross-validated performance was defensible, with selected models achieving R^2 values from 0.159 to 0.974; five AOIs were explicitly excluded from prediction because performance was insufficient. The results show that street-view-based elevation mapping is not universally available for every property, but it is sufficiently scalable to materially improve regional flood-risk characterization by moving beyond hazard exposure to structure-level estimates of interior inundation and expected damage. Scientifically, the study advances LFE estimation from a pilot-scale proof of concept to a regional, end-to-end workflow. Practically, it offers a replicable framework for jurisdictions that lack comprehensive Elevation Certificates but need parcel-level information to support mitigation, planning, and flood-risk management.
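
Stage (3) of the pipeline, turning an inundation depth and a structure's HDSL into an expected loss via a depth-damage function, can be sketched as below. The damage curve and dollar figures here are illustrative inventions, not the USACE curves used in the paper.

```python
import numpy as np

def expected_loss(inundation_depth_ft, hdsl_ft, building_value,
                  curve_depths, curve_fracs):
    """Property-level flood loss sketch with a hypothetical damage curve.

    Interior depth = hazard inundation depth at the parcel minus the
    height difference between street grade and lowest floor (HDSL);
    the loss follows an interpolated depth-damage function.
    """
    interior_depth = max(inundation_depth_ft - hdsl_ft, 0.0)
    damage_frac = np.interp(interior_depth, curve_depths, curve_fracs)
    return interior_depth, damage_frac * building_value

# Illustrative curve: 0 ft -> 0% damage, 2 ft -> 30%, 6 ft -> 80%
depths = np.array([0.0, 2.0, 6.0])
fracs = np.array([0.0, 0.3, 0.8])

interior, loss = expected_loss(3.5, 1.5, 250_000.0, depths, fracs)
# An elevated structure (lowest floor above the flood) sees no interior water:
interior_dry, loss_dry = expected_loss(1.0, 1.5, 250_000.0, depths, fracs)
```

This is why HDSL matters: two parcels with identical hazard depths can have very different interior depths and losses.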

[LG-5] Deep Reinforcement Learning for Robotic Manipulation under Distribution Shift with Bounded Extremum Seeking

Link: https://arxiv.org/abs/2604.01142
Authors: Shaifalee Saxena, Rafael Fierro, Alexander Scheinker
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at inference time. In this paper, we investigate a hybrid controller that combines reinforcement learning with bounded extremum seeking (ES) to improve robustness under such conditions. In the proposed approach, deep deterministic policy gradient (DDPG) policies are trained under standard conditions on the robotic pushing and pick-and-place tasks, and are then combined with bounded ES during deployment. The RL policy provides fast manipulation behavior, while bounded ES ensures robustness of the overall controller to time variations when operating conditions depart from those seen during training. The resulting controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.
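
The extremum-seeking half of the hybrid controller can be illustrated on a scalar toy problem: a sinusoidal dither probes an unknown cost, and demodulation extracts a gradient estimate that drives the parameter toward the optimum. The gains, dither frequency, and quadratic cost below are invented for the sketch; the paper's bounded ES variant has a different update law, only hinted at here by clipping.

```python
import numpy as np

def extremum_seeking(cost, theta0, steps=4000, dt=0.01,
                     a=0.2, omega=20.0, k=1.0):
    """Minimal scalar extremum-seeking loop (model-free optimization).

    A dither a*sin(omega*t) perturbs the parameter; multiplying the
    measured cost by the same sinusoid demodulates a gradient estimate,
    so on average theta descends the unknown cost.
    """
    theta = theta0
    for i in range(steps):
        t = i * dt
        probe = theta + a * np.sin(omega * t)
        J = cost(probe)                         # only cost values are observed
        theta -= dt * k * J * np.sin(omega * t)  # demodulated gradient descent
        theta = np.clip(theta, -10.0, 10.0)      # crude bound, cf. bounded ES
    return float(theta)

# Cost unknown to the controller, with its minimum at u* = 2
theta_final = extremum_seeking(lambda u: (u - 2.0) ** 2, theta0=0.0)
```

Averaging theory gives the effective dynamics d(theta)/dt ≈ -(k*a/2) * J'(theta), so the parameter converges to a neighborhood of the minimizer without any model of the cost.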

[LG-6] Reconsidering Dependency Networks from an Information Geometry Perspective

Link: https://arxiv.org/abs/2604.01117
Authors: Kazuya Takabatake, Shotaro Akaho
Subjects: Machine Learning (cs.LG)
Comments: 25 pages, 7 figures

Abstract:Dependency networks (Heckerman et al., 2000) provide a flexible framework for modeling complex systems with many variables by combining independently learned local conditional distributions through pseudo-Gibbs sampling. Despite their computational advantages over Bayesian and Markov networks, the theoretical foundations of dependency networks remain incomplete, primarily because their model distributions – defined as stationary distributions of pseudo-Gibbs sampling – lack closed-form expressions. This paper develops an information-geometric analysis of pseudo-Gibbs sampling, interpreting each sampling step as an m-projection onto a full conditional manifold. Building on this interpretation, we introduce the full conditional divergence and derive an upper bound that characterizes the location of the stationary distribution in the space of probability distributions. We then reformulate both structure and parameter learning as optimization problems that decompose into independent subproblems for each node, and prove that the learned model distribution converges to the true underlying distribution as the number of training samples grows to infinity. Experiments confirm that the proposed upper bound is tight in practice.
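
Pseudo-Gibbs sampling, the process whose stationary distribution defines a dependency network's model distribution, is easy to sketch on two binary variables. Here the local conditionals are deliberately derived from a known joint so the stationary distribution can be checked; in a real dependency network the conditionals are learned independently and may be mutually inconsistent, which is exactly why the stationary distribution lacks a closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# A known joint P(x1, x2) proportional to [[1, 2], [3, 4]] (toy example)
joint = np.array([[1.0, 2.0], [3.0, 4.0]])
joint /= joint.sum()

def p_x1_given_x2(x2):      # P(x1 = 1 | x2), read off the joint
    col = joint[:, x2]
    return col[1] / col.sum()

def p_x2_given_x1(x1):      # P(x2 = 1 | x1)
    row = joint[x1, :]
    return row[1] / row.sum()

# Pseudo-Gibbs sampling: resample each node in turn from its own
# local conditional; record the state after every full sweep.
x1, x2 = 0, 0
counts = np.zeros((2, 2))
for _ in range(20000):
    x1 = int(rng.random() < p_x1_given_x2(x2))
    x2 = int(rng.random() < p_x2_given_x1(x1))
    counts[x1, x2] += 1
empirical = counts / counts.sum()
```

Because these conditionals are consistent, the chain's stationary distribution is the joint itself; with learned, inconsistent conditionals the same sampler still converges, but to a distribution that the paper characterizes geometrically.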

[LG-7] Model-Based Learning of Near-Optimal Finite-Window Policies in POMDPs

Link: https://arxiv.org/abs/2604.01024
Authors: Philip Jordan, Maryam Kamgarpour
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
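
The superstate construction, turning a POMDP trajectory into transition counts of a finite-state MDP over action-observation windows, can be sketched as follows (the toy trajectory and window length are made up):

```python
from collections import Counter, defaultdict

def superstate_transitions(actions, observations, window=2):
    """Empirical transition counts of the 'superstate' MDP whose states
    are finite action-observation windows, built from one trajectory.
    """
    pairs = list(zip(actions, observations))
    counts = defaultdict(Counter)
    for t in range(window, len(pairs)):
        s = tuple(pairs[t - window:t])          # current window state
        s_next = tuple(pairs[t - window + 1:t + 1])  # shifted by one step
        counts[s][s_next] += 1
    return counts

# Toy alternating trajectory: the window states cycle between two values
acts = [0, 1, 0, 1, 0, 1]
obs = ['a', 'b', 'a', 'b', 'a', 'b']
counts = superstate_transitions(acts, obs, window=2)
```

Normalizing each Counter gives an estimated superstate transition model, to which standard MDP planning (e.g., value iteration) can then be applied, which is the overall recipe the paper analyzes.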

[LG-8] EmbedPart: Embedding-Driven Graph Partitioning for Scalable Graph Neural Network Training

Link: https://arxiv.org/abs/2604.01000
Authors: Nikolai Merkel, Ruben Mayer, Volker Markl, Hans-Arno Jacobsen
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Graph Neural Networks (GNNs) are widely used for learning on graph-structured data, but scaling GNN training to massive graphs remains challenging. To enable scalable distributed training, graphs are divided into smaller partitions that are distributed across multiple machines such that inter-machine communication is minimized and computational load is balanced. In practice, existing partitioning approaches face a fundamental trade-off between partitioning overhead and partitioning quality. We propose EmbedPart, an embedding-driven partitioning approach that achieves both speed and quality. Instead of operating directly on irregular graph structures, EmbedPart leverages node embeddings produced during the actual GNN training workload and clusters these dense embeddings to derive a partitioning. EmbedPart achieves more than 100x speedup over Metis while maintaining competitive partitioning quality and accelerating distributed GNN training. Moreover, EmbedPart naturally supports graph updates and fast repartitioning, and can be applied to graph reordering to improve data locality and accelerate single-machine GNN training. By shifting partitioning from irregular graph structures to dense embeddings, EmbedPart enables scalable and high-quality graph data optimization.
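
EmbedPart's core move, clustering dense node embeddings instead of cutting the irregular graph structure, can be sketched with a plain k-means. The embeddings below are synthetic; the paper uses embeddings produced during GNN training plus balancing logic not shown here.

```python
import numpy as np

def kmeans_partition(embeddings, k, iters=20, seed=0):
    """Partition nodes by clustering their embedding vectors (Lloyd's
    algorithm): nodes with similar embeddings land in the same part.
    """
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = embeddings[assign == j].mean(axis=0)
    return assign

# Two well-separated embedding clouds -> two balanced parts
rng = np.random.default_rng(42)
emb = np.vstack([rng.normal(0, 0.1, (50, 8)),
                 rng.normal(5, 0.1, (50, 8))])
parts = kmeans_partition(emb, k=2)
```

Operating on dense vectors is what makes this fast relative to structural partitioners such as Metis, and re-clustering after graph updates is cheap for the same reason.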

[LG-9] Rapid mixing in positively weighted restricted Boltzmann machines

Link: https://arxiv.org/abs/2604.00963
Authors: Weiming Feng, Heng Guo, Minji Yang
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:We show polylogarithmic mixing time bounds for the alternating-scan sampler for positively weighted restricted Boltzmann machines. This is done via analysing the same chain and the Glauber dynamics for ferromagnetic two-spin systems, where we obtain new mixing time bounds up to the critical thresholds.
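
The alternating-scan sampler analyzed in the abstract is block Gibbs on an RBM: resample all hidden units given the visibles, then all visibles given the hiddens. A small sketch with made-up non-negative weights (the "positively weighted" regime the result concerns):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def alternating_scan(W, b, c, steps=5000):
    """Block-Gibbs (alternating-scan) sampler for a binary RBM.

    W : (n_visible, n_hidden) weights; b, c : visible/hidden biases.
    Returns the empirical marginals P(v_i = 1) over the run.
    """
    n_v, n_h = W.shape
    v = rng.integers(0, 2, n_v)
    visits = np.zeros(n_v)
    for _ in range(steps):
        h = (rng.random(n_h) < sigmoid(v @ W + c)).astype(int)  # hiddens | visibles
        v = (rng.random(n_v) < sigmoid(W @ h + b)).astype(int)  # visibles | hiddens
        visits += v
    return visits / steps

W = np.array([[0.5, 0.2], [0.3, 0.4]])   # positive weights: ferromagnetic-like
marginals = alternating_scan(W, b=np.zeros(2), c=np.zeros(2))
```

With zero biases and positive weights, configurations with more active units carry extra probability mass, so both visible marginals sit above 1/2; the paper's contribution is proving that this chain mixes in polylogarithmic time in the positive-weight regime.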

[LG-10] Differentially Private Manifold Denoising

Link: https://arxiv.org/abs/2604.00942
Authors: Jiaqi Wu, Yiqing Sun, Zhigang Yao
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Statistics Theory (math.ST)
Comments: 59 pages

Abstract:We introduce a differentially private manifold denoising framework that allows users to exploit sensitive reference datasets to correct noisy, non-private query points without compromising privacy. The method follows an iterative procedure that (i) privately estimates local means and tangent geometry using the reference data under calibrated sensitivity, (ii) projects query points along the privately estimated subspace toward the local mean via corrective steps at each iteration, and (iii) performs rigorous privacy accounting across iterations and queries using (\varepsilon, \delta)-differential privacy (DP). Conceptually, this framework brings differential privacy to manifold methods, retaining sufficient geometric signal for downstream tasks such as embedding, clustering, and visualization, while providing formal DP guarantees for the reference data. Practically, the procedure is modular and scalable, separating DP-protected local geometry (means and tangents) from budgeted query-point updates, with a simple scheduler allocating privacy budget across iterations and queries. Under standard assumptions on manifold regularity, sampling density, and measurement noise, we establish high-probability utility guarantees showing that corrected queries converge toward the manifold at a non-asymptotic rate governed by sample size, noise level, bandwidth, and the privacy budget. Simulations and case studies demonstrate accurate signal recovery under moderate privacy budgets, illustrating clear utility-privacy trade-offs and providing a deployable DP component for manifold-based workflows in regulated environments without reengineering privacy systems.
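
Step (i), privately estimating a local mean under calibrated sensitivity, can be sketched with the standard Gaussian mechanism. The clipping norm, privacy budget, and data below are illustrative; the paper's full procedure (tangent estimation, iterative projection, budget scheduling) is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_local_mean(points, clip=1.0, eps=1.0, delta=1e-5):
    """Gaussian-mechanism sketch of a private local mean.

    Per-point norm clipping bounds the replace-one sensitivity of the
    mean at 2*clip/n; the noise scale follows the classic calibration
    sigma = sqrt(2 ln(1.25/delta)) * sensitivity / eps.
    """
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    clipped = points * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    mean = clipped.mean(axis=0)
    sensitivity = 2.0 * clip / len(points)
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / eps
    return mean + rng.normal(0.0, sigma, size=mean.shape)

pts = rng.normal(0.0, 0.2, size=(2000, 3))  # local neighborhood of reference data
priv_mean = dp_local_mean(pts, clip=1.0, eps=1.0)
```

With 2000 reference points the sensitivity is small, so the privatized mean stays close to the true local mean, which is what lets the corrective projection steps retain geometric signal.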

[LG-11] Generalization Bounds for Spectral GNNs via Fourier Domain Analysis AISTATS2026

Link: https://arxiv.org/abs/2604.00918
Authors: Vahan A. Martirosyan, Daniele Malitesta, Hugues Talbot, Jhony H. Giraldo, Fragkiskos D. Malliaros
Subjects: Machine Learning (cs.LG)
Comments: Accepted to AISTATS 2026

Abstract:Spectral graph neural networks learn graph filters, but their behavior with increasing depth and polynomial order is not well understood. We analyze these models in the graph Fourier domain, where each layer becomes an element-wise frequency update, separating the fixed spectrum from trainable parameters and making depth and order explicit. In this setting, we show that Gaussian complexity is invariant under the Graph Fourier Transform, which allows us to derive data-dependent, depth, and order-aware generalization bounds together with stability estimates. In the linear case, our bounds are tighter, and on real graphs, the data-dependent term correlates with the generalization gap across polynomial bases, highlighting practical choices that avoid frequency amplification across layers.
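
The paper's starting point, that a spectral layer is an element-wise update in the graph Fourier domain, can be made concrete on a small graph: the Laplacian eigenvectors define the graph Fourier transform (GFT), and a polynomial filter acts on the eigenvalues (frequencies). The graph and filter below are toy choices for illustration.

```python
import numpy as np

# Path graph on 4 nodes and its combinatorial Laplacian
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)          # eigenvalues = frequencies, U = GFT basis

def spectral_filter(x, coeffs):
    """Apply h(L) x = U h(Lambda) U^T x, with h a polynomial given by
    `coeffs` in increasing degree (h(l) = c0 + c1*l + ...)."""
    h = np.polyval(list(coeffs)[::-1], lam)  # element-wise frequency response
    return U @ (h * (U.T @ x))

x = np.array([1.0, -1.0, 1.0, -1.0])
y_identity = spectral_filter(x, [1.0])       # h(l) = 1 leaves x unchanged
y_linear = spectral_filter(x, [0.0, 1.0])    # h(l) = l reproduces L @ x
```

Separating the fixed spectrum `lam` from the trainable coefficients is exactly the decomposition the paper's Fourier-domain analysis exploits.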

[LG-12] Orthogonal Learner for Estimating Heterogeneous Long-Term Treatment Effects

Link: https://arxiv.org/abs/2604.00915
Authors: Haorui Ma, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Estimation of heterogeneous long-term treatment effects (HLTEs) is widely used for personalized decision-making in marketing, economics, and medicine, where short-term randomized experiments are often combined with long-term observational data. However, HLTE estimation is challenging due to limited overlap in treatment or in observing long-term outcomes for certain subpopulations, which can lead to unstable HLTE estimates with large finite-sample variance. To address this challenge, we introduce the LT-O-learners (Long-Term Orthogonal Learners), a set of novel orthogonal learners for HLTE estimation. The learners are designed for the canonical HLTE setting that combines a short-term randomized dataset \mathcal{D}_1 with a long-term historical dataset \mathcal{D}_2. The key idea of our LT-O-Learners is to retarget the learning objective by introducing custom overlap weights that downweight samples with low overlap in treatment or in long-term observation. We show that the retargeted loss is equivalent to the weighted oracle loss and satisfies Neyman-orthogonality, which means our learners are robust to errors in the nuisance estimation. We further provide a general error bound for the LT-O-Learners and give the conditions under which quasi-oracle rate can be achieved. Finally, our LT-O-learners are model-agnostic and can thus be instantiated with arbitrary machine learning models. We conduct empirical evaluations on synthetic and semi-synthetic benchmarks to confirm the theoretical properties of our LT-O-Learners, especially the robustness in low-overlap settings. To the best of our knowledge, ours are the first orthogonal learners for HLTE estimation that are robust to low overlap that is common in long-term outcomes.
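
The retargeting idea, downweighting samples with poor overlap, can be illustrated with the classic overlap weights w(x) = e(x)(1 - e(x)) from the treatment-effect literature. The paper's custom weights (covering overlap in both treatment and long-term observation) are analogous in spirit but not identical to this sketch.

```python
import numpy as np

def overlap_weights(propensity):
    """Classic overlap weights w(x) = e(x) * (1 - e(x)).

    Largest where treated and control units are balanced (e = 0.5) and
    vanishing where overlap is poor (e near 0 or 1), so low-overlap
    samples contribute little to the retargeted loss.
    """
    return propensity * (1.0 - propensity)

e = np.array([0.02, 0.5, 0.98])   # hypothetical propensity scores
w = overlap_weights(e)
```

Downweighting the extremes is what tames the finite-sample variance blow-up that near-zero or near-one propensities would otherwise cause.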

[LG-13] Event Embedding of Protein Networks : Compositional Learning of Biological Function ICLR2026

Link: https://arxiv.org/abs/2604.00911
Authors: Antonin Sulc
Subjects: Machine Learning (cs.LG)
Comments: Machine Learning for Genomics Explorations (MLGenX) ICLR 2026 Workshop

Abstract:In this work, we study whether enforcing strict compositional structure in sequence embeddings yields meaningful geometric organization when applied to protein-protein interaction networks. Using Event2Vec, an additive sequence embedding model, we train 64-dimensional representations on random walks from the human STRING interactome, and compare against a DeepWalk baseline based on Word2Vec, trained on the same walks. We find that compositional structure substantially improves pathway coherence (30.2× vs 2.9× above random), functional analogy accuracy (mean similarity 0.966 vs 0.650), and hierarchical pathway organization, while geometric properties such as norm–degree anticorrelation are shared with or exceeded by the non-compositional baseline. These results indicate that enforced compositionality specifically benefits relational and compositional reasoning tasks in biological networks.
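
The strict compositional structure at the heart of the comparison, walk embeddings formed as sums of event vectors, makes relational arithmetic exact by construction. A toy sketch with made-up event names and random vectors (not the Event2Vec training procedure itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Additive (compositional) embeddings: the representation of a walk is
# the SUM of its event vectors, so substitutions are exact vector moves.
events = {name: rng.normal(size=16)
          for name in ["geneA", "geneB", "kinase", "phosphatase"]}

def embed_walk(walk):
    return np.sum([events[e] for e in walk], axis=0)

w1 = embed_walk(["geneA", "kinase"])
w2 = embed_walk(["geneA", "phosphatase"])

# Compositional analogy: w1 - kinase + phosphatase recovers w2
analogy = w1 - events["kinase"] + events["phosphatase"]
```

In a Word2Vec-style baseline such analogies hold only approximately after training; under enforced additivity they hold by construction, which is one intuition for the reported analogy-accuracy gap.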

[LG-14] Fatigue-Aware Learning to Defer via Constrained Optimisation

Link: https://arxiv.org/abs/2604.00904
Authors: Zheng Zhang, Cuong C. Nguyen, David Rosewarne, Kevin Wells, Gustavo Carneiro
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Learning to defer (L2D) enables human-AI cooperation by deciding when an AI system should act autonomously or defer to a human expert. Existing L2D methods, however, assume static human performance, contradicting well-established findings on fatigue-induced degradation. We propose Fatigue-Aware Learning to Defer via Constrained Optimisation (FALCON), which explicitly models workload-varying human performance using psychologically grounded fatigue curves. FALCON formulates L2D as a Constrained Markov Decision Process (CMDP) whose state includes both task features and cumulative human workload, and optimises accuracy under human-AI cooperation budgets via PPO-Lagrangian training. We further introduce FA-L2D, a benchmark that systematically varies fatigue dynamics from near-static to rapidly degrading regimes. Experiments across multiple datasets show that FALCON consistently outperforms state-of-the-art L2D methods across coverage levels, generalises zero-shot to unseen experts with different fatigue patterns, and demonstrates the advantage of adaptive human-AI collaboration over AI-only or human-only decision-making when coverage lies strictly between 0 and 1.
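
The PPO-Lagrangian machinery FALCON trains with can be reduced to its dual-ascent core: the multiplier rises while the cooperation-budget constraint is violated and relaxes back toward zero once it is satisfied. The learning rate, budget, and cost sequence below are illustrative numbers, not the paper's settings.

```python
def lagrangian_updates(costs, budget, lr=0.5, lam0=0.0):
    """Projected dual ascent on the Lagrange multiplier of a CMDP.

    lam grows while the observed cost exceeds its budget and decays
    (never below zero) once the constraint holds.
    """
    lam = lam0
    history = []
    for c in costs:
        lam = max(0.0, lam + lr * (c - budget))
        history.append(lam)
    return history

# Cost above budget for 4 steps, then below: multiplier rises, then falls.
hist = lagrangian_updates([0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1],
                          budget=0.5)
```

In the full algorithm this multiplier weights the cost term inside the PPO objective, so the policy is pushed to defer less whenever the human-workload budget is being exceeded.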

[LG-15] Accurate and Scalable Matrix Mechanisms via Divide and Conquer

Link: https://arxiv.org/abs/2604.00868
Authors: Guanlin He, Yingtai Xiao, Jiamu Bai, Xin Gu, Zeyu Ding, Wenpeng Yin, Daniel Kifer
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier Factorizations, that scale to high dimensional datasets while providing optimality guarantees for workloads such as marginals and circular product queries. They operate by adding noise to a linearly independent set of queries that can compactly represent the desired workloads. In this paper, we present QuerySmasher, an alternative scalable approach based on a divide-and-conquer strategy. Given a workload that can be answered from various data marginals, QuerySmasher splits each query into sub-queries and re-assembles the pieces into mutually orthogonal sub-workloads. These sub-workloads represent small, low-dimensional problems that can be independently and optimally answered by existing low-dimensional matrix mechanisms. QuerySmasher then stitches these solutions together to answer queries in the original workload. We show that QuerySmasher subsumes prior work, like ResidualPlanner (RP), ResidualPlanner+ (RP+), and Weighted Fourier Factorizations (WFF). We prove that it can dominate those approaches, under sum squared error, for all workloads. We also experimentally demonstrate the scalability and accuracy of QuerySmasher.

[LG-16] Policy Improvement Reinforcement Learning

Link: https://arxiv.org/abs/2604.00860
Authors: Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design – updating in isolation at each step, guided only by within-group (batch) reward signals – means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones – transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.
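
PIPO's retrospective verification, checking each update against a sliding-window historical baseline, can be sketched as follows. The window length and reward sequence are invented, and the real method feeds these verification signals back into the policy update rather than just flagging them.

```python
from collections import deque

def verified_updates(rewards, window=3):
    """Flag each step as genuine improvement only if its reward beats
    the mean of a sliding window of recent history (the closed-loop
    check that open-loop RLVR methods lack)."""
    hist = deque(maxlen=window)
    flags = []
    for r in rewards:
        baseline = sum(hist) / len(hist) if hist else r
        flags.append(r > baseline)   # reinforce if above the baseline
        hist.append(r)
    return flags

flags = verified_updates([1.0, 2.0, 3.0, 2.0, 4.0])
```

The dip at step four is flagged as non-improvement even though its absolute reward is decent, which is exactly the kind of drift an instantaneous, within-batch objective would never notice.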

[LG-17] Optimal Brain Decomposition for Accurate LLM Low-Rank Approximation

Link: https://arxiv.org/abs/2604.00821
Authors: Yuhang Li, Donghyun Lee, Ruokai Yin, Priyadarshini Panda
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Low-rank decomposition has emerged as an important problem in Large Language Model (LLM) fine-tuning and inference. Through Singular Value Decomposition (SVD), the weight matrix can be factorized into low-rank spaces optimally. A common prior practice is to decompose the weights in an activation-whitened space, which already achieves satisfying results. In this work, we propose Optimal Brain Decomposition LLM (OBD-LLM), which studies the decomposition problem in the model space by utilizing second-order Hessian information. Through a rigorous Kronecker factorization of the Hessian, we show that the decomposition needs to consider both the input and output information of the layer, and achieves much better decomposition results than input-only methods. Our loss-aware decomposition method involves a bi-directional whitening of the weight matrix. As a result, OBD-LLM is a closed-form solution for the optimal decomposition of weights in the language model. Remarkably, we achieve ~20-40% better results than the previous state-of-the-art decomposition method, SVD-LLM.
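
The bi-directional whitening idea can be sketched with diagonal curvature factors standing in for the Kronecker-factored Hessian terms (a toy stand-in, not the paper's construction): truncate the SVD in the whitened space, then map back. By the Eckart-Young theorem this minimizes the weighted error, so it can never do worse than plain SVD under that loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def biwhitened_lowrank(W, a, b, rank):
    """Rank-`rank` approximation of W computed in a bi-directionally
    whitened space: a, b are positive (here diagonal) input/output
    curvature weights. Minimizes ||diag(sqrt(a)) (W - Wh) diag(sqrt(b))||_F.
    """
    sa, sb = np.sqrt(a), np.sqrt(b)
    U, s, Vt = np.linalg.svd(sa[:, None] * W * sb[None, :])
    W_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return W_low / sa[:, None] / sb[None, :]   # un-whiten

n, r = 8, 4
W = rng.normal(size=(n, n))
a = 1.0 + rng.random(n)       # hypothetical input-side curvature weights
b = 1.0 + rng.random(n)       # hypothetical output-side curvature weights

W_ours = biwhitened_lowrank(W, a, b, r)
W_full = biwhitened_lowrank(W, a, b, n)   # full rank reconstructs W exactly

# Plain truncated SVD at the same rank, as the baseline
U, s, Vt = np.linalg.svd(W)
W_svd = (U[:, :r] * s[:r]) @ Vt[:r]

def werr(Wh):                  # the weighted loss both are judged by
    return np.linalg.norm(np.sqrt(a)[:, None] * (W - Wh) * np.sqrt(b)[None, :])

err_ours, err_svd = werr(W_ours), werr(W_svd)
```

Accounting for curvature on both sides of the weight is the "bi-directional" point of the abstract; the paper derives the actual factors from the layer's Hessian rather than assuming them.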

[LG-18] Cost-Penalized Fitness in FMA-Orchestrated Mixture of Experts: Experimental Evidence for Molecular Memory in Domain Adaptation

Link: https://arxiv.org/abs/2604.00812
Authors: Martin Jaraiz
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, draft

Abstract:We present experimental results from seven controlled runs of nanoFMT, a Free-Market Algorithm (FMA)-orchestrated transformer with dynamic Mixture-of-Experts (MoE) management. The experiments address a fundamental question for advanced LLM development: how should an MoE system manage its expert pool when operating at full capacity under changing data distributions? We demonstrate that cost-penalized fitness metrics, combined with a linear grace period for newborn experts, produce a system that accumulates domain expertise through diversification rather than replacement. The central result is a round-trip domain shift experiment showing 9-11x faster recovery when returning to a previously learned domain, with zero expert births or replacements required. This “molecular memory” effect – where dormant experts survive and reactivate when their domain returns – has no analogue in current MoE management approaches. A preliminary cost analysis estimates annual savings of $39.1M and a 27.1 GWh energy reduction for an OpenAI-scale provider under a moderate scenario.
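
The cost-penalized fitness with a linear grace period can be sketched directly; the ramp length, cost weight, and numbers below are illustrative inventions, not nanoFMT's actual settings.

```python
def fitness(perf, cost, age, grace=100, cost_weight=0.5):
    """Cost-penalized expert fitness with a linear grace period.

    The cost penalty phases in linearly over `grace` steps, so a
    newborn expert is not culled before it has had a chance to
    specialize; mature experts pay the full cost.
    """
    ramp = min(age / grace, 1.0)          # 0 -> 1 over the grace period
    return perf - cost_weight * ramp * cost

newborn = fitness(perf=0.4, cost=1.0, age=0)
adolescent = fitness(perf=0.4, cost=1.0, age=50)
mature = fitness(perf=0.4, cost=1.0, age=500)
```

A dormant expert whose domain has left the data stream keeps a non-negative fitness as long as its (small) cost penalty stays below its residual performance, which is the survival condition behind the "molecular memory" effect.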

[LG-19] MIRANDA: MId-feature RANk-adversarial Domain Adaptation toward climate change-robust ecological forecasting with deep learning CVPR

Link: https://arxiv.org/abs/2604.00800
Authors: Yuchang Jiang, Jan Dirk Wegner, Vivien Sainte Fare Garnot
Subjects: Machine Learning (cs.LG)
Comments: EarthVision CVPRW 2026

Abstract:Plant phenology modelling aims to predict the timing of seasonal phases, such as leaf-out or flowering, from meteorological time series. Reliable predictions are crucial for anticipating ecosystem responses to climate change. While phenology modelling has traditionally relied on mechanistic approaches, deep learning methods have recently been proposed as flexible, data-driven alternatives with often superior performance. However, mechanistic models tend to outperform deep networks when data distribution shifts are induced by climate change. Domain Adaptation (DA) techniques could help address this limitation. Yet, unlike standard DA settings, climate change induces a temporal continuum of domains and involves both a covariate and label shift, with warmer records and earlier start of spring. To tackle this challenge, we introduce Mid-feature Rank-adversarial Domain Adaptation (MIRANDA). Whereas conventional adversarial methods enforce domain invariance on final latent representations, an approach that does not explicitly address label shift, we apply adversarial regularization to intermediate features. Moreover, instead of a binary domain-classification objective, we employ a rank-based objective that enforces year-invariance in the learned meteorological representations. On a country-scale dataset spanning 70 years and comprising 67,800 phenological observations of 5 tree species, we demonstrate that, unlike conventional DA approaches, MIRANDA improves robustness to climatic distribution shifts and narrows the performance gap with mechanistic models.

[LG-20] ActivityNarrated: An Open-Ended Narrative Paradigm for Wearable Human Activity Understanding

链接: https://arxiv.org/abs/2604.00767
作者: Lala Shakti Swarup Ray,Mengxi Liu,Alcina Pinto,Deepika Gurung,Daniel Geissler,Paul Lukowicz,Bo Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wearable human activity recognition (HAR) has improved steadily, but most progress still relies on closed-set classification, which limits real-world use. In practice, human activity is open-ended, unscripted, personalized, and often compositional, unfolding as narratives rather than instances of fixed classes. We argue that addressing this gap does not require simply scaling datasets or models. It requires a fundamental shift in how wearable HAR is formulated, supervised, and evaluated. This work shows how to model open-ended activity narratives by aligning wearable sensor data with natural-language descriptions in an open-vocabulary setting. Our framework has three core components. First, we introduce a naturalistic data collection and annotation pipeline that combines multi-position wearable sensing with free-form, time-aligned narrative descriptions of ongoing behavior, allowing activity semantics to emerge without a predefined vocabulary. Second, we define a retrieval-based evaluation framework that measures semantic alignment between sensor data and language, enabling principled evaluation without fixed classes while also subsuming closed-set classification as a special case. Third, we present a language-conditioned learning architecture that supports sensor-to-text inference over variable-length sensor streams and heterogeneous sensor placements. Experiments show that models trained with fixed-label objectives degrade sharply under real-world variability, while open-vocabulary sensor-language alignment yields robust and semantically grounded representations. Once this alignment is learned, closed-set activity recognition becomes a simple downstream task. Under cross-participant evaluation, our method achieves 65.3% Macro-F1, compared with 31-34% for strong closed-set HAR baselines. These results establish open-ended narrative modeling as a practical and effective foundation for real-world wearable HAR.

[LG-21] Exploring Silent Data Corruption as a Reliability Challenge in LLM Training

链接: https://arxiv.org/abs/2604.00726
作者: Anton Altenbernd,Philipp Wiesner,Odej Kao
类目: Machine Learning (cs.LG)
*备注: 10 Pages, 4 Figures, CCGrid 2026

点击查看摘要

Abstract:As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M, and 1.3B parameters demonstrate that recomputing the most recent training step upon detection can effectively mitigate the impact of these events.
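A toy sketch of the failure mode and the detection idea: flip one exponent bit in a gradient (simulating SDC in a matmul output) and flag the step when the update's norm spikes. The norm-threshold detector and all constants here are simplifications assumed for illustration, not the paper's exact corruption signatures:

```python
import numpy as np

def flip_bit(arr, idx, bit):
    """Flip one bit of a float32 array element by reinterpreting its bytes
    as uint32, mimicking a silent hardware fault in a compute result."""
    u = arr.view(np.uint32)
    u[idx] ^= np.uint32(1 << bit)

# Toy "training step": gradients are usually O(1); an exponent-bit flip
# inflates one entry by tens of orders of magnitude.
grad = np.full(8, 0.1, dtype=np.float32)
flip_bit(grad, idx=3, bit=30)  # top exponent bit -> huge magnitude

def update_is_suspect(grad, threshold=10.0):
    """Assumed detector: flag a step whose gradient norm spikes past a bound.
    On detection, the paper's mitigation recomputes the most recent training
    step instead of applying the (possibly corrupted) update."""
    return float(np.linalg.norm(grad)) > threshold

suspect = update_is_suspect(grad)
```

The same bit flipped in the mantissa would barely move the norm, which matches the abstract's observation that sensitivity depends strongly on bit position.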

[LG-22] Performance of Neural and Polynomial Operator Surrogates

链接: https://arxiv.org/abs/2604.00689
作者: Josephine Westermann,Benno Huber,Thomas O’Leary-Roseberry,Jakob Zech
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 44 pages, 21 figures

点击查看摘要

Abstract:We consider the problem of constructing surrogate operators for parameter-to-solution maps arising from parametric partial differential equations, where repeated forward model evaluations are computationally expensive. We present a systematic empirical comparison of neural operator surrogates, including a reduced-basis neural operator trained with L^2_\mu and H^1_\mu objectives and the Fourier neural operator, against polynomial surrogate methods, specifically a reduced-basis sparse-grid surrogate and a reduced-basis tensor-train surrogate. All methods are evaluated on a linear parametric diffusion problem and a nonlinear parametric hyperelasticity problem, using input fields with algebraically decaying spectral coefficients at varying rates of decay s . To enable fair comparisons, we analyze ensembles of surrogate models generated by varying hyperparameters and compare the resulting Pareto frontiers of cost versus approximation accuracy, decomposing cost into contributions from data generation, setup, and evaluation. Our results show that no single method is universally superior. Polynomial surrogates achieve substantially better data efficiency for smooth input fields ( s \geq 2 ), with convergence rates for the sparse-grid surrogate in agreement with theoretical predictions. For rough inputs ( s \leq 1 ), the Fourier neural operator displays the fastest convergence rates. Derivative-informed training consistently improves data efficiency over standard L^2_\mu training, providing a competitive alternative for rough inputs in the low-data regime when Jacobian information is available at reasonable cost. These findings highlight the importance of matching the surrogate methodology to the regularity of the problem as well as accuracy demands and computational constraints of the application.

[LG-23] Full-Gradient Successor Feature Representations

链接: https://arxiv.org/abs/2604.00686
作者: Ritish Shrirao,Aditya Priyadarshi,Raghuram Bharadwaj Diddigi
类目: Machine Learning (cs.LG)
*备注: Submitted to IEEE CDC 2026

点击查看摘要

Abstract:Successor Features (SF) combined with Generalized Policy Improvement (GPI) provide a robust framework for transfer learning in Reinforcement Learning (RL) by decoupling environment dynamics from reward functions. However, standard SF learning methods typically rely on semi-gradient Temporal Difference (TD) updates. When combined with non-linear function approximation, semi-gradient methods lack robust convergence guarantees and can lead to instability, particularly in the multi-task setting where accurate feature estimation is critical for effective GPI. Inspired by Full Gradient DQN, we propose Full-Gradient Successor Feature Representations Q-Learning (FG-SFRQL), an algorithm that optimizes the successor features by minimizing the full Mean Squared Bellman Error. Unlike standard approaches, our method computes gradients with respect to parameters in both the online and target networks. We provide a theoretical proof of almost-sure convergence for FG-SFRQL and demonstrate empirically that minimizing the full residual leads to superior sample efficiency and transfer performance compared to semi-gradient baselines in both discrete and continuous domains.
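The semi- vs. full-gradient distinction can be shown on a single transition with linear value approximation. This is a toy stand-in: FG-SFRQL itself applies the full-gradient update to successor features with deep networks, not to a scalar value function:

```python
import numpy as np

# One transition (s, r, s') under a linear value function V(s) = w @ phi(s).
phi_s  = np.array([1.0, 0.0])
phi_s2 = np.array([0.0, 1.0])
r, gamma, alpha = 1.0, 0.9, 0.1
w = np.zeros(2)

delta = r + gamma * w @ phi_s2 - w @ phi_s   # TD error

# Semi-gradient TD: treat the bootstrapped target as a constant,
# so the update direction is just phi(s).
w_semi = w + alpha * delta * phi_s

# Full-gradient (residual) update: differentiate the squared Bellman error
# through BOTH the prediction and the bootstrapped target, giving the
# direction (phi(s) - gamma * phi(s')).
w_full = w + alpha * delta * (phi_s - gamma * phi_s2)
```

The extra `-gamma * phi_s2` term is what makes the full update a true gradient of the Mean Squared Bellman Error, which is the property behind the paper's convergence argument.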

[LG-24] Embedded Variational Neural Stochastic Differential Equations for Learning Heterogeneous Dynamics

链接: https://arxiv.org/abs/2604.00669
作者: Sandeep Kumar Samota,Reema Gupta,Snehashish Chakraverty
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:This study examines the challenges of modeling complex and noisy data related to socioeconomic factors over time, with a focus on data from various districts in Odisha, India. Traditional time-series models struggle to capture both trends and variations together in this type of data. To tackle this, a Variational Neural Stochastic Differential Equation (V-NSDE) model is designed that combines the expressive dynamics of Neural SDEs with the generative capabilities of Variational Autoencoders (VAEs). The model uses an encoder and a decoder. The encoder takes the initial observations and district embeddings and translates them into a Gaussian distribution, which determines the mean and log-variance of the first latent state. The resulting latent state then initializes the Neural SDE, which uses neural networks to parameterize the drift and diffusion functions that govern continuous-time latent dynamics. These governing functions depend on the time index, latent state, and district embedding, helping the model learn the characteristics specific to each district. A probabilistic decoder then reconstructs the observations from the latent trajectory, outputting a mean and log-variance for each time step under a Gaussian likelihood. The Evidence Lower Bound (ELBO) training loss is obtained by adding a KL-divergence regularization term to the negative log-likelihood (NLL). The results demonstrate that V-NSDE effectively learns complex temporal patterns, yielding realistic outcomes that capture both clear trends and random fluctuations across different areas.
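For reference, the KL regularizer added to the NLL in the ELBO has the usual closed form for a diagonal-Gaussian encoder against a standard-normal prior. This is a generic VAE identity, not code from the paper:

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    This is the standard VAE regularizer added to the reconstruction NLL
    to form the ELBO training loss."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

kl_zero = kl_diag_gaussian(np.zeros(3), np.zeros(3))  # posterior == prior
kl_pos = kl_diag_gaussian(np.ones(2), np.zeros(2))    # shifted means
```

The term vanishes exactly when the encoder's posterior matches the prior, and grows as the mean or variance drifts away, which is what regularizes the latent state that initializes the Neural SDE.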

[LG-25] Chameleons do not Forget: Prompt-Based Online Continual Learning for Next Activity Prediction

链接: https://arxiv.org/abs/2604.00653
作者: Marwan Hassani,Tamara Verbeek,Sjoerd van Straten
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted for publication in the International Journal of Cooperative Information Systems

点击查看摘要

Abstract:Predictive process monitoring (PPM) focuses on predicting future process trajectories, including next activity predictions. This is crucial in dynamic environments where processes change or face uncertainty. However, current frameworks often assume a static environment, overlooking dynamic characteristics and concept drifts. This results in catastrophic forgetting, where training that focuses merely on the new data distribution degrades performance on previously learned distributions. Continual learning addresses, among other challenges, the mitigation of catastrophic forgetting. This paper proposes a novel approach called Continual Next Activity Prediction with Prompts (CNAPwP), which adapts the DualPrompt algorithm for next activity prediction to improve accuracy and adaptability while mitigating catastrophic forgetting. We introduce new datasets with recurring concept drifts, alongside a task-specific forgetting metric that measures the prediction accuracy gap between a task's initial occurrence and its subsequent occurrences. Extensive testing on three synthetic and two real-world datasets representing several setups of recurrent drifts shows that CNAPwP achieves state-of-the-art or competitive results compared to five baselines, demonstrating its potential applicability in real-world scenarios. An open-source implementation of our method, together with the datasets and results, is available at: this https URL.
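The task-specific forgetting metric can be sketched as below. Taking first-occurrence accuracy minus the mean accuracy over later recurrences is an assumption for illustration; the paper's exact aggregation may differ:

```python
def forgetting_gap(acc_by_occurrence):
    """Assumed form of the task-specific forgetting metric: accuracy on a
    task's first occurrence minus its mean accuracy when the same concept
    recurs later in the stream. Positive values indicate forgetting."""
    first, later = acc_by_occurrence[0], acc_by_occurrence[1:]
    return first - sum(later) / len(later)

# A task seen three times in a stream with recurring drift.
gap = forgetting_gap([0.90, 0.84, 0.86])
```

A method that fully mitigates forgetting would keep the gap near zero (or negative, if recurrences benefit from accumulated experience).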

[LG-26] On rankings in multiplayer games with an application to the game of Whist

链接: https://arxiv.org/abs/2604.00641
作者: Alexis Coyette,Charles Modera,Candy Sonveaux,Judicaël Mohet,François-Grégoire Bierwart,Sylverio Pool Marquez,Jarod Ketcha Kouakep,Cédric Simal,Komlan Fiagbe,Violaine Piengeon,Martin Moriamé,Justine Bodart,Marie Dorchain,Maxime Lucas,Rommel Tchinda Djeudjo,Gianluca Peri,Eve Tilman
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Author order determined by the proposed ranking method

点击查看摘要

Abstract:We propose a novel extension of the Bradley-Terry model to multiplayer games and adapt a recent algorithm by Newman [1] to our model. We demonstrate the proposed method on synthetic datasets and on a real dataset of card games.
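For the pairwise special case, Newman's iterative maximum-likelihood update for Bradley-Terry strengths looks like this. The paper's contribution is the multiplayer extension, which is not reproduced here:

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths pi from a pairwise win-count matrix with
    Newman's iterative MLE update (pairwise case only). wins[i, j] is the
    number of times player i beat player j."""
    n = wins.shape[0]
    pi = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = den = 0.0
            for j in range(n):
                if i == j:
                    continue
                s = pi[i] + pi[j]
                num += wins[i, j] * pi[j] / s
                den += wins[j, i] / s
            pi[i] = num / den
        pi /= pi.sum()  # strengths are only defined up to scale
    return pi

# Three players: 0 dominates 1, and 1 dominates 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
strengths = bradley_terry(wins)
```

At a fixed point, the update satisfies the Bradley-Terry likelihood equations; each player needs at least one win and one loss for the iteration to stay well defined.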

[LG-27] Predicting Dynamics of Ultra-Large Complex Systems by Inferring Governing Equations

链接: https://arxiv.org/abs/2604.00599
作者: Qi Shao,Duxin Chen,Jiawen Chen,Yujie Zeng,Athen Ma,Wenwu Yu,Vito Latora,Wei Lin
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, under review

点击查看摘要

Abstract:Predicting the behavior of ultra-large complex systems, from climate to biological and technological networks, is a central unsolved challenge. Existing approaches face a fundamental trade-off: equation discovery methods provide interpretability but fail to scale, while neural networks scale but operate as black boxes and often lose reliability over long times. Here, we introduce the Sparse Identification Graph Neural Network (SIGN), a framework that overcomes this divide by inferring the governing equations of large networked systems from data. By defining symbolic discovery as edge-level information, SIGN decouples the scalability of sparse identification from network size, enabling efficient equation discovery even in large systems. SIGN scales to networks with over 100,000 nodes while remaining robust to noise, sparse sampling, and missing data. Across diverse benchmark systems, including coupled chaotic oscillators, neural dynamics, and epidemic spreading, it recovers governing equations with high precision and sustains accurate long-term predictions. Applied to a dataset of temperature time series at 71,987 sea-surface positions, SIGN identifies a compact predictive network model and captures large-scale sea surface temperature conditions up to two years in advance. By enabling equation discovery at previously inaccessible scales, SIGN opens a path toward interpretable and reliable prediction of real-world complex systems.

[LG-28] Representation choice shapes the interpretation of protein conformational dynamics

链接: https://arxiv.org/abs/2604.00580
作者: Axel Giottonini,Thomas Lemmin
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Molecular dynamics simulations provide detailed trajectories at the atomic level, but extracting interpretable and robust insights from these high-dimensional data remains challenging. In practice, analyses typically rely on a single representation. Here, we show that representation choice is not neutral: it fundamentally shapes the conformational organization, similarity relationships, and apparent transitions inferred from identical simulation data. To complement existing representations, we introduce Orientation features, a geometrically grounded, rotation-aware encoding of protein backbone. We compare it against common descriptions across three dynamical regimes: fast-folding proteins, large-scale domain motions, and protein-protein association. Across these systems, we find that different representations emphasize complementary aspects of conformational space, and that no single representation provides a complete picture of the underlying dynamics. To facilitate systematic comparison, we developed ManiProt, a library for efficient computation and analysis of multiple protein representations. Our results motivate a comparative, representation-aware framework for the interpretation of molecular dynamics simulations.

[LG-29] Learning from Many and Adapting to the Unknown in Open-set Test Streams

链接: https://arxiv.org/abs/2604.00533
作者: Xiao Zhang,Juntao Lyu,Tianyu Hu,Qianchuan Zhao,Huimin Ma
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) generalize across tasks via reusable representations and flexible reasoning, yet remain brittle in real deployment under evolving tasks and continual distribution shift. A common approach is Test-Time Adaptation (TTA); existing methods update models with hand-designed unsupervised objectives over the full parameter space and mostly overlook preserving shared source knowledge and ensuring the reliability of adaptation signals. Drawing on molecular signaling cascades of memory updating in Drosophila, we propose Synapse Consolidation (SyCo), a parameter-efficient LLM adaptation method that updates low-rank adapters through Rac1 and MAPK pathways under the guidance of a structured TTA objective driven by problem understanding, process understanding, and a source-domain guardrail. Rac1 confines plasticity to a tail-gradient subspace that is less critical for source knowledge, enabling rapid specialization while preserving source representations. MAPK uses a tiered controller to suppress noisy updates and consolidate useful adaptations under non-stationary streams. To model real deployments with multiple sources and continually emerging tasks, we introduce the Multi-source Open-set Adaptation (MOA) setting, where a model is trained on multiple labeled source tasks and then adapts on open, non-stationary unlabeled test streams that mix seen and unseen tasks with partial overlap in label and intent space. Across 18 NLP datasets and the MOA setting, SyCo consistently outperforms strong baselines, achieving 78.31% on unseen-task adaptation and 85.37% on unseen-data shifts.

[LG-30] Learning Shared Representations for Multi-Task Linear Bandits

链接: https://arxiv.org/abs/2604.00531
作者: Jiabin Lin,Shana Moothedath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-task representation learning is an approach that learns shared latent representations across related tasks, facilitating knowledge transfer and improving sample efficiency. This paper introduces a novel approach to multi-task representation learning in linear bandits. We consider a setting with T concurrent linear bandit tasks, each with feature dimension d, that share a common latent representation of dimension r \ll \min(d, T), capturing their underlying relatedness. We propose a new Optimism in the Face of Uncertainty Linear (OFUL) algorithm that leverages shared low-rank representations to enhance decision-making in a sample-efficient manner. Our algorithm first collects data through an exploration phase, estimates the shared model via spectral initialization, and then conducts OFUL-based learning over a newly constructed confidence set. We provide theoretical guarantees for the confidence set and prove that the unknown reward vectors lie within the confidence set with high probability. We derive cumulative regret bounds and show that the proposed approach achieves \tilde{O}(\sqrt{drNT}), a significant improvement over solving the T tasks independently, which results in a regret of \tilde{O}(dT\sqrt{N}). We performed numerical simulations to validate the performance of our algorithm for different problem sizes.

[LG-31] A Decoupled Basis-Vector-Driven Generative Framework for Dynamic Multi-Objective Optimization

链接: https://arxiv.org/abs/2604.00508
作者: Yaoming Yang,Shuai Wang,Bingdong Li,Peng Yang,Ke Tang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamic multi-objective optimization requires continuous tracking of moving Pareto fronts. Existing methods struggle with irregular mutations and data sparsity, primarily facing three challenges: the non-linear coupling of dynamic modes, negative transfer from outdated historical data, and the cold-start problem during environmental switches. To address these issues, this paper proposes a decoupled basis-vector-driven generative framework (DB-GEN). First, to resolve non-linear coupling, the framework employs the discrete wavelet transform to separate evolutionary trajectories into low-frequency trends and high-frequency details. Second, to mitigate negative transfer, it learns transferable basis vectors via sparse dictionary learning rather than directly memorizing historical instances. Recomposing these bases under a topology-aware contrastive constraint constructs a structured latent manifold. Finally, to overcome the cold-start problem, a surrogate-assisted search paradigm samples initial populations from this manifold. Pre-trained on 120 million solutions, DB-GEN performs direct online inference without retraining or fine-tuning. This zero-shot generation process executes in milliseconds, requiring approximately 0.2 seconds per environmental change. Experimental results demonstrate that DB-GEN improves tracking accuracy across various dynamic benchmarks compared to existing algorithms.

[LG-32] Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

链接: https://arxiv.org/abs/2604.00499
作者: Haoyu Zheng,Yongqiang Zhang,Fangcheng Fu,Xiaokai Zhou,Hao Luo,Hongchao Zhu,Yuanyuan Zhu,Hao Wang,Xiao Yan,Jiawei Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To schedule LLM inference, the *shortest job first* (SJF) principle is favorable because it prioritizes requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a *point estimate* does not match the *stochastic decoding process* of LLM inference, where output length is *uncertain* by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces per-token latency by 2.31× for online inference and improves throughput by 1.42× for offline data generation.
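A Monte-Carlo sketch of a tail-inflated expectation over a fitted log-t length distribution. The inflation rule below (mean plus tail-probability-weighted excess) and all parameters are illustrative assumptions; the paper's exact TIE formula may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def tie_score(mu, sigma, df, tail_q=0.95, n=20000):
    """Hypothetical Monte-Carlo Tail Inflated Expectation: the sample mean
    of a log-t output-length distribution, inflated by the conditional
    excess in its upper tail. Note that the log-t distribution is so
    heavy-tailed that its population mean diverges, so the sample mean
    itself is dominated by rare long outputs."""
    lengths = np.exp(mu + sigma * rng.standard_t(df, size=n))
    cutoff = np.quantile(lengths, tail_q)
    tail = lengths[lengths > cutoff]
    return lengths.mean() + (1 - tail_q) * (tail.mean() - lengths.mean())

# A heavier tail (smaller df) inflates the score even with equal mu, sigma,
# so SJF scheduling on TIE deprioritizes requests that risk long outputs.
score_heavy = tie_score(mu=4.0, sigma=0.5, df=3)
score_light = tie_score(mu=4.0, sigma=0.5, df=50)
```

Ranking requests by such a score, rather than by a point estimate, is the abstract's core scheduling idea.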

[LG-33] The Rashomon Effect for Visualizing High-Dimensional Data AISTATS2026

链接: https://arxiv.org/abs/2604.00485
作者: Yiyang Sun,Haiyang Huang,Gaurav Rajesh Parikh,Cynthia Rudin
类目: Machine Learning (cs.LG)
*备注: The paper is accepted in AISTATS 2026

点击查看摘要

Abstract:Dimension reduction (DR) is inherently non-unique: multiple embeddings can preserve the structure of high-dimensional data equally well while differing in layout or geometry. In this paper, we formally define the Rashomon set for DR – the collection of 'good' embeddings – and show how embracing this multiplicity leads to more powerful and trustworthy representations. Specifically, we pursue three goals. First, we introduce PCA-informed alignment to steer embeddings toward principal components, making axes interpretable without distorting local neighborhoods. Second, we design concept-alignment regularization that aligns an embedding dimension with external knowledge, such as class labels or user-defined concepts. Third, we propose a method to extract common knowledge across the Rashomon set by identifying trustworthy and persistent nearest-neighbor relationships, which we use to construct refined embeddings with improved local structure while preserving global relationships. By moving beyond a single embedding and leveraging the Rashomon set, we provide a flexible framework for building interpretable, robust, and goal-aligned visualizations.
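The 'trustworthy and persistent nearest-neighbor' idea can be sketched by keeping only neighbor pairs that appear in every embedding of the Rashomon set. Requiring agreement across all members is an assumption; the paper may use a softer persistence criterion:

```python
import numpy as np

def knn_sets(emb, k):
    """k-nearest-neighbor index sets for each point of one embedding."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-neighbors
    return [set(np.argsort(row)[:k]) for row in d]

def persistent_neighbors(embeddings, k=2):
    """Ordered pairs (i, j) such that j is among i's k nearest neighbors in
    EVERY embedding of the set; these survive the Rashomon multiplicity."""
    per_emb = [knn_sets(e, k) for e in embeddings]
    n = embeddings[0].shape[0]
    return {(i, j) for i in range(n) for j in per_emb[0][i]
            if all(j in sets[i] for sets in per_emb)}

rng = np.random.default_rng(1)
base = rng.normal(size=(6, 2))
# Two equally 'good' embeddings of the same data: a rotation of the plane
# preserves all distances, hence all neighbor relations persist.
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
pairs = persistent_neighbors([base, base @ rot], k=2)
```

For genuinely different embeddings (e.g. independent t-SNE and UMAP runs), the persistent set would shrink to the neighbor relations all of them agree on, which is the paper's notion of common knowledge.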

[LG-34] Phase space integrity in neural network models of Hamiltonian dynamics: A Lagrangian descriptor approach

链接: https://arxiv.org/abs/2604.00473
作者: Abrari Noor Hasmi,Haralampos Hatzikirou,Hadi Susanto
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 40 pages, 22 figures

点击查看摘要

Abstract:We propose Lagrangian Descriptors (LDs) as a diagnostic framework for evaluating neural network models of Hamiltonian systems beyond conventional trajectory-based metrics. Standard error measures quantify short-term predictive accuracy but provide little insight into global geometric structures such as orbits and separatrices. Existing evaluation tools in dissipative systems are inadequate for Hamiltonian dynamics due to fundamental differences in the systems. By constructing probability density functions weighted by LD values, we embed geometric information into a statistical framework suitable for information-theoretic comparison. We benchmark physically constrained architectures (SympNet, HénonNet, Generalized Hamiltonian Neural Networks) against data-driven Reservoir Computing across two canonical systems. For the Duffing oscillator, all models recover the homoclinic orbit geometry with modest data requirements, though their accuracy near critical structures varies. For the three-mode nonlinear Schrödinger equation, however, clear differences emerge: symplectic architectures preserve energy but distort phase-space topology, while Reservoir Computing, despite lacking explicit physical constraints, reproduces the homoclinic structure with high fidelity. These results demonstrate the value of LD-based diagnostics for assessing not only predictive performance but also the global dynamical integrity of learned Hamiltonian models.
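The arc-length form of a Lagrangian descriptor for the (undamped, unforced) double-well Duffing system can be computed directly. The RK4 integrator, the symmetric time window, and the specific initial conditions below are illustrative choices rather than the paper's exact setup:

```python
import numpy as np

def duffing(state):
    """Undamped double-well Duffing vector field: x' = p, p' = x - x^3."""
    x, p = state
    return np.array([p, x - x**3])

def lagrangian_descriptor(x0, p0, tau=5.0, dt=0.01):
    """Arc-length Lagrangian descriptor: total trajectory length accumulated
    over [-tau, tau], integrated forward and backward from (x0, p0) with a
    classic RK4 step. ||step|| approximates speed * dt at each step."""
    total = 0.0
    for sign in (+1.0, -1.0):  # forward, then backward in time
        s = np.array([x0, p0])
        h = sign * dt
        for _ in range(int(tau / dt)):
            k1 = duffing(s)
            k2 = duffing(s + 0.5 * h * k1)
            k3 = duffing(s + 0.5 * h * k2)
            k4 = duffing(s + h * k3)
            step = (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
            total += np.linalg.norm(step)
            s = s + step
    return total

# LD values change sharply across the separatrix (energy 0 through the
# origin): an orbit trapped in one well accumulates far less arc length
# than a fast large-amplitude orbit outside it.
ld_inner = lagrangian_descriptor(0.5, 0.0)  # inside the right well
ld_outer = lagrangian_descriptor(2.0, 0.0)  # outside the separatrix
```

Scanning a grid of initial conditions and plotting the LD values reveals the homoclinic structure that the paper uses to diagnose learned Hamiltonian models.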

[LG-35] Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

链接: https://arxiv.org/abs/2604.00422
作者: Xinyu Sun,Wanwei Liu,Haoang Chi,Tingyu Chen,Xiaoguang Mao,Shangwen Wang,Lei Bu,Jingyi Wang,Yang Tan,Zhenyi Qi
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:DNNs are susceptible to defects like backdoors, adversarial attacks, and unfairness, undermining their reliability. Existing approaches mainly involve retraining, optimization, constraint-solving, or search algorithms. However, most methods rely on gradient calculations, restricting applicability to specific activation functions (e.g., ReLU), or use search algorithms with uninterpretable localization and repair. Furthermore, they often lack generalizability across multiple properties. We propose SHARPEN, integrating interpretable fault localization with a derivative-free optimization strategy. First, SHARPEN introduces a Deep SHAP-based localization strategy quantifying each layer’s and neuron’s marginal contribution to erroneous outputs. Specifically, a hierarchical coarse-to-fine approach reranks layers by aggregated impact, then locates faulty neurons/filters by analyzing activation divergences between property-violating and benign states. Subsequently, SHARPEN incorporates CMA-ES to repair identified neurons. CMA-ES leverages a covariance matrix to capture variable dependencies, enabling gradient-free search and coordinated adjustments across coupled neurons. By combining interpretable localization with evolutionary optimization, SHARPEN enables derivative-free repair across architectures, being less sensitive to gradient anomalies and hyperparameters. We demonstrate SHARPEN’s effectiveness on three repair tasks. Balancing property repair and accuracy preservation, it outperforms baselines in backdoor removal (+10.56%), adversarial mitigation (+5.78%), and unfairness repair (+11.82%). Notably, SHARPEN handles diverse tasks, and its modular design is plug-and-play with different derivative-free optimizers, highlighting its flexibility.

[LG-36] A Cross-graph Tuning-free GNN Prompting Framework

链接: https://arxiv.org/abs/2604.00399
作者: Yaqi Chen,Shixun Huang,Ryan Twemlow,Lei Wang,John Le,Sheng Wang,Willy Susilo,Jun Yan,Jun Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:GNN prompting aims to adapt models across tasks and graphs without requiring extensive retraining. However, most existing graph prompt methods still require task-specific parameter updates and face the issue of generalizing across graphs, limiting their performance and undermining the core promise of prompting. In this work, we introduce a Cross-graph Tuning-free Prompting Framework (CTP), which supports both homogeneous and heterogeneous graphs, can be directly deployed to unseen graphs without further parameter tuning, and thus enables a plug-and-play GNN inference engine. Extensive experiments on few-shot prediction tasks show that, compared to SOTAs, CTP achieves an average accuracy gain of 30.8% and a maximum gain of 54%, confirming its effectiveness and offering a new perspective on graph prompt learning.

[LG-37] Gradient-Based Data Valuation Improves Curriculum Learning for Game-Theoretic Motion Planning

链接: https://arxiv.org/abs/2604.00388
作者: Shihao Li,Jiachen Li,Dongmei Chen
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We demonstrate that gradient-based data valuation produces curriculum orderings that significantly outperform metadata-based heuristics for training game-theoretic motion planners. Specifically, we apply TracIn gradient-similarity scoring to GameFormer on the nuPlan benchmark and construct a curriculum that weights training scenarios by their estimated contribution to validation loss reduction. Across three random seeds, the TracIn-weighted curriculum achieves a mean planning ADE of 1.704 ± 0.029 m, significantly outperforming the metadata-based interaction-difficulty curriculum (1.822 ± 0.014 m; paired t-test p = 0.021, Cohen's d_z = 3.88) while exhibiting lower variance than the uniform baseline (1.772 ± 0.134 m). Our analysis reveals that TracIn scores and scenario metadata are nearly orthogonal (Spearman ρ = -0.014), indicating that gradient-based valuation captures training dynamics invisible to hand-crafted features. We further show that gradient-based curriculum weighting succeeds where hard data selection fails: TracIn-curated 20% subsets degrade performance by 2×, whereas full-data curriculum weighting with the same scores yields the best results. These findings establish gradient-based data valuation as a practical tool for improving sample efficiency in game-theoretic planning.
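The core TracIn computation reduces to learning-rate-weighted gradient dot products across saved checkpoints, shown here on toy vectors. In the paper the gradients come from GameFormer's loss on nuPlan training scenarios and the validation set:

```python
import numpy as np

def tracin_score(train_grads, val_grads, lrs):
    """TracIn influence of one training scenario on validation loss: the sum
    over checkpoints of lr_t * <grad_train(t), grad_val(t)>. Positive scores
    mean the scenario's updates tended to reduce validation loss, so the
    curriculum upweights such scenarios."""
    return sum(lr * float(g_tr @ g_val)
               for lr, g_tr, g_val in zip(lrs, train_grads, val_grads))

# Toy: gradients at two checkpoints for one scenario and the validation set.
g_train = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
g_val   = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]
score = tracin_score(g_train, g_val, lrs=[0.1, 0.1])
```

Because the score depends only on gradient alignment, it can be (and here is) orthogonal to any metadata-based difficulty label, matching the near-zero Spearman correlation the abstract reports.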

[LG-38] GUIDE: Reinforcement Learning for Behavioral Action Support in Type 1 Diabetes

链接: https://arxiv.org/abs/2604.00385
作者: Saman Khamesian,Sri Harini Balaji,Di Yang Shi,Stephanie M. Carpenter,Daniel E. Rivera,W. Bradley Knox,Peter Stone,Hassan Ghasemzadeh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Type 1 Diabetes (T1D) management requires continuous adjustment of insulin and lifestyle behaviors to maintain blood glucose within a safe target range. Although automated insulin delivery (AID) systems have improved glycemic outcomes, many patients still fail to achieve recommended clinical targets, warranting new approaches to improve glucose control in patients with T1D. While reinforcement learning (RL) has been utilized as a promising approach, current RL-based methods focus primarily on insulin-only treatment and do not provide behavioral recommendations for glucose control. To address this gap, we propose GUIDE, an RL-based decision-support framework designed to complement AID technologies by providing behavioral recommendations to prevent abnormal glucose events. GUIDE generates structured actions defined by intervention type, magnitude, and timing, including bolus insulin administration and carbohydrate intake events. GUIDE integrates a patient-specific glucose level predictor trained on real-world continuous glucose monitoring data and supports both offline and online RL algorithms within a unified environment. We evaluate both off-policy and on-policy methods across 25 individuals with T1D using standardized glycemic metrics. Among the evaluated approaches, the CQL-BC algorithm demonstrates the highest average time-in-range, reaching 85.49% while maintaining low hypoglycemia exposures. Behavioral similarity analysis further indicates that the learned CQL-BC policy preserves key structural characteristics of patient action patterns, achieving a mean cosine similarity of 0.87 \pm 0.09 across subjects. These findings suggest that conservative offline RL with a structured behavioral action space can provide clinically meaningful and behaviorally plausible decision support for personalized diabetes management.

[LG-39] Deep Learning-Accelerated Surrogate Optimization for High-Dimensional Well Control in Stress-Sensitive Reservoirs

链接: https://arxiv.org/abs/2604.00352
作者: Mahammad Valiyev,Jodel Cornelio,Behnam Jafarpour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Production optimization in stress-sensitive unconventional reservoirs is governed by a nonlinear trade-off between pressure-driven flow and stress-induced degradation of fracture conductivity and matrix permeability. While higher drawdown improves short-term production, it accelerates permeability loss and reduces long-term recovery. Identifying optimal, time-varying control strategies requires repeated evaluations of fully coupled flow-geomechanics simulators, making conventional optimization computationally expensive. We propose a deep learning-based surrogate optimization framework for high-dimensional well control. Unlike prior approaches that rely on predefined control parameterizations or generic sampling, our method treats well control as a continuous, high-dimensional problem and introduces a problem-informed sampling strategy that aligns training data with trajectories encountered during optimization. A neural network proxy is trained to approximate the mapping between bottomhole pressure trajectories and cumulative production using data from a coupled flow-geomechanics model. The proxy is embedded within a constrained optimization workflow, enabling rapid evaluation of control strategies. Across multiple initializations, the surrogate achieves agreement with full-physics solutions within 2-5 percent, while reducing computational cost by up to three orders of magnitude. Discrepancies are mainly associated with trajectories near the boundary of the training distribution and local optimization effects. This framework shows that combining surrogate modeling with problem-informed sampling enables scalable and reliable optimization for high-dimensional, simulator-based problems, with broader applicability to PDE-constrained systems. 

[LG-40] Is One Token All It Takes? Graph Pooling Tokens for LLM-based GraphQA LREC

链接: https://arxiv.org/abs/2604.00342
作者: Ankit Grover,Lodovico Giaretta,Rémi Bourgerie,Sarunas Girdzijauskas
类目: Machine Learning (cs.LG)
*备注: Accepted at LREC, KG-LLM Workshop 2026

点击查看摘要

Abstract:The integration of Graph Neural Networks (GNNs) with Large Language Models (LLMs) has emerged as a promising paradigm for Graph Question Answering (GraphQA). However, effective methods for encoding complex structural information into the LLM’s latent space remain an open challenge. Current state-of-the-art architectures, such as G-Retriever, typically rely on standard GNNs and aggressive mean pooling to compress entire graph substructures into a single token, creating a severe information bottleneck. This work mitigates this bottleneck by investigating two orthogonal strategies: (1) increasing the bandwidth of the graph-to-LLM interface via multi-token pooling, and (2) enhancing the semantic quality of the graph encoder via global attention mechanisms. We evaluate a suite of hierarchical pruning and clustering-based pooling operators including Top-k, SAGPool, DiffPool, MinCutPool, and Virtual Node Pooling (VNPool) to project graph data into multiple learnable tokens. Empirically, we demonstrate that while pooling introduces significant instability during soft prompt tuning, the application of Low-Rank Adaptation (LoRA) effectively stabilizes specific hierarchical projections (notably VNPool and pruning methods), though dense clustering operators remain challenging. This stabilization allows compressed representations to rival full-graph baselines (achieving ~73% Hit@1 on WebQSP). Conceptually, we demonstrate that a Graph Transformer with VNPool implementation functions structurally as a single-layer Perceiver IO encoder. Finally, we adapt the FandE (Features and Edges) Score to the generative GraphQA domain. Our analysis reveals that the GraphQA benchmark suffers from representational saturation, where target answers are often highly correlated with isolated node features. The implementation is available at this https URL
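The bandwidth argument above is concrete: mean pooling hands the LLM a single token per subgraph, while a pruning-style operator such as Top-k keeps k gated node features as k tokens. A toy numpy sketch, where the scoring vector stands in for a trained projection (not the paper's exact operators):

```python
import numpy as np

def mean_pool(node_feats):
    """Baseline G-Retriever-style interface: the whole subgraph is
    compressed into a single token (the information bottleneck)."""
    return node_feats.mean(axis=0, keepdims=True)            # (1, d)

def topk_pool(node_feats, score_weights, k):
    """Top-k pruning pooling: score nodes with a projection vector and
    keep the k highest-scoring node features, gated by their score,
    as k graph tokens. `score_weights` stands in for a trained parameter."""
    scores = node_feats @ score_weights                      # (n,)
    idx = np.argsort(scores)[::-1][:k]
    return node_feats[idx] * np.tanh(scores[idx])[:, None]   # (k, d)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))       # 10 nodes, 8-dim GNN features
w = rng.normal(size=8)
single = mean_pool(x)              # graph-to-LLM bandwidth: 1 token
multi = topk_pool(x, w, k=4)       # graph-to-LLM bandwidth: 4 tokens
```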

[LG-41] When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction

链接: https://arxiv.org/abs/2604.00339
作者: Yagiz Ihlamur
类目: Machine Learning (cs.LG)
*备注: 4 pages, 4 tables. Accepted at SecureFinAI Contest @ IEEE IDS 2026. Code: this https URL

点击查看摘要

Abstract:Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields – jobs, education, exits – and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 – a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly – it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic – one that points directly to what a richer dataset would need to include.
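The reported validation metrics are internally consistent: F0.5 weights precision four times as heavily as recall (beta² = 0.25), and plugging in P = 1/3 and R = 2/9 (assuming the rounded 0.3333 and 0.2222 correspond to exactly these fractions) recovers the stated 0.3030:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta < 1 favours precision, beta > 1 favours recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reported validation metrics, to full precision.
f05 = f_beta(1 / 3, 2 / 9, beta=0.5)
print(f"{f05:.4f}")  # matches the reported Val F0.5 = 0.3030
```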

[LG-42] MVNN: A Measure-Valued Neural Network for Learning McKean-Vlasov Dynamics from Particle Data

链接: https://arxiv.org/abs/2604.00333
作者: Liyao Lyu,Xinyue Yu,Hayden Schaeffer
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Collective behaviors that emerge from interactions are fundamental to numerous biological systems. To learn such interacting forces from observations, we introduce a measure-valued neural network that infers measure-dependent interaction (drift) terms directly from particle-trajectory observations. The proposed architecture generalizes standard neural networks to operate on probability measures by learning cylindrical features, using an embedding network that produces scalable distribution-to-vector representations. On the theory side, we establish well-posedness of the resulting dynamics and prove propagation-of-chaos for the associated interacting-particle system. We further show universal approximation and quantitative approximation rates under a low-dimensional measure-dependence assumption. Numerical experiments on first and second order systems, including deterministic and stochastic Motsch-Tadmor dynamics, two-dimensional attraction-repulsion aggregation, Cucker-Smale dynamics, and a hierarchical multi-group system, demonstrate accurate prediction and strong out-of-distribution generalization.
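The core architectural idea, a cylindrical (distribution-to-vector) embedding of the empirical measure, can be sketched in a deep-sets style. The random feature map below is a stand-in for the paper's learned embedding network:

```python
import numpy as np

def embed_measure(particles, W, b):
    """Distribution-to-vector embedding of an empirical measure.

    Maps mu = (1/N) sum_i delta_{x_i} to the generalized-moment vector
    (1/N) sum_i phi(x_i). A drift F(x, mu) ~= F(x, embed(mu)) is then a
    cylindrical function of the measure. W, b and tanh stand in for the
    learned embedding network in the paper.
    """
    return np.tanh(particles @ W + b).mean(axis=0)  # (m,), permutation invariant

rng = np.random.default_rng(1)
W, b = rng.normal(size=(2, 16)), rng.normal(size=16)
cloud = rng.normal(size=(500, 2))              # 500 particles in R^2
z = embed_measure(cloud, W, b)
z_perm = embed_measure(cloud[::-1], W, b)      # particle order is irrelevant
```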

[LG-43] Vocal Prognostic Digital Biomarkers in Monitoring Chronic Heart Failure: A Longitudinal Observational Study

链接: https://arxiv.org/abs/2604.00308
作者: Fan Wu,Matthias P. Nägele,Daryush D. Mehta,Elgar Fleisch,Frank Ruschitzka,Andreas J. Flammer,Filipe Barata
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: This study aimed to evaluate which voice features can predict health deterioration in patients with chronic HF. Background: Heart failure (HF) is a chronic condition with progressive deterioration and acute decompensations, often requiring hospitalization and imposing substantial healthcare and economic burdens. Current standard-of-care (SoC) home monitoring, such as weight tracking, lacks predictive accuracy and requires high patient engagement. Voice is a promising non-invasive biomarker, though prior studies have mainly focused on acute HF stages. Methods: In a 2-month longitudinal study, 32 patients with HF collected daily voice recordings and SoC measures of weight and blood pressure at home, with biweekly questionnaires for health status. Acoustic analysis generated detailed vowel and speech features. Time-series features were extracted from aggregated lookback windows (e.g., 7 days) to predict next-day health status. Explainable machine learning with nested cross-validation identified top vocal biomarkers, and a case study illustrated model application. Results: A total of 21,863 recordings were analyzed. Acoustic vowel features showed strong correlations with health status. Time-series voice features within the lookback window outperformed corresponding standard care measures, achieving peak sensitivity and specificity of 0.826 and 0.782 versus 0.783 and 0.567 for SoC metrics. Key prognostic voice features identifying deterioration included delayed energy shift, low energy variability, and higher shimmer variability in vowels, along with reduced speaking and articulation rate, lower phonation ratio, decreased voice quality, and increased formant variability in speech. Conclusion: Voice-based monitoring offers a non-invasive approach to detect early health changes in chronic HF, supporting proactive and personalized care. 
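The lookback-window construction above can be sketched as follows; the specific summary statistics are illustrative, and the study's actual time-series feature set is richer:

```python
import numpy as np

def lookback_features(series, window=7):
    """Summarize the trailing `window` days of a daily voice feature
    (e.g. mean shimmer) into time-series features used to predict
    next-day health status. The statistics here are a hypothetical
    subset of the study's feature set."""
    w = np.asarray(series[-window:], dtype=float)
    slope = np.polyfit(np.arange(len(w)), w, 1)[0]   # short-term trend
    return {
        "mean": w.mean(),
        "std": w.std(),              # e.g. shimmer *variability* was prognostic
        "slope": slope,
        "last_minus_mean": w[-1] - w.mean(),
    }

daily_shimmer = [0.21, 0.22, 0.20, 0.23, 0.25, 0.27, 0.30]  # 7-day lookback
feats = lookback_features(daily_shimmer, window=7)
```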

[LG-44] SAGE: Subsurface AI-driven Geostatistical Extraction with proxy posterior

链接: https://arxiv.org/abs/2604.00307
作者: Huseyin Tuna Erdinc,Ipsita Bhar,Rafael Orozco,Thales Souza,Felix J. Herrmann
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph); Machine Learning (stat.ML)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:Recent advances in generative networks have enabled new approaches to subsurface velocity model synthesis, offering a compelling alternative to traditional methods such as Full Waveform Inversion. However, these approaches predominantly rely on the availability of large-scale datasets of high-quality, geologically realistic subsurface velocity models, which are often difficult to obtain in practice. We introduce SAGE, a novel framework for statistically consistent proxy velocity generation from incomplete observations, specifically sparse well logs and migrated seismic images. During training, SAGE learns a proxy posterior over velocity models conditioned on both modalities (wells and seismic); at inference, it produces full-resolution velocity fields conditioned solely on migrated images, with well information implicitly encoded in the learned distribution. This enables the generation of geologically plausible and statistically accurate velocity realizations. We validate SAGE on both synthetic and field datasets, demonstrating its ability to capture complex subsurface variability under limited observational constraints. Furthermore, samples drawn from the learned proxy distribution can be leveraged to train downstream networks, supporting inversion workflows. Overall, SAGE provides a scalable and data-efficient pathway toward learning geological proxy posterior for seismic imaging and inversion. Repo link: this https URL.

[LG-45] SYNTHONY: A Stress-Aware Intent-Conditioned Agent for Deep Tabular Generative Models Selection

链接: https://arxiv.org/abs/2604.00293
作者: Hochan Son,Xiaofeng Lin,Jason Ni,Guang Cheng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Deep generative models for tabular data (GANs, diffusion models, and LLM-based generators) exhibit highly non-uniform behavior across datasets; the best-performing synthesizer family depends strongly on distributional stressors such as long-tailed marginals, high-cardinality categoricals, Zipfian imbalance, and small-sample regimes. This brittleness makes practical deployment challenging, especially when users must balance competing objectives of fidelity, privacy, and utility. We study intent-conditioned tabular synthesis selection: given a dataset and a user intent expressed as a preference over evaluation metrics, the goal is to select a synthesizer that minimizes regret relative to an intent-specific oracle. We propose stress profiling, a synthesis-specific meta-feature representation that quantifies dataset difficulty along four interpretable stress dimensions, and integrate it into SYNTHONY, a selection framework that matches stress profiles against a calibrated capability registry of synthesizer families. Across a benchmark of 7 datasets, 10 synthesizers, and 3 intents, we demonstrate that stress-based meta-features are highly predictive of synthesizer performance: a kNN selector using these features achieves strong Top-1 selection accuracy, substantially outperforming zero-shot LLM selectors and random baselines. We analyze the gap between meta-feature-based and capability-based selection, identifying the hand-crafted capability registry as the primary bottleneck and motivating learned capability representations as a direction for future work.
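The kNN selector over stress profiles can be sketched as follows; the stress dimensions and the majority-vote rule are simplified assumptions, not the paper's exact registry matching:

```python
import numpy as np

def knn_select(stress_profile, train_profiles, train_best, k=3):
    """Pick a synthesizer for a new dataset by k-NN over stress profiles.

    `train_profiles` holds stress vectors (e.g. tail weight, categorical
    cardinality, imbalance, small-sample stress) of benchmark datasets, and
    `train_best` the best synthesizer family for each under a given intent.
    """
    d = np.linalg.norm(train_profiles - stress_profile, axis=1)
    nearest = np.argsort(d)[:k]
    votes = [train_best[i] for i in nearest]
    return max(set(votes), key=votes.count)          # majority vote

profiles = np.array([[0.9, 0.1, 0.8, 0.2],   # long-tailed, imbalanced
                     [0.1, 0.9, 0.2, 0.1],   # high-cardinality
                     [0.8, 0.2, 0.9, 0.3],
                     [0.2, 0.8, 0.1, 0.2]])
best = ["diffusion", "llm", "diffusion", "llm"]
choice = knn_select(np.array([0.85, 0.15, 0.85, 0.25]), profiles, best, k=3)
```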

[LG-46] MambaVoiceCloning: Efficient and Expressive Text-to-Speech via State-Space Modeling and Diffusion Control ICLR2026

链接: https://arxiv.org/abs/2604.00292
作者: Sahil Kumar,Namrataben Patel,Honggang Wang,Youshan Zhang
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026

点击查看摘要

Abstract:MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F0 RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by 1.6x. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.

[LG-47] Data-Driven Reachability Analysis via Diffusion Models with PAC Guarantees

链接: https://arxiv.org/abs/2604.00283
作者: Yanliang Huang,Peng Xie,Wenyuan Wu,Zhuoqi Zeng,Amr Alanwar
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, submitted to the 65th IEEE Conference on Decision and Control (CDC 2026)

点击查看摘要

Abstract:We present a data-driven framework for reachability analysis of nonlinear dynamical systems that requires no explicit model. A denoising diffusion probabilistic model learns the time-evolving state distribution of a dynamical system from trajectory data alone. The predicted reachable set takes the form of a sublevel set of a nonconformity score derived from the reconstruction error, with the threshold calibrated via the Learn Then Test procedure so that the probability of excluding a reachable state is bounded with high probability. Experiments on three nonlinear systems, a forced Duffing oscillator, a planar quadrotor, and a high-dimensional reaction-diffusion system, confirm that the empirical miss rate remains below the Probably Approximately Correct (PAC) bound while scaling to state dimensions beyond the reach of classical grid-based and polynomial methods.
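The threshold-calibration step can be illustrated with a plain split-conformal quantile, which is a simplification of the Learn Then Test procedure the paper actually uses (LTT additionally controls the probability that the miss-rate bound itself fails):

```python
import numpy as np

def calibrate_threshold(scores, alpha=0.1):
    """Split-conformal quantile: with n held-out nonconformity scores, the
    ceil((n+1)(1-alpha))-th smallest score bounds the probability that a
    fresh exchangeable score exceeds it by ~alpha. The reachable set is
    then the sublevel set {x : score(x) <= tau}."""
    n = len(scores)
    q = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(scores)[min(q, n) - 1])

rng = np.random.default_rng(2)
cal = rng.exponential(size=1000)              # reconstruction-error scores
tau = calibrate_threshold(cal, alpha=0.1)
new_scores = rng.exponential(size=5000)       # fresh scores, same distribution
miss_rate = float(np.mean(new_scores > tau))  # empirically close to alpha
```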

[LG-48] Autonomous Adaptive Solver Selection for Chemistry Integration via Reinforcement Learning

链接: https://arxiv.org/abs/2604.00264
作者: Eloghosa Ikponmwoba,Opeoluwa Owoyele
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The computational cost of stiff chemical kinetics remains a dominant bottleneck in reacting-flow simulation, yet hybrid integration strategies are typically driven by hand-tuned heuristics or supervised predictors that make myopic decisions from instantaneous local state. We introduce a constrained reinforcement learning (RL) framework that autonomously selects between an implicit BDF integrator (CVODE) and a quasi-steady-state (QSS) solver during chemistry integration. Solver selection is cast as a Markov decision process. The agent learns trajectory-aware policies that account for how present solver choices influence downstream error accumulation, while minimizing computational cost under a user-prescribed accuracy tolerance enforced through a Lagrangian reward with online multiplier adaptation. Across sampled 0D homogeneous reactor conditions, the RL-adaptive policy achieves a mean speedup of approximately 3×, with speedups ranging from 1.11× to 10.58×, while maintaining accurate ignition delays and species profiles for a 106-species n-dodecane mechanism and adding approximately 1% inference overhead. Without retraining, the 0D-trained policy transfers to 1D counterflow diffusion flames over strain rates of 10–2000 s^-1, delivering a consistent ≈2.2× speedup relative to CVODE while preserving near-reference temperature accuracy and selecting CVODE at only 12–15% of space-time points. Overall, the results demonstrate the potential of the proposed reinforcement learning framework to learn problem-specific integration strategies while respecting accuracy constraints, thereby opening a pathway toward adaptive, self-optimizing workflows for multiphysics systems with spatially heterogeneous stiffness.
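The Lagrangian reward with online multiplier adaptation can be sketched as follows; the dual-ascent update rule and its learning rate are illustrative, not the paper's exact schedule:

```python
def lagrangian_reward(cost, error, tol, lam):
    """Reward for the solver-selection agent: minimize compute cost subject
    to an accuracy tolerance, enforced softly via a Lagrange multiplier."""
    return -cost - lam * max(0.0, error - tol)

def update_multiplier(lam, error, tol, lr=0.5):
    """Dual ascent: grow lam while the constraint is violated, shrink it
    (toward 0) when the policy is comfortably within tolerance."""
    return max(0.0, lam + lr * (error - tol))

lam = 1.0
for error in [0.3, 0.25, 0.15, 0.05]:   # episode-level errors vs tol = 0.1
    r = lagrangian_reward(cost=1.0, error=error, tol=0.1, lam=lam)
    lam = update_multiplier(lam, error, tol=0.1)
```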

[LG-49] Learning to Shuffle: Block Reshuffling and Reversal Schemes for Stochastic Optimization

链接: https://arxiv.org/abs/2604.00260
作者: Lam M. Nguyen,Dzung T. Phan,Jayant Kalagnanam
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Shuffling strategies for stochastic gradient descent (SGD), including incremental gradient, shuffle-once, and random reshuffling, are supported by rigorous convergence analyses for arbitrary within-epoch permutations. In particular, random reshuffling is known to improve optimization constants relative to cyclic and shuffle-once schemes. However, existing theory offers limited guidance on how to design new data-ordering schemes that further improve optimization constants or stability beyond random reshuffling. In this paper, we design a pipeline using a large language model (LLM)-guided program evolution framework to discover an effective shuffling rule for without-replacement SGD. Abstracting from this instance, we identify two fundamental structural components: block reshuffling and paired reversal. We analyze these components separately and show that block reshuffling strictly reduces prefix-gradient variance constants within the unified shuffling framework, yielding provable improvements over random reshuffling under mild conditions. Separately, we show that paired reversal symmetrizes the epoch map and cancels the leading order-dependent second-order term, reducing order sensitivity from quadratic to cubic in the step size. Numerical experiments with the discovered algorithm validate the theory and demonstrate consistent gains over standard shuffling schemes across convex and nonconvex benchmarks.
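The two structural components identified above can be sketched directly; how they are combined into the discovered rule is a design choice, so they are shown separately here:

```python
import random

def block_reshuffle(n, block_size, rng):
    """Partition indices 0..n-1 into contiguous blocks, shuffle the block
    order each epoch, and keep the within-block order fixed."""
    blocks = [list(range(i, min(i + block_size, n)))
              for i in range(0, n, block_size)]
    rng.shuffle(blocks)
    return [i for b in blocks for i in b]

def paired_reversal(order):
    """Run the epoch's order, then the same order reversed: the reversed
    pass symmetrizes the epoch map and cancels the leading
    order-dependent second-order term."""
    return order + order[::-1]

rng = random.Random(0)
epoch = block_reshuffle(10, block_size=3, rng=rng)   # one without-replacement pass
pair = paired_reversal(epoch)                         # forward pass + mirror pass
```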

[LG-50] Informed Machine Learning with Knowledge Landmarks

链接: https://arxiv.org/abs/2604.00256
作者: Chuyi Dai,Witold Pedrycz,Suping Xu,Ding Liu,Xianmin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Informed Machine Learning has emerged as a viable generalization of Machine Learning (ML) by building a unified conceptual and algorithmic setting for constructing models on a unified basis of knowledge and data. Physics-informed ML involving physics equations is one of the developments within Informed Machine Learning. This study proposes a novel direction of Knowledge-Data ML, referred to as KD-ML, where numeric data are integrated with knowledge tidbits expressed in the form of granular knowledge landmarks. We advocate that data and knowledge are complementary in several fundamental ways: data are precise (numeric) and local, usually confined to some region of the input space, while knowledge is global and formulated at a higher level of abstraction. The knowledge can be represented as information granules and organized as a collection of input-output information granules called knowledge landmarks. In virtue of this evident complementarity, we develop a comprehensive design process of the KD-ML model and formulate an original augmented loss function L, which additively embraces the component responsible for optimizing the model based on available numeric data, while the second component, playing the role of a granular regularizer, so that it adheres to the granular constraints (knowledge landmarks). We show the role of the hyperparameter positioned in the loss function, which balances the contribution and guiding role of data and knowledge, and point to some essential tendencies associated with the quality of data (noise level) and the level of granularity of the knowledge landmarks. Experiments on two physics-governed benchmarks demonstrate that the proposed KD model consistently outperforms data-driven ML models.
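The augmented loss L can be sketched with interval-valued knowledge landmarks; the squared-hinge penalty below is one plausible instantiation of the granular regularizer, not necessarily the paper's exact form:

```python
import numpy as np

def kd_loss(y_pred, y_true, x, landmarks, lam):
    """Augmented loss L = data term + lam * granular regularizer.

    A knowledge landmark is an (input interval, output interval) pair; the
    regularizer penalizes predictions that leave the output interval for
    inputs inside the input interval. lam balances the guiding roles of
    numeric data and knowledge."""
    data_term = np.mean((y_pred - y_true) ** 2)
    reg = 0.0
    for (x_lo, x_hi, y_lo, y_hi) in landmarks:
        inside = (x >= x_lo) & (x <= x_hi)
        if inside.any():
            p = y_pred[inside]
            reg += np.mean(np.maximum(0, y_lo - p) ** 2
                           + np.maximum(0, p - y_hi) ** 2)
    return data_term + lam * reg

x = np.linspace(0, 1, 5)
y_true = x ** 2
y_pred = x ** 2 + 0.1
# Landmark: for x in [0.5, 1.0], the output should lie in [0.2, 0.9].
L = kd_loss(y_pred, y_true, x, [(0.5, 1.0, 0.2, 0.9)], lam=0.5)
```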

[LG-51] Hierarchical Discrete Flow Matching for Graph Generation

链接: https://arxiv.org/abs/2604.00236
作者: Yoann Boget,Pablo Strasser,Alexandros Kalousis
类目: Machine Learning (cs.LG)
*备注: Graph, generation, hierarchical

点击查看摘要

Abstract:Denoising-based models, including diffusion and flow matching, have led to substantial advances in graph generation. Despite this progress, such models remain constrained by two fundamental limitations: a computational cost that scales quadratically with the number of nodes and a large number of function evaluations required during generation. In this work, we introduce a novel hierarchical generative framework that reduces the number of node pairs that must be evaluated and adopts discrete flow matching to significantly decrease the number of denoising iterations. We empirically demonstrate that our approach more effectively captures graph distributions while substantially reducing generation time.

[LG-52] Neural Collapse Dynamics: Depth Activation Regularisation and Feature Norm Threshold

链接: https://arxiv.org/abs/2604.00230
作者: Anamika Paul Rupa
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural collapse (NC) – the convergence of penultimate-layer features to a simplex equiangular tight frame – is well understood at equilibrium, but the dynamics governing its onset remain poorly characterised. We identify a simple and predictive regularity: NC occurs when the mean feature norm reaches a model-dataset-specific critical value, fn*, that is largely invariant to training conditions. This value concentrates tightly within each (model, dataset) pair (CV 8%); training dynamics primarily affect the rate at which fn approaches fn*, rather than the value itself. In standard training trajectories, the crossing of fn below fn* consistently precedes NC onset, providing a practical predictor with a mean lead time of 62 epochs (MAE 24 epochs). A direct intervention experiment confirms fn* is a stable attractor of the gradient flow – perturbations to feature scale are self-corrected during training, with convergence to the same value regardless of direction (p0.2). Completing the (architecture) × (dataset) grid reveals the paper’s strongest result: ResNet-20 on MNIST gives fn* = 5.867 – a +458% architecture effect versus only +68% on CIFAR-10. The grid is strongly non-additive; fn* cannot be decomposed into independent architecture and dataset contributions. Four structural regularities emerge: (1) depth has a non-monotonic effect on collapse speed; (2) activation jointly determines both collapse speed and fn*; (3) weight decay defines a three-regime phase diagram – too little slows, an optimal range is fastest, and too much prevents collapse; (4) width monotonically accelerates collapse while shifting fn* by at most 13%. These results establish feature-norm dynamics as an actionable diagnostic for predicting NC timing, suggesting that norm-threshold behaviour is a general mechanism underlying delayed representational reorganisation in deep networks.

[LG-53] Risk-Aware Batch Testing for Performance Regression Detection

链接: https://arxiv.org/abs/2604.00222
作者: Ali Sayedsalehi,Peter C. Rigby,Gregory Mierzwinski
类目: Software Engineering (cs.SE); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 14 pages, 1 figure, 4 tables. Replication package and dataset available

点击查看摘要

Abstract:Performance regression testing is essential in large-scale continuous-integration (CI) systems, yet executing full performance suites for every commit is prohibitively expensive. Prior work on performance regression prediction and batch testing has shown independent benefits, but each faces practical limitations: predictive models are rarely integrated into CI decision-making, and conventional batching strategies ignore commit-level heterogeneity. We unify these strands by introducing a risk-aware framework that integrates machine-learned commit risk with adaptive batching. Using Mozilla Firefox as a case study, we construct a production-derived dataset of human-confirmed regressions aligned chronologically with Autoland, and fine-tune ModernBERT, CodeBERT, and LLaMA-3.1 variants to estimate commit-level performance regression risk, achieving up to 0.694 ROC-AUC with CodeBERT. The risk scores drive a family of risk-aware batching strategies, including Risk-Aged Priority Batching and Risk-Adaptive Stream Batching, evaluated through realistic CI simulations. Across thousands of historical Firefox commits, our best overall configuration, Risk-Aged Priority Batching with linear aggregation (RAPB-la), yields a Pareto improvement over Mozilla’s production-inspired baseline. RAPB-la reduces total test executions by 32.4%, decreases mean feedback time by 3.8%, maintains mean time-to-culprit at approximately the baseline level, reduces maximum time-to-culprit by 26.2%, and corresponds to an estimated annual infrastructure cost savings of approximately 491K under our cost model. These results demonstrate that risk-aware batch testing can reduce CI resource consumption while improving diagnostic timeliness. To support reproducibility and future research, we release a complete replication package containing all datasets, fine-tuning pipelines, and implementations of our batching algorithms.
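Risk-aged priority batching can be sketched as a queue in which waiting time is added to predicted risk, so high-risk commits are tested first but low-risk ones cannot starve; the linear risk+age aggregation mirrors RAPB-la in spirit, with an illustrative weight:

```python
def risk_aged_priority(commits, now, w_age=0.1):
    """Order queued commits by risk 'aged' with waiting time; a batch is
    then taken from the front of the ordered queue. The aging weight is
    a hypothetical parameter, not the paper's tuned value."""
    def priority(c):
        return c["risk"] + w_age * (now - c["t"])   # risk + aged waiting time
    return sorted(commits, key=priority, reverse=True)

queue = [
    {"id": "a", "risk": 0.2, "t": 0},   # low risk, waiting 10 time units
    {"id": "b", "risk": 0.9, "t": 9},   # high risk, just arrived
    {"id": "c", "risk": 0.4, "t": 8},
]
ordered = risk_aged_priority(queue, now=10)
# The long-waiting low-risk commit "a" overtakes "b" thanks to aging.
```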

[LG-54] Measuring the Representational Alignment of Neural Systems in Superposition

链接: https://arxiv.org/abs/2604.00208
作者: Sunny Liu,Habon Issa,André Longon,Liv Gorton,Meenakshi Khosla,David Klindt
类目: Machine Learning (cs.LG)
*备注: 17 pages, 4 figures

点击查看摘要

Abstract:Comparing the internal representations of neural networks is a central goal in both neuroscience and machine learning. Standard alignment metrics operate on raw neural activations, implicitly assuming that similar representations produce similar activity patterns. However, neural systems frequently operate in superposition, encoding more features than they have neurons via linear compression. We derive closed-form expressions showing that superposition systematically deflates Representational Similarity Analysis, Centered Kernel Alignment, and linear regression, causing networks with identical feature content to appear dissimilar. The root cause is that these metrics are dependent on cross-similarity between two systems’ respective superposition matrices, which under assumption of random projection usually differ significantly, not on the latent features themselves: alignment scores conflate what a system represents with how it represents it. Under partial feature overlap, this confound can invert the expected ordering, making systems sharing fewer features appear more aligned than systems sharing more. Crucially, the apparent misalignment need not reflect a loss of information; compressed sensing guarantees that the original features remain recoverable from the lower-dimensional activity, provided they are sparse. We therefore argue that comparing neural systems in superposition requires extracting and aligning the underlying features rather than comparing the raw neural mixtures.
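The deflation effect is easy to reproduce: project the same latent features through two different random superposition matrices, and linear CKA drops well below 1 even though the feature content is identical. A self-contained demo of the phenomenon (not the paper's closed-form derivation):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices
    (samples x units)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    return num / (np.linalg.norm(X.T @ X, "fro")
                  * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 50))          # 50 latent features, 200 samples
# Two systems encoding the SAME features through different random
# superposition matrices into only 20 neurons each.
A = F @ rng.normal(size=(50, 20))
B = F @ rng.normal(size=(50, 20))
cka_identical = linear_cka(F, F)        # same features, same basis: 1.0
cka_superposed = linear_cka(A, B)       # deflated despite identical features
```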

[LG-55] Lead Zirconate Titanate Reservoir Computing for Classification of Written and Spoken Digits

链接: https://arxiv.org/abs/2604.00207
作者: Thomas Buckley,Leslie Schumm,Manor Askenazi,Edward Rietman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we extend our earlier work of (Rietman et al. 2022) presenting an application of physical Reservoir Computing (RC) to the classification of handwritten and spoken digits. We utilize an unpoled cube of Lead Zirconate Titanate (PZT) as a computational substrate to process these datasets. Our results demonstrate that the PZT reservoir achieves 89.0% accuracy on MNIST handwritten digits, representing a 2.4 percentage point improvement over logistic regression baselines applied to the same preprocessed data. However, for the AudioMNIST spoken digits dataset, the reservoir system (88.2% accuracy) performs equivalently to baseline methods (88.1% accuracy), suggesting that reservoir computing provides the greatest benefits for classification tasks of intermediate difficulty where linear methods underperform but the problem remains learnable. PZT is a well-known material already used in semiconductor applications, presenting a low-power computational substrate that can be integrated with digital algorithms. Our findings indicate that physical reservoirs excel when the task difficulty exceeds the capability of simple linear classifiers but remains within the computational capacity of the reservoir dynamics.

[LG-56] Unsupervised 4D Flow MRI Velocity Enhancement and Unwrapping Using Divergence-Free Neural Networks

链接: https://arxiv.org/abs/2604.00205
作者: Javier Bisbal,Julio Sotelo,Hernán Mella,Oliver Welin Odeback,Joaquín Mura,David Marlevi,Junya Matsuda,Kotomi Iwata,Tetsuro Sekine,Cristian Tejos,Sergio Uribe
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures, 7 tables

点击查看摘要

Abstract:This work introduces an unsupervised Divergence and Aliasing-Free neural network (DAF-FlowNet) for 4D Flow Magnetic Resonance Imaging (4D Flow MRI) that jointly enhances noisy velocity fields and corrects phase wrapping artifacts. DAF-FlowNet parameterizes velocities as the curl of a vector potential, enforcing mass conservation by construction and avoiding explicit divergence-penalty tuning. A cosine data-consistency loss enables simultaneous denoising and unwrapping from wrapped phase images. On synthetic aortic 4D Flow MRI generated from computational fluid dynamics, DAF-FlowNet achieved lower errors than existing techniques (up to 11% lower velocity normalized root mean square error, 11% lower directional error, and 44% lower divergence relative to the best-performing alternative across noise levels), with robustness to moderate segmentation perturbations. For unwrapping, at peak-velocity/velocity-encoding ratios of 1.4 and 2.1, DAF-FlowNet achieved 0.18% and 5.2% residual wrapped voxels, representing reductions of 72% and 18% relative to the best alternative method, respectively. In scenarios with both noise and aliasing, the proposed single-stage formulation outperformed a state-of-the-art sequential pipeline (up to 15% lower velocity normalized root mean square error, 11% lower directional error, and 28% lower divergence). Across 10 hypertrophic cardiomyopathy patient datasets, DAF-FlowNet preserved fine-scale flow features, corrected aliased regions, and improved internal flow consistency, as indicated by reduced inter-plane flow bias in aortic and pulmonary mass-conservation analyses recommended by the 4D Flow MRI consensus guidelines. These results support DAF-FlowNet as a framework that unifies velocity enhancement and phase unwrapping to improve the reliability of cardiovascular 4D Flow MRI.
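The divergence-free-by-construction property of the curl parameterization can be checked numerically. The sketch below uses the 2D analog (a stream function standing in for the network output, an assumption for illustration) and verifies that the resulting velocity field has zero discrete divergence up to floating-point error; it is not the paper's 4D implementation.

```python
import numpy as np

n = 128
h = 1.0 / (n - 1)                                      # grid spacing
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
psi = np.sin(2 * np.pi * x) * np.cos(3 * np.pi * y)    # stands in for the network output

# 2D curl: v = (d psi / dy, -d psi / dx) is divergence-free by construction.
vx = np.gradient(psi, h, axis=1)
vy = -np.gradient(psi, h, axis=0)

# Divergence computed with the same finite-difference stencil vanishes
# identically, because the x- and y-difference operators commute.
div = np.gradient(vx, h, axis=0) + np.gradient(vy, h, axis=1)
print(float(np.abs(div).max()))                        # zero up to round-off
```

No divergence penalty had to be tuned: mass conservation holds for any potential the network outputs, which is the point of the parameterization.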

[LG-57] Offline Constrained RLHF with Multiple Preference Oracles

链接: https://arxiv.org/abs/2604.00200
作者: Brenden Latham,Mehrdad Moharrami
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study offline constrained reinforcement learning from human feedback with multiple preference oracles. Motivated by applications that trade off performance with safety or fairness, we aim to maximize target population utility subject to a minimum protected group welfare constraint. From pairwise comparisons collected under a reference policy, we estimate oracle-specific rewards via maximum likelihood and analyze how statistical uncertainty propagates through the dual program. We cast the constrained objective as a KL-regularized Lagrangian whose primal optimizer is a Gibbs policy, reducing learning to a convex dual problem. We propose a dual-only algorithm that ensures high-probability constraint satisfaction and provide the first finite-sample performance guarantees for offline constrained preference learning. Finally, we extend our theoretical analysis to accommodate multiple constraints and general f-divergence regularization.
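A minimal sketch of the dual-only idea on a toy discrete action space (all rewards, the temperature, and the welfare floor below are invented for illustration): the primal optimizer of the KL-regularized Lagrangian is a Gibbs policy tilting the reference policy, so learning reduces to (sub)gradient ascent on a single scalar dual variable.

```python
import numpy as np

rng = np.random.default_rng(1)
A, beta, b = 10, 0.5, 0.6                 # actions, KL temperature, welfare floor
pi_ref = np.full(A, 1.0 / A)              # reference policy
r = rng.random(A)                         # estimated target-population reward
r_c = np.linspace(0.0, 1.0, A)            # estimated protected-group reward

def gibbs(lam):
    """Primal optimizer of the KL-regularized Lagrangian: a Gibbs policy."""
    logits = np.log(pi_ref) + (r + lam * r_c) / beta
    p = np.exp(logits - logits.max())
    return p / p.sum()

lam = 0.0
for _ in range(500):                      # projected dual (sub)gradient ascent
    lam = max(0.0, lam + 0.1 * (b - gibbs(lam) @ r_c))

pi = gibbs(lam)
print(round(float(pi @ r_c), 3))          # meets the welfare floor b
```

The paper's contribution is the finite-sample analysis of how reward-estimation error propagates through this dual program; the sketch only shows the mechanical structure.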

[LG-58] Lévy-Flow Models: Heavy-Tail-Aware Normalizing Flows for Financial Risk Management

链接: https://arxiv.org/abs/2604.00195
作者: Rachid Drissi
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures, 7 tables

点击查看摘要

Abstract:We introduce Lévy-Flows, a class of normalizing flow models that replace the standard Gaussian base distribution with Lévy process-based distributions, specifically Variance Gamma (VG) and Normal-Inverse Gaussian (NIG). These distributions naturally capture heavy-tailed behavior while preserving exact likelihood evaluation and efficient reparameterized sampling. We establish theoretical guarantees on tail behavior, showing that for regularly varying bases the tail index is preserved under asymptotically linear flow transformations, and that identity-tail Neural Spline Flow architectures preserve the base distribution’s tail shape exactly outside the transformation region. Empirically, we evaluate on S&P 500 daily returns and additional assets, demonstrating substantial improvements in density estimation and risk calibration. VG-based flows reduce test negative log-likelihood by 69% relative to Gaussian flows and achieve exact 95% VaR calibration, while NIG-based flows provide the most accurate Expected Shortfall estimates. These results show that incorporating Lévy process structure into normalizing flows yields significant gains in modeling heavy-tailed data, with applications to financial risk management.
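A sampling sketch for one of the proposed base distributions: a symmetric Variance Gamma draw is a Gaussian with gamma-distributed random variance, giving heavier tails than the Gaussian (excess kurtosis 3·nu in the symmetric case). Parameter values are illustrative; the paper's flows additionally need exact densities and reparameterized gradients.

```python
import numpy as np

rng = np.random.default_rng(2)
nu, n = 0.5, 200_000
G = rng.gamma(shape=1.0 / nu, scale=nu, size=n)  # gamma mixing variable, mean 1
Z = rng.standard_normal(n)
X = np.sqrt(G) * Z                               # symmetric Variance Gamma draw

def kurtosis(x):
    c = x - x.mean()
    return float((c ** 4).mean() / (c ** 2).mean() ** 2)

print(round(kurtosis(X), 2), round(kurtosis(Z), 2))  # ~3 * (1 + nu) = 4.5 vs ~3
```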

[LG-59] Finite-Time Analysis of Projected Two-Time-Scale Stochastic Approximation

链接: https://arxiv.org/abs/2604.00179
作者: Yitao Bai,Thinh T. Doan,Justin Romberg
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:We study the finite-time convergence of projected linear two-time-scale stochastic approximation with constant step sizes and Polyak–Ruppert averaging. We establish an explicit mean-square error bound, decomposing it into two interpretable components, an approximation error determined by the constrained subspace and a statistical error decaying at a sublinear rate, with constants expressed through restricted stability margins and a coupling invertibility condition. These constants cleanly separate the effect of subspace choice (approximation errors) from the effect of the averaging horizon (statistical errors). We illustrate our theoretical results through a number of numerical experiments on both synthetic and reinforcement learning problems.
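A deliberately simplified single-variable illustration of the constant-step-plus-averaging mechanism (reduced from the paper's projected two-time-scale setting): the raw iterate hovers in a noise ball around the root, while a Polyak–Ruppert-style tail average concentrates much closer.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, T = 0.1, 20_000                  # constant step size, horizon
x = 0.0
iters = np.empty(T)
for k in range(T):
    x += alpha * (-(x - 2.0) + rng.normal())   # noisy root-finding update, root at 2
    iters[k] = x

x_avg = iters[T // 2:].mean()           # Polyak-Ruppert-style tail average
print(round(abs(iters[-1] - 2.0), 3), round(abs(x_avg - 2.0), 3))
```

This mirrors the paper's error decomposition: a constant-step iterate carries an irreducible noise floor, and averaging drives the statistical component down at a sublinear rate.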

[LG-60] Predicting Wave Reflection and Transmission in Heterogeneous Media via Fourier Operator-Based Transformer Modeling

链接: https://arxiv.org/abs/2604.00132
作者: Zhe Bai,Hans Johansen
类目: Machine Learning (cs.LG)
*备注: 6 pages, 9 figures, ACDSA 2026

点击查看摘要

Abstract:We develop a machine learning (ML) surrogate model to approximate solutions to Maxwell’s equations in one dimension, focusing on scenarios involving a material interface that reflects and transmits electromagnetic waves. Derived from high-fidelity Finite Volume (FV) simulations, our training data includes variations of the initial conditions, as well as variations in one material’s speed of light, allowing the model to learn a range of wave-material interaction behaviors. The ML model autoregressively learns both the physical and frequency embeddings in a vision transformer-based framework. By incorporating Fourier transforms in the latent space, the wave number spectra of the solutions align closely with the simulation data. Prediction errors exhibit approximately linear growth over time, with a sharp increase at the material interface. Test results show that the ML solution maintains relative errors below 10% across more than 75 time-step rollouts, despite the presence of the discontinuity and unknown material properties.

[LG-61] Efficient Software Vulnerability Detection Using Transformer-based Models

链接: https://arxiv.org/abs/2604.00112
作者: Sameer Shaik,Zhen Huang,Daniela Stan Raicu,Jacob Furst
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Detecting software vulnerabilities is critical to ensuring the security and reliability of modern computer systems. Deep neural networks have shown promising results on vulnerability detection, but they lack the capability to capture global contextual information on vulnerable code. To address this limitation, we explore the application of transformers for C/C++ vulnerability detection. We use program slices that encapsulate key syntactic and semantic features of program code, such as API function calls, array usage, pointer manipulations, and arithmetic expressions. By leveraging transformers’ capability to capture both local and global contextual information on vulnerable code, our work can identify vulnerabilities accurately. Combined with data balancing and hyperparameter fine-tuning, our work offers a robust and efficient approach to identifying vulnerable code with moderate resource usage and training time.

[LG-62] Speeding Up Mixed-Integer Programming Solvers with Sparse Learning for Branching

链接: https://arxiv.org/abs/2604.00094
作者: Selin Bayramoğlu,George L Nemhauser,Nikolaos V Sahinidis
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 21 pages, 2 figures

点击查看摘要

Abstract:Machine learning is increasingly used to improve decisions within branch-and-bound algorithms for mixed-integer programming. Many existing approaches rely on deep learning, which often requires very large training datasets and substantial computational resources for both training and deployment, typically with GPU parallelization. In this work, we take a different path by developing interpretable models that are simple but effective. We focus on approximating strong branching (SB) scores, a highly effective yet computationally expensive branching rule. Using sparse learning methods, we build models with fewer than 4% of the parameters of a state-of-the-art graph neural network (GNN) while achieving competitive accuracy. Relative to SCIP’s built-in branching rules and the GNN-based model, our CPU-only models are faster than the default solver and the GPU-accelerated GNN. The models are simple to train and deploy, and they remain effective with small training sets, which makes them practical in low-resource settings. Extensive experiments across diverse problem classes demonstrate the efficiency of this approach.
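A miniature of the sparse-learning idea, with synthetic strong-branching-like scores and a hand-rolled ISTA solver for the lasso (the paper's actual features, solver, and data all differ): a handful of active coefficients suffices to fit the targets, yielding a tiny, interpretable model.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 500, 50, 5
X = rng.normal(size=(n, d))                    # candidate branching features
w_true = np.zeros(d)
w_true[:k] = 3.0 * rng.normal(size=k)          # only k features actually matter
y = X @ w_true + 0.1 * rng.normal(size=n)      # synthetic SB-score targets

lam = 0.1                                      # l1 penalty weight
step = n / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant of the gradient
w = np.zeros(d)
for _ in range(2000):                          # ISTA: gradient step + soft threshold
    z = w - step * (X.T @ (X @ w - y)) / n
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

nnz = int((np.abs(w) > 1e-8).sum())
print(nnz, "of", d, "features active")
```

The resulting model is a short weighted sum of features, cheap to evaluate at every branching decision on a CPU, which is the practical argument the abstract makes.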

[LG-63] PASM: Population Adaptive Symbolic Mixture-of-Experts Model for Cross-location Hurricane Evacuation Decision Prediction

链接: https://arxiv.org/abs/2604.00074
作者: Xiao Qian,Shangjia Dong
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Accurate prediction of evacuation behavior is critical for disaster preparedness, yet models trained in one region often fail elsewhere. Using a multi-state hurricane evacuation survey, we show this failure goes beyond feature distribution shift: households with similar characteristics follow systematically different decision patterns across states. As a result, single global models overfit dominant responses, misrepresent vulnerable subpopulations, and generalize poorly across locations. We propose Population-Adaptive Symbolic Mixture-of-Experts (PASM), which pairs large language model guided symbolic regression with a mixture-of-experts architecture. PASM discovers human-readable closed-form decision rules, specializes them to data-driven subpopulations, and routes each input to the appropriate expert at inference time. On Hurricanes Harvey and Irma data, transferring from Florida and Texas to Georgia with 100 calibration samples, PASM achieves a Matthews correlation coefficient of 0.607, compared to XGBoost (0.404), TabPFN (0.333), GPT-5-mini (0.434), and meta-learning baselines MAML and Prototypical Networks (MCC \leq 0.346). The routing mechanism assigns distinct formula archetypes to subpopulations, so the resulting behavioral profiles are directly interpretable. A fairness audit across four demographic axes finds no statistically significant disparities after Bonferroni correction. PASM closes more than half the cross-location generalization gap while keeping decision rules transparent enough for real-world emergency planning.
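A hypothetical miniature of the routing mechanism: each expert is a human-readable closed-form rule, and a router dispatches a household to the expert for its subpopulation. Every rule, feature name, and threshold below is invented for illustration; the paper discovers its rules via LLM-guided symbolic regression.

```python
import math

# Experts: human-readable closed-form evacuation-probability rules
# (all invented for illustration, not the paper's discovered formulas).
experts = {
    "coastal_renters": lambda hh: 1.0 / (1.0 + math.exp(-(2.0 * hh["surge_risk"] - 0.5))),
    "inland_owners": lambda hh: min(1.0, 0.2 + 0.6 * hh["evac_order"]),
}

def route(hh):
    """Router: assign a household to its subpopulation's expert."""
    return "coastal_renters" if hh["coastal"] and not hh["owns_home"] else "inland_owners"

def predict_evacuation(hh):
    return experts[route(hh)](hh)

household = {"coastal": True, "owns_home": False, "surge_risk": 0.9, "evac_order": 1}
p = predict_evacuation(household)
print(route(household), round(p, 3))
```

Because each prediction traces to one named formula, the behavioral profile of every subpopulation stays directly inspectable, which is the interpretability claim of the abstract.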

[LG-64] Evolution Strategies for Deep RL pretraining

链接: https://arxiv.org/abs/2604.00066
作者: Adrian Martínez,Ananya Gupta,Hanka Goralija,Mario Rico,Saúl Fenollosa,Tamar Alphaidze
类目: Machine Learning (cs.LG)
*备注: 12 pages, 3 figures, 2 algorithms; EE-568 Reinforcement learning course project

点击查看摘要

Abstract:Although Deep Reinforcement Learning has proven highly effective for complex decision-making problems, it demands significant computational resources and careful parameter adjustment in order to develop successful strategies. Evolution strategies offer a more straightforward, derivative-free approach that is less computationally costly and simpler to deploy. However, ES generally do not match the performance levels achieved by DRL, which calls into question their suitability for more demanding scenarios. This study examines the performance of ES and DRL across tasks of varying difficulty, including Flappy Bird, Breakout and Mujoco environments, as well as whether ES could be used for initial training to enhance DRL algorithms. The results indicate that ES do not consistently train faster than DRL. When used as a preliminary training step, they only provide benefits in less complex environments (Flappy Bird) and show minimal or no improvement in training efficiency or stability across different parameter settings when applied to more sophisticated tasks (Breakout and MuJoCo Walker).

[LG-65] Large Language Models for Analyzing Enterprise Architecture Debt in Unstructured Documentation

链接: https://arxiv.org/abs/2604.00046
作者: Christin Pagels,Simon Hacks,Rob Henk Bemthuis
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Author version, 2 figures, 5 tables. To appear in the Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26), 2026

点击查看摘要

Abstract:Enterprise Architecture Debt (EA Debt) arises from suboptimal design decisions and misaligned components that can degrade an organization’s IT landscape over time. Early indicators, Enterprise Architecture Smells (EA Smells), are currently mainly detected manually or only from structured artifacts, leaving much unstructured documentation under-analyzed. This study proposes an approach using a large language model (LLM) to identify and quantify EA Debt in unstructured architectural documentation. Following a design science research approach, we design and evaluate an LLM-based prototype for automated EA Smell detection. The artifact ingests unstructured documents (e.g., process descriptions, strategy papers), applies fine-tuned detection models, and outputs identified smells. We evaluate the prototype through a case study using synthetic yet realistic business documents, benchmarking against a custom GPT-based model. Results show that LLMs can detect multiple predefined EA Smells in unstructured text, with the benchmark model achieving higher precision and processing speed, and the fine-tuned on-premise model offering data protection advantages. The findings highlight opportunities for integrating LLM-based smell detection into EA governance practice.

[LG-66] Transformers for Program Termination

链接: https://arxiv.org/abs/2604.00039
作者: Yoav Alon,Cristina David
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Determining whether a program terminates is a core challenge in program analysis with direct implications for correctness, verification, and security. We investigate whether transformer architectures can recognise termination patterns directly from source code and how their strengths can be amplified through ensembles. To overcome the extreme scarcity of non-terminating examples, we design an ensemble framework of compact transformer encoders, systematically trained with a suite of imbalance-aware loss functions and class-aware sampling techniques. By combining models trained with distinct loss functions, our ensembles achieve substantially stronger performance than any single transformer, outperforming both powerful off-the-shelf LLMs and graph-based methods. Finally, we introduce an attribution pipeline that produces syntax-aware explanations for the termination estimation.

[LG-67] Learning and Generating Mixed States Prepared by Shallow Channel Circuits

链接: https://arxiv.org/abs/2604.01197
作者: Fangjun Hu,Christian Kokail,Milan Kornjača,Pedro L. S. Lopes,Weiyuan Gong,Sheng-Tao Wang,Xun Gao,Stefan Ostermann
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 44 pages, 13 figures, 1 table

点击查看摘要

Abstract:Learning quantum states from measurement data is a central problem in quantum information and computational complexity. In this work, we study the problem of learning to generate mixed states on a finite-dimensional lattice. Motivated by recent developments in mixed state phases of matter, we focus on arbitrary states in the trivial phase. A state belongs to the trivial phase if there exists a shallow preparation channel circuit under which local reversibility is preserved throughout the preparation. We prove that any mixed state in this class can be efficiently learned from measurement access alone. Specifically, given copies of an unknown trivial phase mixed state, our algorithm outputs a shallow local channel circuit that approximately generates this state in trace distance. The sample complexity and runtime are polynomial (or quasi-polynomial) in the number of qubits, assuming constant (or polylogarithmic) circuit depth and gate locality. Importantly, the learner is not given the original preparation circuit and relies only on its existence. Our results provide a structural foundation for quantum generative models based on shallow channel circuits. In the classical limit, our framework also inspires an efficient algorithm for classical diffusion models using only a polynomial overhead of training and generation.

[LG-68] Inverse Design of Optical Multilayer Thin Films using Robust Masked Diffusion Models

链接: https://arxiv.org/abs/2604.01106
作者: Jonas Schaible,Asena Karolin Özdemir,Charlotte Debus,Sven Burger,Achim Streit,Christiane Becker,Klaus Jäger,Markus Götz
类目: Optics (physics.optics); Machine Learning (cs.LG)
*备注: 24 pages, 14 Figures

点击查看摘要

Abstract:Inverse design of optical multilayer stacks seeks to infer layer materials, thicknesses, and ordering from a desired target spectrum. It is a long-standing challenge due to the large design space and non-unique solutions. We introduce \texttt{OptoLlama}, a masked diffusion language model for inverse thin-film design from optical spectra. Representing multilayer stacks as sequences of material-thickness tokens, \texttt{OptoLlama} conditions generation on reflectance, absorptance, and transmittance spectra and learns a probabilistic mapping from optical response to structure. Evaluated on a representative test set of 3,000 targets, \texttt{OptoLlama} reduces the mean absolute spectral error by 2.9-fold relative to a nearest-neighbor template baseline and by 3.45-fold relative to the state-of-the-art data-driven baseline, called \texttt{OptoGPT}. Case studies on designed and expert-defined targets show that the model reproduces characteristic spectral features and recovers physically meaningful stack motifs, including distributed Bragg reflectors. These results establish diffusion-based sequence modeling as a powerful framework for inverse photonic design.

[LG-69] Focal plane wavefront control with model-based reinforcement learning

链接: https://arxiv.org/abs/2604.00993
作者: Jalo Nousiainen,Iremsu Taskin,Markus Kasper,Gilles Orban De Xivry,Olivier Absil
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 13 pages, 11 figures, accepted by A&A

点击查看摘要

Abstract:The direct imaging of potentially habitable exoplanets is one prime science case for high-contrast imaging instruments on extremely large telescopes. Most such exoplanets orbit close to their host stars, where their observation is limited by fast-moving atmospheric speckles and quasi-static non-common-path aberrations (NCPA). Conventional NCPA correction methods often use mechanical mirror probes, which compromise performance during operation. This work presents machine-learning-based NCPA control methods that automatically detect and correct both dynamic and static NCPA errors by leveraging sequential phase diversity. We extend previous work in reinforcement learning for AO to focal plane control. A new model-based RL algorithm, Policy Optimization for NCPAs (PO4NCPA), interprets the focal-plane image as input data and, through sequential phase diversity, determines phase corrections that optimize both non-coronagraphic and post-coronagraphic PSFs without prior system knowledge. Further, we demonstrate the effectiveness of this approach by numerically simulating static NCPA errors on a ground-based telescope and an infrared imager affected by water-vapor-induced seeing (dynamic NCPAs). Simulations show that PO4NCPA robustly compensates static and dynamic NCPAs. In static cases, it achieves near-optimal focal-plane light suppression with a coronagraph and near-optimal Strehl without one. With dynamic NCPAs, it matches the performance of the modal least-squares reconstruction combined with a 1-step delay integrator in these metrics. The method remains effective for the ELT pupil, vector vortex coronagraph, and under photon and background noise. PO4NCPA is model-free and can be directly applied to standard imaging as well as to any coronagraph. Its sub-millisecond inference times and performance also make it suitable for real-time low-order correction of atmospheric turbulence beyond HCI.

[LG-70] Multi-Mode Quantum Annealing for Variational Autoencoders with General Boltzmann Priors

链接: https://arxiv.org/abs/2604.00919
作者: Gilhan Kim,Daniel K. Park
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Variational autoencoders (VAEs) learn compact latent representations of complex data, but their generative capacity is fundamentally constrained by the choice of prior distribution over the latent space. Energy-based priors offer a principled way to move beyond factorized assumptions and capture structured interactions among latent variables, yet training such priors at scale requires accurate and efficient sampling from intractable distributions. Here we present Boltzmann-machine–prior VAEs (BM-VAEs) trained using quantum annealing–based sampling in three distinct operational modes within a single generative system. During training, diabatic quantum annealing (DQA) provides unbiased Boltzmann samples for gradient estimation of the energy-based prior; for unconditional generation, slower quantum annealing (QA) concentrates samples near low-energy minima; for conditional generation, bias fields are added to direct sampling toward attribute-specific regions of the energy landscape (c-QA). Using up to 2000 qubits on a D-Wave Advantage2 processor, we demonstrate stable and efficient training across multiple datasets, with faster convergence and lower reconstruction loss than a Gaussian-prior VAE. The learned Boltzmann prior enables unconditional generation by sampling directly from the energy-based latent distribution, a capability that plain autoencoders lack, and conditional generation through latent biasing that leverages the learned pairwise interactions.
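A classical Gibbs-sampling stand-in for the conditional (c-QA) mode described above (on hardware the annealer produces the samples): adding a bias field to a small Boltzmann machine steers sampling toward an attribute-specific region, visible here as the biased spin's magnetization moving toward +1. Couplings and fields are random placeholders, not a trained prior.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
J = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)
J = J + J.T                                    # symmetric couplings, zero diagonal
h = rng.normal(scale=0.1, size=n)              # fields (placeholders for learned values)

def gibbs_magnetization(bias, sweeps=4000):
    """Mean spin values under the Boltzmann distribution tilted by `bias`."""
    s = rng.choice([-1, 1], size=n)
    mags = np.zeros(n)
    for t in range(sweeps):
        for i in range(n):                     # single-site heat-bath updates
            field = J[i] @ s + h[i] + bias[i]
            s[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1
        if t >= sweeps // 2:                   # discard burn-in
            mags += s
    return mags / (sweeps - sweeps // 2)

bias = np.zeros(n)
m_free = gibbs_magnetization(bias)             # unconditional sampling mode
bias[0] = 2.0                                  # c-QA-style bias on spin 0
m_cond = gibbs_magnetization(bias)
print(round(m_free[0], 2), "->", round(m_cond[0], 2))
```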

[LG-71] Deconfounding Scores and Representation Learning for Causal Effect Estimation with Weak Overlap AISTATS2026

链接: https://arxiv.org/abs/2604.00811
作者: Oscar Clivio,Alexander D’Amour,Alexander Franks,David Bruns-Smith,Chris Holmes,Avi Feller
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: To appear at AISTATS 2026

点击查看摘要

Abstract:Overlap, also known as positivity, is a key condition for causal treatment effect estimation. Many popular estimators suffer from high variance and become brittle when features differ strongly across treatment groups. This is especially challenging in high dimensions: the curse of dimensionality can make overlap implausible. To address this, we propose a class of feature representations called deconfounding scores, which preserve both identification and the target of estimation; the classical propensity and prognostic scores are two special cases. We characterize the problem of finding a representation with better overlap as minimizing an overlap divergence under a deconfounding score constraint. We then derive closed-form expressions for a class of deconfounding scores under a broad family of generalized linear models with Gaussian features and show that prognostic scores are overlap-optimal within this class. We conduct extensive experiments to assess this behavior empirically.
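A sketch of one member of the deconfounding-score class, the prognostic score: regress the outcome on covariates in the control group and use the fitted prediction as a one-dimensional representation. The synthetic data-generating process and the crude stratified check below are illustrative only, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, tau = 5000, 10, 2.0
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
treat = rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 0]))  # confounded assignment
y = X @ beta + rng.normal(size=n) + tau * treat          # constant effect tau

# Prognostic score: fit the outcome model on controls only, predict for all units.
beta_hat, *_ = np.linalg.lstsq(X[~treat], y[~treat], rcond=None)
score = X @ beta_hat                                     # 1-D representation

# Crude check: within a thin score stratum, the treated-minus-control outcome
# difference approximates tau even though the raw covariates differ by group.
stratum = np.abs(score) < 0.5
tau_hat = y[stratum & treat].mean() - y[stratum & ~treat].mean()
print(round(float(tau_hat), 2))
```

Collapsing ten covariates to one score is exactly the kind of dimension reduction that restores overlap; the paper characterizes which such reductions preserve identification.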

[LG-72] Inverse-Free Sparse Variational Gaussian Processes AISTATS2026

链接: https://arxiv.org/abs/2604.00697
作者: Stefano Cortinovis,Laurence Aitchison,Stefanos Eleftheriadis,Mark van der Wilk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted to AISTATS 2026. 20 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Gaussian processes (GPs) offer appealing properties but are costly to train at scale. Sparse variational GP (SVGP) approximations reduce cost yet still rely on Cholesky decompositions of kernel matrices, ill-suited to low-precision, massively parallel hardware. While one can construct valid variational bounds that rely only on matrix multiplications (matmuls) via an auxiliary matrix parameter, optimising them with off-the-shelf first-order methods is challenging. We make the inverse-free approach practical by proposing a better-conditioned bound and deriving a matmul-only natural-gradient update for the auxiliary parameter, markedly improving stability and convergence. We further provide simple heuristics, such as step-size schedules and stopping criteria, that make the overall optimisation routine fit seamlessly into existing workflows. Across regression and classification benchmarks, we demonstrate that our method 1) serves as a drop-in replacement in SVGP-based models (e.g., deep GPs), 2) recovers similar performance to traditional methods, and 3) can be faster than baselines when well tuned.

[LG-73] Neural Ordinary Differential Equations for Modeling Socio-Economic Dynamics

链接: https://arxiv.org/abs/2604.00632
作者: Sandeep Kumar Samota,Snehashish Chakraverty,Narayan Sethi
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Poverty is a complex dynamic challenge that cannot be adequately captured using predefined differential equations. Machine learning (ML) methods have demonstrated significant potential in modelling real-world dynamical systems. Among these, Neural Ordinary Differential Equations (Neural ODEs) have emerged as a powerful, data-driven approach for learning continuous-time dynamics directly from observations. This chapter applies the Neural ODE framework to analyze poverty dynamics in the Indian state of Odisha. Specifically, we utilize time-series data from 2007 to 2020 on the key indicators of economic development and poverty reduction. Within the Neural ODE architecture, the temporal gradient of the system is represented by a multi-layer perceptron (MLP). The resulting neural dynamical system is integrated using a numerical ODE solver to obtain the trajectory of the system over time. During backpropagation, the adjoint sensitivity method is used for gradient computation, enabling effective training through the ODE solver. The trained Neural ODE model reproduces the observed data with high accuracy, demonstrating the capability of Neural ODEs to capture the dynamics of the poverty indicator for concrete-structured households. The results show that ML methods, such as Neural ODEs, can serve as effective tools for modeling socioeconomic transitions and can provide policymakers with reliable projections, supporting more informed and effective decision-making for poverty alleviation.
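The forward pass described above can be sketched with an explicit Euler integrator driving an untrained toy MLP vector field (the paper uses a proper ODE solver and adjoint-based training; the weights here are random placeholders, and only the mechanics are shown).

```python
import numpy as np

rng = np.random.default_rng(7)
W1, b1 = rng.normal(scale=0.5, size=(16, 2)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(2, 16)), np.zeros(2)

def f(h):
    """MLP vector field defining the dynamics dh/dt = f(h)."""
    return W2 @ np.tanh(W1 @ h + b1) + b2

def odeint_euler(h0, t0, t1, steps=1000):
    """Integrate dh/dt = f(h) with explicit Euler steps, returning the trajectory."""
    h, dt = h0.astype(float), (t1 - t0) / steps
    traj = [h.copy()]
    for _ in range(steps):
        h = h + dt * f(h)
        traj.append(h.copy())
    return np.array(traj)

traj = odeint_euler(np.array([1.0, 0.0]), 0.0, 5.0)
print(traj.shape)
```

Training would backpropagate a loss on `traj` through the solver; the adjoint method does this by integrating a second ODE backward in time rather than storing every step.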

[LG-74] Scenario theory for multi-criteria data-driven decision making

链接: https://arxiv.org/abs/2604.00553
作者: Simone Garatti,Lucrezia Manieri,Alessandro Falsone,Algo Carè,Marco C. Campi,Maria Prandini
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The scenario approach provides a powerful data-driven framework for designing solutions under uncertainty with rigorous probabilistic robustness guarantees. Existing theory, however, primarily addresses assessing robustness with respect to a single appropriateness criterion for the solution based on a dataset, whereas many practical applications - including multi-agent decision problems - require the simultaneous consideration of multiple criteria and the assessment of their robustness based on multiple datasets, one per criterion. This paper develops a general scenario theory for multi-criteria data-driven decision making. A central innovation lies in the collective treatment of the risks associated with violations of individual criteria, which yields substantially more accurate robustness certificates than those derived from a naive application of standard results. In turn, this approach enables a sharper quantification of the robustness level with which all criteria are simultaneously satisfied. The proposed framework applies broadly to multi-criteria data-driven decision problems, providing a principled, scalable, and theoretically grounded methodology for design under uncertainty.

[LG-75] Activation Saturation and Floquet Spectrum Collapse in Neural ODEs

链接: https://arxiv.org/abs/2604.00543
作者: Nikolaos M. Matzakos
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:We prove that activation saturation imposes a structural dynamical limitation on autonomous Neural ODEs $\dot{h}=f_\theta(h)$ with saturating activations ($\tanh$, sigmoid, etc.): if $q$ hidden layers of the MLP $f_\theta$ satisfy $|\sigma'|\le\delta$ on a region $U$, the input Jacobian is attenuated as $\|Df_\theta(x)\|\le C(U)$ (for activations with $\sup_x|\sigma'(x)|\le 1$, e.g. $\tanh$ and sigmoid, this reduces to $C_W\delta^q$), forcing every Floquet (Lyapunov) exponent along any $T$-periodic orbit $\gamma\subset U$ into the interval $[-C(U),\,C(U)]$. This is a collapse of the Floquet spectrum: as saturation deepens ($\delta\to 0$), all exponents are driven to zero, limiting both strong contraction and chaotic sensitivity. The obstruction is structural – it constrains the learned vector field at inference time, independent of training quality. As a secondary contribution, for activations with $\sigma'>0$, a saturation-weighted spectral factorisation yields a refined bound $\widetilde{C}(U)\le C(U)$ whose improvement is amplified exponentially in $T$ at the flow level. All results are numerically illustrated on the Stuart–Landau oscillator; the bounds provide a theoretical explanation for the empirically observed failure of $\tanh$-NODEs on the Morris–Lecar neuron model.
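The attenuation mechanism behind the bound can be checked numerically: pushing a random tanh MLP into its saturating regime (here simply by scaling the input) shrinks the input Jacobian's spectral norm, in line with the $C_W\delta^q$ factor. The network is a random placeholder, not a trained NODE, so this only illustrates the mechanism.

```python
import numpy as np

rng = np.random.default_rng(8)
q, width = 3, 32
Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width)) for _ in range(q)]

def jacobian_norm(x):
    """Spectral norm of the input Jacobian of a q-layer tanh MLP at x."""
    J = np.eye(width)
    h = x
    for W in Ws:
        pre = W @ h
        J = np.diag(1.0 - np.tanh(pre) ** 2) @ W @ J   # chain rule, layer by layer
        h = np.tanh(pre)
    return np.linalg.norm(J, 2)

x = rng.normal(size=width)
norms = [jacobian_norm(s * x) for s in (0.1, 1.0, 10.0, 100.0)]
print([round(v, 4) for v in norms])   # norm decays as saturation deepens
```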

[LG-76] Denoising distances beyond the volumetric barrier

链接: https://arxiv.org/abs/2604.00432
作者: Han Huang,Pakawut Jiradilok,Elchanan Mossel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We study the problem of reconstructing the latent geometry of a d -dimensional Riemannian manifold from a random geometric graph. While recent works have made significant progress in manifold recovery from random geometric graphs, and more generally from noisy distances, the precision of pairwise distance estimation has been fundamentally constrained by the volumetric barrier, namely the natural sample-spacing scale n^-1/d coming from the fact that a generic point of the manifold typically lies at distance of order n^-1/d from the nearest sampled point. In this paper, we introduce a novel approach, Orthogonal Ring Distance Estimation Routine (ORDER), which achieves a pointwise distance estimation precision of order n^-2/(d+5) up to polylogarithmic factors in n in polynomial time. This strictly beats the volumetric barrier for dimensions d > 5 . As a consequence of obtaining pointwise precision better than n^-1/d , we prove that the Gromov–Wasserstein distance between the reconstructed metric measure space and the true latent manifold is of order n^-1/d . This matches the Wasserstein convergence rate of empirical measures, demonstrating that our reconstructed graph metric is asymptotically as good as having access to the full pairwise distance matrix of the sampled points. Our results are proven in a very general setting which includes general models of noisy pairwise distances, sparse random geometric graphs, and unknown connection probability functions.
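The volumetric barrier n^-1/d is just the nearest-sample spacing, which a small simulation makes concrete. Uniform samples in the unit square stand in for the manifold here; the sample sizes and trial count are illustrative:

```python
import math
import random

random.seed(0)

def mean_nearest_dist(n, d, trials=40):
    # average distance from a random query point to the nearest of n uniform
    # samples in the unit cube [0, 1]^d -- the n^(-1/d) sample-spacing scale
    total = 0.0
    for _ in range(trials):
        pts = [[random.random() for _ in range(d)] for _ in range(n)]
        q = [random.random() for _ in range(d)]
        total += min(math.dist(q, p) for p in pts)
    return total / trials

d = 2
near_small, near_large = mean_nearest_dist(200, d), mean_nearest_dist(3200, d)
ratio = near_small / near_large
print(ratio)  # with 16x more samples in d = 2, spacing shrinks by roughly 16^(1/2) = 4
```

The n^-1/d scaling is what ORDER's pointwise rate n^-2/(d+5) improves on for d > 5.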

[LG-77] Breaking Data Symmetry is Needed For Generalization in Feature Learning Kernels

链接: https://arxiv.org/abs/2604.00316
作者: Marcel Tomàs Bernal,Neil Rohit Mallinar,Mikhail Belkin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Grokking occurs when a model achieves high training accuracy but generalization to unseen test points happens long after that. This phenomenon was initially observed on a class of algebraic problems, such as learning modular arithmetic (Power et al., 2022). We study grokking on algebraic tasks in a class of feature learning kernels via the Recursive Feature Machine (RFM) algorithm (Radhakrishnan et al., 2024), which iteratively updates feature matrices through the Average Gradient Outer Product (AGOP) of an estimator in order to learn task-relevant features. Our main experimental finding is that generalization occurs only when a certain symmetry in the training set is broken. Furthermore, we empirically show that RFM generalizes by recovering the underlying invariance group action inherent in the data. We find that the learned feature matrices encode specific elements of the invariance group, explaining the dependence of generalization on symmetry.
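The AGOP at the heart of RFM can be illustrated directly: for a target depending on a single coordinate, the average gradient outer product concentrates on that direction. As a simplification, the sketch below uses the true target's gradient in place of the fitted estimator's gradient that RFM actually averages:

```python
import math
import random

random.seed(1)

# Toy target depending only on the first of d = 3 coordinates: f(x) = sin(2 * x0).
def grad_f(x):
    return [2.0 * math.cos(2.0 * x[0]), 0.0, 0.0]

d = 3
samples = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(500)]

# Average Gradient Outer Product: M = (1/n) * sum_i grad f(x_i) grad f(x_i)^T
M = [[0.0] * d for _ in range(d)]
for x in samples:
    g = grad_f(x)
    for i in range(d):
        for j in range(d):
            M[i][j] += g[i] * g[j] / len(samples)

print(M[0][0], M[1][1], M[2][2])  # mass concentrates on the informative coordinate
```

The resulting feature matrix M is (up to noise) a rank-one projector onto the informative direction, which is the kind of learned structure the paper relates to the data's invariance group.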

[LG-78] Genetic algorithms for multi-omic feature selection: a comparative study in cancer survival analysis

链接: https://arxiv.org/abs/2604.00065
作者: Luca Cattelani,Vittorio Fortino
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-omic datasets offer opportunities for improved biomarker discovery in cancer research, but their high dimensionality and limited sample sizes make identifying compact and effective biomarker panels challenging. Feature selection in large-scale omics can be efficiently addressed by combining machine learning with genetic algorithms, which naturally support multi-objective optimization of predictive accuracy and biomarker set size. However, genetic algorithms remain relatively underexplored for multi-omic feature selection, where most approaches concatenate all layers into a single feature space. To address this limitation, we introduce Sweeping*, a multi-view, multi-objective algorithm alternating between single- and multi-view optimization. It employs a nested single-view multi-objective optimizer, and for this study we use the genetic algorithm NSGA3-CHS. It first identifies informative biomarkers within each layer, then jointly evaluates cross-layer interactions; these multi-omic solutions guide the next single-view search. Through repeated sweeps, the algorithm progressively identifies compact biomarker panels capturing cross-modal complementary signals. We benchmark five Sweeping* strategies, including hierarchical and concatenation-based variants, using survival prediction on three TCGA cohorts. Each strategy jointly optimizes predictive accuracy and set size, measured via the concordance index and root-leanness. Overall performance and estimation error are assessed through cross hypervolume and Pareto delta under 5-fold cross-validation. Our results show that Sweeping* can improve the accuracy-complexity trade-off when sufficient survival signal is present and that integrating omic layers can enhance survival prediction beyond clinical-only models, although benefits remain cohort-dependent.
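The accuracy-versus-size trade-off that Sweeping* optimizes reduces, at each step, to retaining the nondominated candidates. A minimal Pareto filter, with made-up candidate panels (maximize concordance index, minimize panel size), might look like:

```python
# Hypothetical candidate biomarker panels scored on the two objectives the paper
# optimizes: concordance index (higher is better) and panel size (lower is better).
candidates = [
    {"panel": "A", "c_index": 0.71, "size": 40},
    {"panel": "B", "c_index": 0.70, "size": 12},
    {"panel": "C", "c_index": 0.66, "size": 5},
    {"panel": "D", "c_index": 0.64, "size": 30},  # dominated by B
    {"panel": "E", "c_index": 0.71, "size": 25},  # dominates A on size
]

def dominates(p, q):
    # p dominates q: no worse on both objectives, strictly better on at least one
    better_or_equal = p["c_index"] >= q["c_index"] and p["size"] <= q["size"]
    strictly_better = p["c_index"] > q["c_index"] or p["size"] < q["size"]
    return better_or_equal and strictly_better

front = [p for p in candidates
         if not any(dominates(q, p) for q in candidates)]
print(sorted(c["panel"] for c in front))
```

NSGA3-CHS and the sweeping strategy add reference-direction selection and the single-/multi-view alternation on top of this basic dominance test.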

[LG-79] Forecast collapse of transformer-based models under squared loss in financial time series

链接: https://arxiv.org/abs/2604.00064
作者: Pierre Andreoletti (IDP)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:We study trajectory forecasting under squared loss for time series with weak conditional structure, using highly expressive prediction models. Building on the classical characterization of squared-loss risk minimization, we emphasize regimes in which the conditional expectation of future trajectories is effectively degenerate, leading to trivial Bayes-optimal predictors (flat for prices and zero for returns in standard financial settings). In this regime, increased model expressivity does not improve predictive accuracy but instead introduces spurious trajectory fluctuations around the optimal predictor. These fluctuations arise from the reuse of noise and result in increased prediction variance without any reduction in bias. This provides a process-level explanation for the degradation of Transformer-based forecasts on financial time series. We complement these theoretical results with numerical experiments on high-frequency EUR/USD exchange rate data, analyzing the distribution of trajectory-level forecasting errors. The results show that Transformer-based models yield larger errors than a simple linear benchmark on a large majority of forecasting windows, consistent with the variance-driven mechanism identified by the theory.
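The degenerate-conditional-expectation regime is easy to reproduce: on a simulated martingale, the flat (last-price) forecast is squared-loss optimal, and an "expressive" forecast that wiggles around it only adds variance. All parameters below are illustrative:

```python
import random

random.seed(7)

# Hypothetical martingale price series: zero-mean Gaussian increments, so the
# conditional expectation of the horizon price is the last observed price.
def simulate_and_score(n_paths=2000, horizon=5, wiggle=0.5):
    mse_flat = 0.0
    mse_noisy = 0.0
    for _ in range(n_paths):
        price = 100.0
        for _ in range(horizon):
            price += random.gauss(0.0, 1.0)
        flat_pred = 100.0                                # Bayes-optimal: flat forecast
        noisy_pred = 100.0 + random.gauss(0.0, wiggle)   # spurious fluctuation reusing noise
        mse_flat += (price - flat_pred) ** 2 / n_paths
        mse_noisy += (price - noisy_pred) ** 2 / n_paths
    return mse_flat, mse_noisy

mse_flat, mse_noisy = simulate_and_score()
print(mse_flat, mse_noisy)  # the wiggle adds pure variance, with no bias reduction
```

The gap between the two MSEs is exactly the variance of the spurious fluctuation, the mechanism the paper identifies for Transformer forecasts.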

[LG-80] Scaled Gradient Descent for Ill-Conditioned Low-Rank Matrix Recovery with Optimal Sampling Complexity

链接: https://arxiv.org/abs/2604.00060
作者: Zhenxuan Li,Meng Huang
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The low-rank matrix recovery problem seeks to reconstruct an unknown n_1 \times n_2 rank- r matrix from m linear measurements, where m\ll n_1n_2 . This problem has been extensively studied over the past few decades, leading to a variety of algorithms with solid theoretical guarantees. Among these, gradient descent based non-convex methods have become particularly popular due to their computational efficiency. However, these methods typically suffer from two key limitations: a sub-optimal sample complexity of O((n_1 + n_2)r^2) and an iteration complexity of O(\kappa \log(1/\epsilon)) to achieve \epsilon -accuracy, resulting in slow convergence when the target matrix is ill-conditioned. Here, \kappa denotes the condition number of the unknown matrix. Recent studies show that a preconditioned variant of GD, known as scaled gradient descent (ScaledGD), can significantly reduce the iteration complexity to O(\log(1/\epsilon)) . Nonetheless, its sample complexity remains sub-optimal at O((n_1 + n_2)r^2) . In contrast, a delicate virtual sequence technique demonstrates that the standard GD in the positive semidefinite (PSD) setting achieves the optimal sample complexity O((n_1 + n_2)r) , but converges more slowly with an iteration complexity O(\kappa^2 \log(1/\epsilon)) . In this paper, through a more refined analysis, we show that ScaledGD achieves both the optimal sample complexity O((n_1 + n_2)r) and the improved iteration complexity O(\log(1/\epsilon)) . Notably, our results extend beyond the PSD setting to general low-rank matrix recovery problem. Numerical experiments further validate that ScaledGD accelerates convergence for ill-conditioned matrices with the optimal sampling complexity.
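The ScaledGD update the abstract analyzes applies the preconditioners (Y^T Y)^{-1} and (X^T X)^{-1} to the factored gradient. A pure-Python sketch on a hypothetical ill-conditioned 2x2 target (condition number 100; the target, initialization, and step size are illustrative):

```python
# Minimize f(X, Y) = 0.5 * ||X Y^T - M||_F^2 with the ScaledGD updates
#   X <- X - eta * (X Y^T - M) Y (Y^T Y)^{-1}
#   Y <- Y - eta * (X Y^T - M)^T X (X^T X)^{-1}

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inv2(A):  # inverse of a 2x2 matrix
    a, b = A[0]; c, d = A[1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fro(A):
    return sum(x * x for row in A for x in row) ** 0.5

M = [[10.0, 0.0], [0.0, 0.1]]          # singular values 10 and 0.1 (kappa = 100)
X = [[3.0, 0.0], [0.0, 1.0]]           # crude initialization
Y = [[3.0, 0.0], [0.0, 0.01]]
eta = 0.5

for _ in range(100):
    R = sub(matmul(X, transpose(Y)), M)           # residual X Y^T - M
    GX = matmul(matmul(R, Y), inv2(matmul(transpose(Y), Y)))
    GY = matmul(matmul(transpose(R), X), inv2(matmul(transpose(X), X)))
    X = sub(X, [[eta * v for v in row] for row in GX])
    Y = sub(Y, [[eta * v for v in row] for row in GY])

err = fro(sub(matmul(X, transpose(Y)), M))
print(err)
```

The preconditioners make the effective step size scale-invariant across singular values, so the tiny 0.1 direction converges as fast as the large one, which is the condition-number-free iteration complexity the paper establishes.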

[LG-81] Isomorphic Functionalities between Ant Colony and Ensemble Learning: Part II-On the Strength of Weak Learnability and the Boosting Paradigm

链接: https://arxiv.org/abs/2604.00038
作者: Ernest Fokoué,Gregory Babbitt,Yuval Levental
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages, 5 figures, 4 tables

点击查看摘要

Abstract:In Part I of this series, we established a rigorous mathematical isomorphism between ant colony decision-making and random forest learning, demonstrating that variance reduction through decorrelation is a universal principle shared by biological and computational ensembles. Here we turn to the complementary mechanism: bias reduction through adaptive weighting. Just as boosting algorithms sequentially focus on difficult instances, ant colonies dynamically amplify successful foraging paths through pheromone-mediated recruitment. We prove that these processes are mathematically isomorphic, establishing that the fundamental theorem of weak learnability has a direct analog in colony decision-making. We develop a formal mapping between AdaBoost’s adaptive reweighting and ant recruitment dynamics, show that the margin theory of boosting corresponds to the stability of quorum decisions, and demonstrate through comprehensive simulation that ant colonies implementing adaptive recruitment achieve the same bias-reduction benefits as boosting algorithms. This completes a unified theory of ensemble intelligence, revealing that both variance reduction (Part I) and bias reduction (Part II) are manifestations of the same underlying mathematical principles governing collective intelligence in biological and computational systems.
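The adaptive reweighting the paper maps onto pheromone-mediated recruitment is AdaBoost's update w_i <- w_i * exp(-alpha * y_i * h(x_i)). A minimal stump-based AdaBoost on toy 1D data (the dataset and stump thresholds are invented for illustration):

```python
import math

# Toy 1D dataset that no single threshold stump separates.
X = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
y = [+1, +1, -1, +1, -1, -1]

def stump(theta, s):
    # weak learner h(x) = s * sign(x - theta)
    return lambda x: s if x > theta else -s

w = [1.0 / len(X)] * len(X)            # uniform initial instance weights
ensemble = []
for _ in range(10):
    # choose the stump minimizing the weighted training error
    best = min((stump(t, s) for t in (1.0, 2.0, 3.0, 4.0, 5.0) for s in (1, -1)),
               key=lambda h: sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi))
    err = max(sum(wi for wi, xi, yi in zip(w, X, y) if best(xi) != yi), 1e-10)
    alpha = 0.5 * math.log((1.0 - err) / err)
    ensemble.append((alpha, best))
    # adaptive reweighting: amplify the weight of misclassified points
    w = [wi * math.exp(-alpha * yi * best(xi)) for wi, xi, yi in zip(w, X, y)]
    Z = sum(w)
    w = [wi / Z for wi in w]

def predict(x):
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

train_err = sum(predict(xi) != yi for xi, yi in zip(X, y)) / len(X)
print(train_err)
```

Each weak learner is only slightly better than chance, yet the weighted vote classifies the training set perfectly, which is the weak-learnability phenomenon whose colony analog the paper develops.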

[LG-82] Decomposable Reward Modeling and Realistic Environment Design for Reinforcement Learning-Based Forex Trading

链接: https://arxiv.org/abs/2604.00031
作者: Nabeel Ahmad Saidd
类目: General Finance (q-fin.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Applying reinforcement learning (RL) to foreign exchange (Forex) trading remains challenging because realistic environments, well-defined reward functions, and expressive action spaces must be satisfied simultaneously, yet many prior studies rely on simplified simulators, single scalar rewards, and restricted action representations, limiting both interpretability and practical relevance. This paper presents a modular RL framework designed to address these limitations through three tightly integrated components: a friction-aware execution engine that enforces strict anti-lookahead semantics, with observations at time t, execution at time t+1, and mark-to-market at time t+1, while incorporating realistic costs such as spread, commission, slippage, rollover financing, and margin-triggered liquidation; a decomposable 11-component reward architecture with fixed weights and per-step diagnostic logging to enable systematic ablation and component-level attribution; and a 10-action discrete interface with legal-action masking that encodes explicit trading primitives while enforcing margin-aware feasibility constraints. Empirical evaluation on EURUSD focuses on learning dynamics rather than generalization and reveals strongly non-monotonic reward interactions, where additional penalties do not reliably improve outcomes; the full reward configuration achieves the highest training Sharpe (0.765) and cumulative return (57.09 percent). The expanded action space increases return but also turnover and reduces Sharpe relative to a conservative 3-action baseline, indicating a return-activity trade-off under a fixed training budget, while scaling-enabled variants consistently reduce drawdown, with the combined configuration achieving the strongest endpoint performance.
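The decomposable reward with per-step diagnostic logging can be sketched as named components combined under fixed weights. The component names and weights below are invented for illustration; the paper's architecture uses 11 components:

```python
# Fixed weights per reward component (illustrative, not the paper's values).
WEIGHTS = {"pnl": 1.0, "transaction_cost": -0.5, "drawdown_penalty": -0.2}

def step_reward(components, log):
    # weight each raw component, record the breakdown, return the scalar reward
    contributions = {k: WEIGHTS[k] * v for k, v in components.items()}
    log.append(contributions)           # per-step diagnostic record
    return sum(contributions.values())

log = []
r1 = step_reward({"pnl": 0.8, "transaction_cost": 0.1, "drawdown_penalty": 0.0}, log)
r2 = step_reward({"pnl": -0.3, "transaction_cost": 0.1, "drawdown_penalty": 0.5}, log)

# Component-level attribution over the episode, enabling systematic ablation:
total_by_component = {k: sum(rec[k] for rec in log) for k in WEIGHTS}
print(r1, r2, total_by_component)
```

Keeping the per-step breakdown alongside the scalar reward is what lets the paper ablate components individually and attribute outcomes to specific penalty terms.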

附件下载

点击下载今日全部论文列表