本篇博文主要内容为 2026-05-19 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-05-19)
今日共更新1524篇论文,其中:
- 自然语言处理共201篇(Computation and Language (cs.CL))
- 人工智能共567篇(Artificial Intelligence (cs.AI))
- 计算机视觉共374篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共504篇(Machine Learning (cs.LG))
- 多智能体系统共29篇(Multiagent Systems (cs.MA))
- 信息检索共36篇(Information Retrieval (cs.IR))
- 人机交互共69篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] Beyond Scaling: Agents Are Heading to the Edge
【速读】:该论文试图解决的问题是:当前生成式 AI(Generative AI)代理智能的瓶颈已从将世界知识压缩到单一模型,转变为如何构建一个能够协调执行的系统。其核心挑战在于,传统以云为中心的设计无法有效支持代理任务对高保真本地上下文的结构耦合以及零延迟执行循环的需求。解决方案的关键在于三大结构性转变:第一,“前额叶转向”(Prefrontal Turn),即代理能力的核心杠杆从预训练规模转向框架级执行控制,且这种控制必须物理上靠近行动环境以维持认知一致性;第二,“数据地理悖论”(Data-Geography Paradox),即本地文件系统、实时传感器流和临时操作系统状态等“暗物质”数据在传输至云端后会失真或失效,导致代理失去真实上下文;第三,“交互对齐环”(interaction-alignment loop),即唯一经济且生态可持续的代理优化数据来源,是通过本地实时交互产生的高保真隐式偏好信号。这些转变共同论证了个人代理架构必须向边缘迁移。
链接: https://arxiv.org/abs/2605.18535
作者: Chunlin Tian,Dongqi Cai,Wanru Zhao,Nicholas D. Lane
机构: University of Cambridge; University of Macau; Nanjing University
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:The bottleneck of useful agentic intelligence has shifted from compressing world knowledge into a single model to executing a coordinated system. This position paper argues that personal-agent architecture must move to the edge because the core properties of agentic intelligence tasks, particularly their structural coupling with high-fidelity local context and the need for zero-latency execution loops, do not sit well with cloud-centric designs. We develop this claim through three structural shifts. First, the Prefrontal Turn: the main marginal lever of capability has moved from pre-training scale to framework-level executive control. Such control must remain physically close to the environment of action if the agent is to preserve cognitive alignment. Second, the Data-Geography Paradox, the ``dark matter’’ of agentic data (local file hierarchies, real-time sensor streams, and transient OS states) degrades, disappears, or loses meaning once prepared for cloud transmission, thereby cutting the agent off from ground-truth context. Third, the interaction-alignment loop, the only economically and ecologically sustainable source of agentic refinement data is the high-fidelity implicit preference signal produced through real-time local interaction. Third, the interaction-alignment loop, the only economically and ecologically sustainable source of agentic refinement data is the high-fidelity implicit preference signal produced through real-time local interaction. We conclude with falsifiable predictions for the next deployment cycle of personal agents.
[MA-1] he Dynamics of Policy Gradient in Social Dilemmas with Partner Selection
【速读】:该论文试图解决的问题是:在社会困境中,自利的学习智能体如何在合作(促进群体利益)与背叛(获取即时奖励)之间做出选择,尤其是在存在伙伴选择机制时,如何从理论上解析合作行为的演化机制。解决方案的关键在于通过政策梯度(policy-gradient)动力学建模,分析伙伴选择如何改变对手分布并重塑奖励景观,并证明在简单规则下这种机制能促进合作;特别地,论文指出种群方差是合作出现的必要条件,并基于二维Wiener过程构建随机模型以刻画伙伴选择的随机效应,进而推导出种群具有合作促进性的充分条件及稳态分布的存在性,仿真结果验证了该理论框架对学习动态的准确捕捉能力。
链接: https://arxiv.org/abs/2605.18185
作者: Benedict Russell,Chin-wing Leung,Paolo Turrini
机构: Warwick University (华威大学)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:In social dilemmas self-interested learning agents face the choice between the societal benefit of cooperation and the immediate reward of defection. Significant evidence exists on the benefits of assortment mechanisms such as partner selection for the emergence of cooperation, but this is largely available through agent-based simulations. In this paper, we provide an analytical solution to the problem, studying the policy-gradient dynamics in a multi-agent environment with partner selection. We show how partner selection changes the opponent distribution and hence the reward landscape, and prove this promotes cooperation under simple rules known from the literature. In particular, we find that population variance is a necessary condition for cooperation to emerge. Using a two-dimensional Wiener process, we extend the dynamics to capture the stochastic effects of partner selection and the resulting opponent distribution. We derive a sufficient condition for the population to be cooperation-promoting and prove the existence of a stationary distribution. Simulations confirm that the stochastic model accurately captures the policy-gradient dynamics and clarifies how the learning rate affects the emergence of cooperation.
[MA-2] LLM -Guided Communication for Cooperative Multi-Agent Reinforcement Learning ICML2026
【速读】:该论文试图解决多智能体强化学习(MARL)中因部分可观测性导致的信息不足问题,现有方法往往依赖低效的信息交换或无法传递充分的状态信息。解决方案的关键在于提出基于大语言模型(LLM)的多智能体通信机制(LMAC),利用LLM的推理能力设计通信协议,使所有智能体能够尽可能准确且一致地重构环境状态;LMAC通过显式的状态感知准则迭代优化协议,提升状态恢复精度并缩小智能体间知识差异。实验表明,LMAC在多种MARL基准任务上显著优于现有通信基线方法。
链接: https://arxiv.org/abs/2605.18077
作者: Sangjun Bae,Yisak Park,Sanghyeon Lee,Seungyul Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 9 pages for main, 32 pages for total, Accepted to ICML 2026
Abstract:Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM’s reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents’ knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.
[MA-3] Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning ICML2026
【速读】:该论文试图解决多智能体强化学习(MARL)中协作机制在外部扰动下脆弱性的问题,特别是当智能体间的交互结构本身被破坏时,现有鲁棒MARL方法因主要针对价值导向攻击而缺乏有效应对。解决方案的关键在于提出一种基于信息论的对抗学习框架——交互断裂对抗学习(IBAL),通过扰动智能体的观测和动作来构造破坏协作的攻击,并训练智能体在这些干扰下仍能可靠执行任务。实验表明,该方法在多种攻击场景下均优于现有鲁棒MARL基线,且在智能体缺失情况下表现更优。
链接: https://arxiv.org/abs/2605.18024
作者: Sunwoo Lee,Mingu Kang,Yonghyeon Jo,Seungyul Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 8 pages for main, 27 pages for total, Accepted to ICML 2026
Abstract:Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction-breaking adversarial learning (IBAL) framework that takes an information-theoretic view to construct attacks that impede coordination by perturbing agents’ observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent-missing scenarios.
[MA-4] Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为自主经济代理(autonomous economic agents)在实际市场环境中可能引发的系统性风险问题,尤其是其集体行为可能导致市场波动加剧和欺骗行为规模化扩散。解决方案的关键在于提出一个名为“Agent Bazaar”的多代理仿真框架,用于评估经济对齐能力(Economic Alignment),即代理系统维持市场稳定性和完整性(market stability and integrity)的能力;同时识别出两种典型失败模式:B2C市场的“算法不稳定性”(The Crash)和C2C市场的“Sybil欺骗”(The Lemon Market)。研究进一步发现,当前前沿与开源模型普遍缺乏自我调节能力,且失败严重程度因模型而异而非规模相关。为此,作者设计了两类经济对齐机制——“稳定型企业”(Stabilizing Firms)与“怀疑守护者”(Skeptical Guardians),并通过REINFORCE++结合自适应课程训练方法,成功训练出一个90亿参数模型,在多项指标上超越所有被测模型。最终提出的经济对齐分数(Economic Alignment Score, EAS)是一个包含稳定性、完整性、福利与盈利能力四个维度的标量指标,实现了跨模型的量化比较,并证明经济对齐能力可独立于通用能力进行针对性强化学习训练。
链接: https://arxiv.org/abs/2605.17698
作者: Seth Karten,Cameron Crow,Chi Jin
机构: Princeton University (普林斯顿大学)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 17 pages, 9 figures
Abstract:The deployment of Large Language Models (LLMs) as autonomous economic agents introduces systemic risks that extend beyond individual capability failures. As agents transition to directly interacting with marketplaces, their collective behavior can amplify volatility and mask deception at scale. We introduce the Agent Bazaar, a multi-agent simulation framework for evaluating Economic Alignment, the capacity of agentic systems to preserve market stability and integrity. We identify two failure modes: (1) Algorithmic Instability in a B2C market (“The Crash”), where firms amplify price volatility until the market collapses, and (2) Sybil Deception in a C2C market (“The Lemon Market”), where a single deceptive agent controlling multiple coordinated seller identities floods the market with fraudulent listings, eroding trust and consumer welfare. We evaluate frontier and open-weight models across both scenarios and find that models largely fail to self-regulate, with failure severity varying by model rather than by size. We propose economically aligned harnesses, Stabilizing Firms and Skeptical Guardians, that improve outcomes but remain fragile under harder market conditions. To close this gap, we train agents with REINFORCE++ using an adaptive curriculum, producing a 9B model that outperforms all evaluated frontier and open-weight models. We propose the Economic Alignment Score (EAS), a 4-component scalar metric aggregating stability, integrity, welfare, and profitability, enabling direct cross-model comparison. Our results show that economic alignment is orthogonal to general capability and can be directly trained with targeted RL.
[MA-5] Reservation Based Smart Parking Management
【速读】:该论文旨在解决智能城市与智能交通系统(ITS)中因用户未按时离场导致的“预约车位被占用”(即“No PARK”现象)问题,从而提升停车管理的可靠性与资源利用率。其解决方案的关键在于提出一种双机制架构:一是通过动态缓冲区机制(动态大小的不可预订车位缓冲区)优化停车位可用性与预约成功率之间的平衡;二是引入基于声誉的奖励机制,利用“星级评分”对用户行为进行奖惩,以经济惩罚和访问限制激励用户准时离场。仿真结果表明,该方案能有效减少“No PARK”事件并提升整体资源利用效率,显著增强城市运行的可持续性。
链接: https://arxiv.org/abs/2605.17650
作者: Giacomo Cabri,Manuela Montangero,Filippo Muzzini,Roberto Wang
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: 6 pages, accepted at the IEEE WETICE 2026 Conference
Abstract:In the framework of Smart Cities and Intelligent Transportation Systems (ITS), efficient parking management is essential to reduce urban congestion and emissions. However, current reservation-based systems often encounter a scenario in which users find their reserved slot occupied by a previous occupant who failed to vacate on time (“No PARK” situation). This paper introduces a dual-mechanism architecture designed to enhance system reliability. A Reservation Module uses a dynamic size buffer of non-reservable slots to grant parking availability. A reputation-based Reward System exploits a “star-based” metric to incentivize punctual departures through financial penalties and access restrictions. The simulations conducted with the SUMO urban simulator are promising, showing that the dynamic buffer strategy provides a better tradeoff between parking availability and reservation success. By progressively adapting to users behavior, the proposed system mitigates “NO PARK” instances and improves resource utilization, significantly enhancing urban viability. Index Terms-Smart City, Intelligent transportation systems, Parking, Reservation systems, V2I, Reputation-based mechanisms, Smart Parking
[MA-6] Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports
【速读】:该论文旨在解决无效缺陷报告(invalid bug reports)的自动化分类与无代码修复建议生成问题,这类报告占比较大且常无需修改代码即可解决,但当前依赖人工处理造成资源浪费。解决方案的关键在于构建一个面向根本原因(root-cause)的标准化分类体系,并通过三种方法——基础大语言模型(vanilla LLMs)、检索增强生成(Retrieval Augmented Generation, RAG)和代理式网络搜索(agentic web search)——进行实验验证。研究发现,RAG在无效子类识别上表现最佳(加权F1为0.66),而代理式网络搜索在生成无代码修复建议方面效果最优(Judge LLM成功率达68.9%),表明结合外部信息检索与推理能力的方法更适用于复杂场景下的自动分类与修复决策。
链接: https://arxiv.org/abs/2605.17561
作者: Mahmut Furkan Gon,Emre Dinc,Tevfik Emre Sungur,Eray Tuzun
机构: Bilkent University (比尔肯大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 19 pages, 4 figures, 7 tables
Abstract:Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not require code changes, and are resolved with a no-code fix. Manually determining the root cause of the invalid bug reports and providing actionable resolutions by the customer support causes a serious waste of resources. Our goal is to introduce a standardized taxonomy for root-cause oriented invalid bug report subclassification, and perform experiments to test the accuracy of various approaches on invalid subclassification and no-code fix generation. We study how different configurations perform on a gold-standard benchmark we have created. Using a manually curated benchmark for higher quality analysis, we experimented with vanilla LLMs, Retrieval Augmented Generation, and agentic web search to identify invalid subclasses and generate no-code fixes. We evaluated the results against manually labeled ground truth data that includes the invalid subclass and no-code fixes from the original bug reports. We measured subclass detection performance with weighted F1-Score, and assessed no-code fix suggestions using BERTScore and Judge LLM success rates. For subclassification, retrieval augmented generation achieves the highest overall performance with 0.66 weighted F1, slightly outperforming vanilla LLMs at 0.65 and agentic web search at 0.64. At the subclass level, performance peaks at 0.85 F1 for Non-reproducibility and 0.79 for Feature Request and Question, while Wrong Version remains the most challenging with scores between 0.00 and 0.29. For no-code fix generation, agentic web search achieves the highest overall Judge LLM success rate at 68.9%, compared to 64.4% for RAG applications and 64.9% for vanilla LLMs, with subclass-level peaks of 87.4% for Working as Designed and 72.2% for Question.
[MA-7] Human-Flow Digital Twin for Predicting the Effects of Mobility Introduction on Visitor Circulation MDM2026
【速读】:该论文旨在解决如何量化评估引入新型移动方式(如共享单车、电动滑板车等)对人流分布和游客行为影响的问题。其核心挑战在于准确预测人类流动模式的变化,尤其是在复杂环境中(如公园或景区)的动态响应。解决方案的关键在于构建一个基于多智能体模拟器的人流数字孪生框架,通过分析干预前的人流数据、各景点间的距离、吸引力指标及出行量,训练每个代理的决策模型——该模型以访客当前状态和环境信息为输入,输出下一目的地选择。通过将移动措施建模为点间距离或吸引力的变化,该框架可在模拟器中重现引入移动方式后的人流,并量化游客数量与流动路径的变化。实验验证表明,在日本和歌山城公园的数据上,使用多层感知机决策模型时,空间人口分布的余弦相似度超过0.7,证明该方法能有效复现移动引入带来的人流变化。
链接: https://arxiv.org/abs/2605.17426
作者: Chiharu Shima,Haruki Yonekura,Fukuharu Tanaka,Tatsuya Amano,Hirozumi Yamaguchi
机构: Wakayama Prefectural University (和歌山県立大学); Toyota Motor Corporation (丰田汽车公司); Kintone Co., Ltd. (Kintone有限公司); Yamaha Motor Co., Ltd. (雅马哈发动机有限公司)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注: An accepted paper at the 27th IEEE International Conference on Mobile Data Management (MDM 2026). Project page: this https URL
Abstract:We propose a framework for predicting the effects of mobility introduction measures using a human-flow digital twin. This digital twin incorporates a multi-agent simulator that can represent how visitors choose destinations depending on factors such as their current location and the attractiveness of spots. We extract data on how visitors selected destinations with respect to measured pre-intervention human-flow data, inter-spot distances, spot attractiveness, and travel volumes, and use these data to train each agent’s decision model of this simulator. The trained decision model is a function that takes a visitor’s current state and surrounding environmental information as input and outputs which spot the visitor will move toward next. By expressing mobility introduction measures as changes to inter-point distances or to spot attractiveness, the framework can reproduce human flows with mobility introduction in the multi-agent simulator and thereby quantify effects such as changes in visitor counts and circulation. We evaluated the proposed method using human-flow data measured with and without introducing mobility within Wakayama Castle Park in Japan. When reproducing flows with mobility introduction using a multi-layer perceptron decision model, the cosine similarity of the spatial population distribution exceeded 0.7, confirming that the approach can replicate the flow changes caused by the mobility introduction.
[MA-8] Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning
【速读】:该论文试图解决协作式多智能体强化学习(MARL)中协调图(Coordination Graphs)的拓扑结构学习问题,即如何在不依赖启发式规则的前提下,理论性地确定哪些边应当存在以及每条边应承载的信息量。现有方法缺乏对图结构学习的理论保障,且无法根据智能体间结构差异合理分配通信容量。解决方案的关键在于提出异质信息瓶颈协调图(Heterogeneous Information-Bottleneck Coordination Graphs, HIBCG),其核心是基于图信息瓶颈(GIB)构建一个群体对齐的块对角先验(group-aligned block-diagonal prior),从而通过闭式准则决定每个群组块内的边保留及其密度;在此基础上,进一步在所得拓扑上控制每个智能体的特征带宽,压缩消息以保留任务相关的内容。论文证明了该先验能严格收紧拓扑学习的变分界、目标可按群组块分解实现差异化边控制,并且容量分配遵循水填充(water-filling)原则。
链接: https://arxiv.org/abs/2605.17393
作者: Wei Duan,Junyu Xuan,En Yu,Xiaoyu Yang,Jie Lu
机构: University of Technology Sydney (悉尼科技大学)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention – determining which edges should exist and at what density per group block – and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
[MA-9] ask Capability Improvement Algorithm for Collaborative Manipulators
【速读】:该论文旨在解决多机械臂协同操作中任务能力受限的问题,特别是由于作用力施加在物体质心以外位置时产生的 undesired moment(非期望力矩)所导致的性能瓶颈。其解决方案的关键在于:主动利用这些原本被视为干扰的附加力矩来增强单个机械臂的任务执行能力,从而提升整个协作群体的作业效能。这种力矩的“正向利用”不仅提升了对象搬运能力,还实现了最优能力配置、资源最优分配以及最大容错性,仿真结果表明相较不使用附加力矩的情况,整体能力提升了5.86%。
链接: https://arxiv.org/abs/2605.17293
作者: Keshab Patra,Arpita Sinha,Anirban Guha
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:
Abstract:This work introduces a cooperative task capability improvement utilizing additional moments. The manipulators apply forces at the object’s grasp point. Applying forces at a point other than the object’s center of gravity produces undesired moments. The undesired moment acts as an additional moment. It improves the capability of an individual manipulator and, hence, the entire collaborative group. Any improvements in task capability directly add up to the object and transportation capability. The group’s enhanced capability also helps achieve optimal capability, optimal resource allocation, and maximum fault tolerance in object manipulation. Our simulation results show an improvement in the capability of 5.86 % compared to when no moment is used to enhance the capability of the manipulators.
[MA-10] MetaCogAgent : A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation
【速读】:该论文试图解决多智能体大语言模型(Multi-agent Large Language Model, LLM)系统在协作执行复杂任务时,因缺乏对自身能力边界的认知而导致的过度自信问题。现有框架通常基于预定义角色分配任务,而未考虑智能体是否能准确评估自身胜任力,从而引发在非专业领域错误执行的问题。解决方案的关键在于引入元认知机制——MetaCogAgent框架为每个智能体配备一个元认知自我评估单元(Metacognitive Self-Assessment Unit),通过融合语义化不确定性与历史能力特征来量化任务-能力匹配度,并结合自适应委托协议和能力边界学习模块实现动态任务路由与持续优化。实验表明,该方法在MetaCog-Eval基准上达到82.4%的任务准确率,显著优于基线方法,同时减少API调用开销,验证了元认知组件的有效性。
链接: https://arxiv.org/abs/2605.17292
作者: Chenyu Wang,Yang Shu
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 6 pages, submitted to IEEE SMC 2026
Abstract:Multi-agent large language model (LLM) systems have shown promise for solving complex tasks through agent collaboration. However, existing frameworks assign tasks based on predefined roles without considering whether an agent can accurately assess its own competence boundaries, leading to overconfident execution on tasks beyond its expertise. Inspired by metacognition theory from cognitive science, we propose MetaCogAgent, a multi-agent LLM framework where each agent is equipped with a Metacognitive Self-Assessment Unit that evaluates task-capability alignment before execution. The framework introduces three contributions: (1) a self-assessment mechanism that estimates per-task confidence by combining verbalized uncertainty with historical capability profiles; (2) an adaptive delegation protocol that routes low-confidence tasks to better-suited agents through cross-agent evaluation; and (3) a capability boundary learning module that iteratively refines each agent’s competence model via cybernetic feedback. Experiments on our constructed MetaCog-Eval benchmark (700 tasks across 5 cognitive dimensions) demonstrate that MetaCogAgent achieves 82.4% task accuracy – 8.7% above the best routing baseline – while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting. Ablation studies confirm that each metacognitive component contributes to overall system performance.
[MA-11] Bimodal Synchronization Performance: Why Noise and Sparse Connectivity Can Improve Collective Timing
【速读】:该论文试图解决分布式系统中基于萤火虫同步机制的脉冲耦合振子模型在何种参数条件下能够实现集体同步的问题。其解决方案的关键在于揭示了同步性能对“阈值-脉冲持续时间”平衡的敏感性:只有当邻近脉冲邻居的比例阈值与脉冲持续时间处于临界平衡区域时,系统才能出现近完美的同步;在此区域内,系统表现出双峰性能——要么达成全局同步,要么陷入稳定的多簇状态(即对称相位偏移的子群相互强化而阻碍全局同步)。研究进一步表明,降低连接度或引入噪声可抑制低性能的多簇状态,说明高度连通或无噪声的系统并非最优同步配置。
链接: https://arxiv.org/abs/2605.17206
作者: Till Aust,Tianfu Zhang,Andreagiovanni Reina,Heiko Hamann
机构: 1. University of Stuttgart (斯图加特大学); 2. Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); 3. University of Oxford (牛津大学)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Pulse-coupled oscillator models inspired by firefly synchronization are widely used to study decentralized time coordination in distributed systems. We analyze a discrete-time, discrete-phase firefly-inspired synchronization model and show that collective synchrony emerges only near a critical balance between the quorum threshold (fraction of pulsing neighbors required to trigger a phase update) and the pulse duration (how long agents remain detectable to others). Within this parameter region, the system exhibits bimodal performance: it either reaches near-perfect synchronization or becomes trapped in stable multi-cluster states, where symmetrically phase-offset subgroups mutually reinforce one another and prevent global synchrony. Our analysis shows that reducing connectivity or introducing noise suppresses these low-performance states by breaking such symmetric interactions, indicating that highly connected or noiseless systems are not necessarily optimal for collective synchronization.
[MA-12] Multi-LLM Systems Exhibit Robust Semantic Collapse
【速读】:该论文试图解决的问题是:多大语言模型(multi-large language model, multi-LLM)系统在闭环运行中是否能够持续生成具有语义多样性的新内容,从而回答关于机器能否自主产生新颖知识的长期哲学争议。解决方案的关键在于揭示了此类系统在长时间运行中表现出的“语义坍缩”(semantic collapse)现象——即尽管表面词汇形式多样,但语义表征却趋于一致收敛。作者通过大规模模拟(200–1,000轮)和十二种干预策略(包括解码参数调整、提示设计、代理组合、激活工程及强化学习)均未能恢复语义多样性,进一步表明该现象并非由对齐或从众偏差导致,而是源于自回归生成机制本身的内在特性。这一发现揭示了当前多LLM系统在封闭回路环境中难以维持开放知识生产的根本限制。
链接: https://arxiv.org/abs/2605.17193
作者: Weiyi Kong,Shiyang Lai,Jinghua Piao,James Evans
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: 64 pages, 8 figures, 7 tables; includes Supplementary Information
Abstract:Whether machines can originate novel content has been debated for nearly two centuries, from Lovelace’s assertion that no engine can “originate anything” to Turing’s question of whether a machine can amplify ideas brought in from outside. Multi-large language model (LLM) systems, increasingly deployed for autonomous generation, reopen this question empirically. Here we show that such systems, operating in closed loops, exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. Across model families, extended simulations of 200 to 1,000 rounds, the pattern remains consistent. Twelve intervention strategies, spanning decoding parameters, prompt design, agent composition, activation engineering, and reinforcement learning, fail to restore semantic diversity. Mechanistic analyses suggest that semantic collapse is not explained by alignment or conformity biases, but is consistent with intrinsic properties of autoregressive generation. Our results point to fundamental constraints in the ability of multi-LLM systems to sustain open-ended knowledge production in closed-loop settings.
[MA-13] Responsible Agent ic AI Requires Explicit Provenance
【速读】:该论文试图解决的问题是:当前生成式 AI(Agentic AI)在软件工程等现实场景中快速普及,但公众信任未能同步提升,其根本原因在于“责任”这一概念仍停留在主观层面,缺乏可量化、可追溯且可干预的来源证明(provenance),导致在由多方共同设计的系统组合引发损害时无法明确责任归属。解决方案的关键在于建立显式来源证明(explicit provenance),即贯穿整个智能体生命周期的结构化记录机制,它通过因果归因函数与责任张量(responsibility tensor)形式化编码责任信息,并在四个生命周期层级上实现计算可行性和在线干预能力,从而将责任转化为可计算、可行动的机制。论文强调,这种显式来源证明不是可选优化项,而是负责任的生成式 AI 的必要前提,所有利益相关方均不可忽视。
链接: https://arxiv.org/abs/2605.17169
作者: Jinwei Hu,Xinmiao Huang,Qisong He,Youcheng Sun,Yi Dong,Xiaowei Huang
机构: University of Liverpool (利物浦大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Under Review
Abstract:Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace. The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and interventionable provenance needed to assign it when harm emerges from compositions no single party designed. We position that what is missing is not better benchmark-level evaluation but \textbfexplicit provenance across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. We advance this agenda along four axes: establishing \textitwhy such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing \textitwhat it must encode through a causal attribution function and responsibility tensor, discussing \textithow it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and interveneable online before irreversible harm accumulates, and examining \textitwho bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional.
[MA-14] MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop
【速读】:该论文旨在解决企业环境中文档处理自动化面临的挑战,即传统人工方式效率低且易出错。其解决方案的关键在于提出一种多智能体架构(Multi-Agent Document Processing, MADP),该架构融合了基于深度学习的分类与解析技术、大语言模型(Large Language Model, LLM)的信息抽取能力,并通过选择性的人工验证机制(Human-in-the-Loop, HITL)保障准确性。MADP由五个专业化智能体组成——分类器(Classificator)、拆分器(Splitter)、解析器(Parser)、提取器(Extraction)和验证器(Validator),并引入创新的提示微调方法(Prompt Fine Tuning with Feedback Inheritance, PFTFI),实现高效、高精度的端到端文档处理流程。实证结果表明,该方案在年处理10万张发票场景下可减少约70%的全职人力需求,在真实世界955份文档中实现97.0%的全流程自动化率,仅3%需人工介入;同时,系统在文档级准确率达到98.5%,并显著降低碳排放(69%)、能耗(69%)及水资源消耗(63%),展现出卓越的性能与可持续性优势。
链接: https://arxiv.org/abs/2605.17159
作者: Diego Gosmar,Giovanni Zenezini
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 18 pages, 5 figures
Abstract:Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents–Classificator, Splitter, Parser, Extraction, and Validator–with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents per each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.
[MA-15] S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
【速读】:该论文试图解决多智能体系统中因共享可变自然语言状态而导致的结构竞争条件(Structural Race Conditions, SRCs)问题,这类竞争条件包括写-写冲突和跨分片的过时读取冲突,会无声地破坏智能体输出的一致性。现有框架(如LangGraph、CrewAI、AutoGen)缺乏对共享状态的写所有权语义,无法有效防止此类错误。解决方案的关键是提出S-Bus,一种基于HTTP中间件的机制,其核心为服务端的DeliveryLog——一个按智能体记录HTTP GET操作的日志,能够在不修改智能体SDK的前提下,于提交时刻自动重建每个智能体的读集(read set),从而实现可观测读隔离(Observable-Read Isolation, ORI),这是一种针对HTTP可观测投影的偏因果一致性保证,能有效避免协作智能体在共享分片场景下产生结构竞争。论文通过三项贡献验证了该方案的有效性:(C1) DeliveryLog机制的三层次形式化证明(TLAPS机器检查、TLC模型检测与Dafny归纳推理);(C2) 实验表明在427,308次HTTP 409冲突测试中,S-Bus与PostgreSQL SERIALIZABLE和Redis WATCH/MULTI在防止结构冲突方面达到一致性能,且零类型I腐蚀;(C3) ORI的适用范围受拓扑约束:在专用分片任务中无害,在单分片协作写入场景中可能传播并发矛盾,需谨慎使用。
链接: https://arxiv.org/abs/2605.17076
作者: Sajjad Khan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注: 24 pages, 23 tables. Code, formal proofs, and experimental harness available at: this https URL
Abstract:Concurrent LLM agents sharing mutable natural-language state produce Structural Race Conditions (SRCs): write-write and cross-shard stale-read conflicts that silently corrupt agent output. Existing multi-agent frameworks (LangGraph, CrewAI, AutoGen) provide no write-ownership semantics over shared state. We present S-Bus, an HTTP middleware whose central mechanism is a server-side DeliveryLog: a per-agent log of HTTP GET operations that automatically reconstructs each agent’s read set at commit time without agent SDK changes under HTTP/1.1. The consistency property the DeliveryLog provides – Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable projection of the read set – prevents structural race conditions when agents collaborate via shared shards. Three contributions: (C1) The DeliveryLog mechanism for automatic HTTP-traffic-based read-set reconstruction, with three-tier mechanised evidence: ReadSetSoundness and ORICommitSafety machine-checked in TLAPS (modulo one retained typing axiom); exhaustive TLC at N=3 (20,763,484 distinct states, zero violations); Dafny discharges 9 inductive soundness lemmas. (C2) Empirical structural-conflict prevention parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI on shared-shard contention sweeps with 427,308 active HTTP-409 conflicts: zero Type-I corruptions across all three backends. (C3) ORI’s operating envelope is topology-conditional: semantically neutral in dedicated-shard workloads; harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. Source code: this https URL Comments: 24 pages, 23 tables. Code, formal proofs, and experimental harness available at: this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA) Cite as: arXiv:2605.17076 [cs.LG] (or arXiv:2605.17076v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17076 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sajjad Khan [view email] [v1] Sat, 16 May 2026 16:46:27 UTC (58 KB)
[MA-16] PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning
【速读】:该论文试图解决的问题是:当前智能体系统(agentic systems)在处理长期经验时,缺乏对多模态记忆(multimodal memory)的有效建模,而多模态记忆在现实世界应用中至关重要。与单一模态记忆相比,多模态记忆面临异构输入融合、以个体为中心的信息对齐以及跨粒度证据聚合等挑战。解决方案的关键在于提出 PyraVid,一个受认知科学中事件分割理论(Event Segmentation Theory)启发的分层多模态记忆框架。PyraVid 将长视频组织成由粗到细的金字塔结构,实现结构化记忆访问和有效的证据聚合,并通过结构引导的记忆扩展与剪枝机制,在保留强因果关联但语义相似度低的相关事件的同时降低噪声,从而提升长时程推理能力。实验表明,PyraVid 在多个长视频理解基准上均显著优于基线方法,验证了分层多模态记忆的有效性。
链接: https://arxiv.org/abs/2605.17065
作者: Sikuan Yan,Sicheng Dong,Haotong Wang,Ercong Nie,Yilun Liu,Jinhe Bi,Yingjie Xu,Susanna Schwarzmann,Riccardo Trivisonno,Volker Tresp,Yunpu Ma
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.
[MA-17] Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
【速读】:该论文旨在解决多层级供应链中自主生成式AI代理(autonomous generative AI agents)在实际部署时面临的性能不可靠问题,尤其是由决策不一致性引发的“代理鞭子效应”(agent bullwhip effect)。该效应表现为:在时间维度上同一设施内决策波动加剧,在空间维度上不同设施间决策方差扩大,从而导致系统级成本上升和稳定性下降。解决方案的关键在于提出一种基于Group Relative Policy Optimization (GRPO) 的强化学习后训练框架,通过使用供应链整体奖励信号对共享的基础大语言模型(LLM)进行微调,使代理在协作与信息延迟环境下实现更稳定、可靠的决策行为,显著降低极端事件发生概率并缓解代理鞭子效应。
链接: https://arxiv.org/abs/2605.17036
作者: Carol Xuan Long,David Simchi-Levi,Feng Zhu,Huangyuan Su,Andre P. Calmon,Flavio P. Calmon
机构: Harvard University (哈佛大学); Kempner Institute (肯普纳研究所); Georgia Tech (佐治亚理工学院)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:
Abstract:This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce the agent bullwhip effect, the amplification of decision unreliability across echelons, manifesting along two dimensions: decision variance increases both across facilities at the same point in time and within the same facility across time. We develop a mathematical framework showing that this phenomenon is inherent to multi-agent systems that involve coordination and information delays, and we demonstrate that repeated sampling fails to meaningfully reduce it. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. GRPO post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.
[MA-18] Lifelong LaCAM with Local Guidance for Lifelong MAPF
【速读】:该论文试图解决的是在终身多智能体路径规划(Lifelong Multi-Agent Path Finding, LMAPF)场景下,如何利用局部引导(Local Guidance)技术提升实时规划的性能和任务完成吞吐量的问题。传统方法在连续到达的任务流中难以维持高效率与可扩展性,而本文提出LLLG(Lifelong LaCAM with Local Guidance),其关键在于引入了滚动时域窗口化规划框架(receding-horizon windowed planning framework)以及每步迭代中对前一时刻解的热启动(warm-starts guidance),从而在密集环境中保持高吞吐量并实现比现有方法更优的长期性能表现。
链接: https://arxiv.org/abs/2605.16855
作者: Tomoki Arita,Keisuke Okumura
机构: 1. University of Tokyo (东京大学); 2. RIKEN (理化学研究所)
类目: Multiagent Systems (cs.MA)
备注: 10 pages, 11 figures, accepted to SoCS 2026
Abstract:Local guidance has recently proven to be a powerful driver of empirical performance in real-time, suboptimal multi-agent pathfinding (MAPF), improving the scalable configuration-based solver LaCAM. By injecting informative spatiotemporal cues around each agent, local guidance mitigates congestion, reduces waiting, and remains scalable enough even with tight time budgets, yielding state-of-the-art performance for one-shot MAPF. This study asks whether the same benefits can be lifted to the lifelong setting (LMAPF), where tasks arrive continuously and improvements in per-step plans can increase task completion throughput over long horizons. We propose LLLG, a Lifelong version of LaCAM enhanced with Local Guidance, which employs a receding-horizon windowed planning framework and warm-starts guidance from the previous solution at each timestep. Our method scales effectively, maintains high throughput even in compact, dense environments, and surpasses existing planners, thereby pushing the frontier of real-time, lifelong MAPF.
[MA-19] Dynamic Deployment of Mobile Charging Trucks During Natural Disaster Evacuation: An Offline-to-Online Framework
【速读】:该论文旨在解决大规模疏散过程中电动汽车(EV)集中充电需求导致固定充电站(FCSs)过载的问题,从而引发等待时间延长和风险暴露增加。其核心解决方案是通过动态部署移动充电车(MCTs)来补充FCSs,并提出了一种自适应风险感知的MCT部署框架(ARMD)。该框架将问题分解为两个子问题:一是基于风险感知的MCT在FCS间的分配,二是MCT到指定FCS的动态路径规划;并通过离线到在线的范式进行求解。其中,资源分配被建模为去中心化的部分可观测马尔可夫决策过程(Decentralized Partially Observable Markov Decision Process, Dec-POMDP),并采用多智能体近端策略优化(MAPPO)算法生成协同控制策略,该策略在离线仿真环境中预训练,并根据实时疏散情境在线自适应调整;路径规划则依赖于时空旅行时间预测模型支持滚动时域更新。实验结果表明,ARMD在减少风险暴露方面显著优于多种基线方法,在需求扰动场景下平均风险暴露降低达71.1%,在基础设施或道路中断情况下降低幅度为39.3%–60.5%,且随着扰动严重程度提升优势更加明显,验证了其在不确定疏散环境下的有效性与鲁棒性。
链接: https://arxiv.org/abs/2605.16784
作者: Rui Ma,Zilin Bian,Kaan Ozbay
机构: New York University(纽约大学); Rochester Institute of Technology(罗切斯特理工学院)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:During large-scale evacuations, concentrated electric vehicle (EV) charging demand can overload fixed charging stations (FCSs), leading to prolonged waiting time and increased risk exposure. To address this challenge, this study proposes dynamically deploying mobile charging trucks (MCTs) to complement FCSs, and develops an Adaptive Risk-aware MCT Deployment (ARMD) framework for real-time operation. It divides the MCT deployment into two problems: risk-aware allocation of MCTs among FCSs and dynamic routing of MCTs to the assigned FCSs, and solves them under an offline-to-online paradigm. The resource allocation problem is formulated as a decentralized partially observable Markov decision process, and a multi-agent proximal policy optimization (MAPPO)-based policy is developed to coordinate multiple MCTs under decentralized observations. The policy is pre-trained offline in an evacuation simulator and adaptively refined online according to current evacuation context. For routing, a spatio-temporal travel time predictor is developed to support rolling-horizon route updates. The proposed framework is evaluated in a simulated hurricane evacuation environment built using real-world data from Hillsborough County, Florida. Experiments show that ARMD consistently outperforms offline optimization, online heuristic dispatch, and rolling-horizon optimization in reducing risk exposure. For demand perturbation scenarios, ARMD reduces average risk exposure by up to 71.1%, relative to the baseline without MCTs. In the case of fixed e-vehicle charging infrastructure or road link failures, ARMD achieves 39.3% to 60.5% reduction in average risk exposure, with its advantages becoming more pronounced as the severity of disruption increases. These results demonstrate the effectiveness and robustness of ARMD in enhancing mobile charging operations for realistic scenarios of uncertain evacuation conditions.
[MA-20] NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
【速读】:该论文试图解决多智能体语言系统(Multi-agent language systems)依赖人工设计工作流导致灵活性与可扩展性不足的问题。传统方法中,智能体被预先分配语义角色且通信协议固定,难以适应复杂任务的动态需求。其解决方案的关键在于提出 NeuroMAS——一种将多智能体系统建模为可训练、类神经网络架构的方法:智能体节点无角色限制但具备结构感知能力(structure-aware),仅通过拓扑结构定义信息流动的潜在路径,而具体的通信、分工与协作则由强化学习决定。这一范式将多智能体系统的设计从“工作流工程”转变为“架构设计”,使深度、宽度、连接性和生长协议成为可扩展的能力来源。此外,理论分析表明,当任务具有层次分解特性时,这种模块化的文本计算方式更具参数效率;实验验证了 NeuroMAS 在推理时和训练时基线上的显著性能提升,并发现组织规模扩展具有路径依赖性——直接从零训练大系统困难,但若从已训练的小系统逐步增长,则可行性强。这表明学习型神经多智能体系统是大型语言模型(LLM)未来重要的可扩展方向。
链接: https://arxiv.org/abs/2605.16757
作者: Haoran Lu,Luyang Fang,Wenxuan Zhong,Ping Ma
机构: University of Georgia (佐治亚大学)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi-agent language system as a trainable and scalable neural-network-like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role-free but structure-aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi-agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter-efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference-time and trained multi-agent baselines. We further find that organizational scaling is path-dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems. These results suggest that learned neural multi-agent systems are a promising scaling axis for LLMs.
[MA-21] Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned Self-Correcting Video Generation
【速读】:该论文试图解决生成式视频模型在企业环境中应用时面临的两大核心问题:时间不一致性(temporal inconsistencies)和品牌偏差(brand misalignment),尤其是现有单体架构难以强制执行严格的品牌约束,常导致生成未经批准的视觉资产。解决方案的关键在于提出一种名为Genflow的复合人工智能系统,其核心创新包括两个部分:一是基于检索的“品牌DNA”提取模块,用于将企业品牌规范参数化以指导生成过程;二是对抗性多智能体质量控制(Adversarial Multi-Agent Quality Control, QC)循环,通过多个评估代理对生成帧进行迭代批判并驱动生成模型持续优化,直至达成确定性共识。这一多阶段自校正流水线使品牌合规视频生成的成功率从42%提升至89%,为可扩展的企业级生成式媒体系统提供了可靠框架。
链接: https://arxiv.org/abs/2605.16748
作者: Debanshu Das,Lavi Nigam,Sunil Kumar Jang Bahadur,Gopala Dhar
机构: Google(谷歌)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Multimedia (cs.MM)
备注: 6 pages, 2 figures, 2 tables. Accepted to the ACM Conference on AI and Agentic Systems (CAIS '26). Includes demo video and code repository links
Abstract:Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based ‘Brand DNA’ extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of a single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.
[MA-22] GRASP: Graph Agent ic Search over Propositions for Multi-hop Question Answering
【速读】:该论文旨在解决多跳问答(multi-hop question answering)中传统代理检索系统(agentic retrieval)在准确率与计算效率之间难以平衡的问题,尤其是现有方法依赖知识图谱时带来的高昂索引构建成本和推理阶段的Token消耗问题。其解决方案的关键在于提出Graph Agentic Search over Propositions (GRASP),该系统通过将多跳查询分解为依赖感知的计划来动态调整子代理数量,并引入一个三层层次化图结构(实体层、命题层、段落层)实现高效检索:实体层用于精准遍历,命题层结合倒数排名投票机制提升召回率。GRASP在多个基准数据集上实现了最高准确率的同时,token使用量显著减少(比IRCoT+HippoRAG2少40–50%,比最优对比方法少30%),并提出了“成功经济”(success economy)指标以衡量单位正确答案的加权平均Token成本,推动效率导向的评估成为标准实践。
链接: https://arxiv.org/abs/2605.16598
作者: Stockton Jenkins,Ramya Korlakai Vinayak,Junjie Hu
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic retrieval improves multi-hop question answering by giving language models autonomy to iteratively gather evidence. Recent work augments these systems with knowledge graphs for structured traversal, but this combination introduces significant cost: expensive graph construction at index time and compounding token usage at inference time. We introduce Graph Agentic Search over Propositions (GRASP), an agentic system that simultaneously optimizes for high accuracy and minimal token usage in multi-hop question answering. Rather than executing a rigid, singular query, GRASP actively coordinates its retrieval strategy by decomposing multi-hop queries into dependency-aware plans. This enables GRASP to dynamically scale the number of sub-agents according to the complexity of the problem. Each sub-agent resolves its single-hop query by exploring a novel three-layer hierarchical graph of entities, propositions, and passages, using the entity layer for targeted traversal and the proposition layer for high-recall passage retrieval via reciprocal-rank voting. We evaluate GRASP on MuSiQue, 2WikiMultihopQA, and HotpotQA under two settings: open-corpus retrieval and extended context reasoning (LongBench). GRASP achieves the highest QA accuracy in the open retrieval setting on MuSiQue and 2Wiki while using 40-50 percent fewer tokens than IRCoT+HippoRAG2. Furthermore, GRASP leads on EM and F1 across all three datasets in the LongBench setting while using 30 percent fewer tokens than the next most accurate method. Finally, we introduce success economy - the amortized token cost per correct answer, weighted by difficulty - and advocate for efficiency-aware evaluation as a standard practice for agentic QA.
[MA-23] A Mechanistic Model for Collective Motion from Sensorimotor Regularities
【速读】:该论文试图解决的问题是:传统自驱动粒子模型虽然能描述动物群体行为的宏观现象,但缺乏对集体行为如何由个体感知-运动机制实际产生的解释,即这些模型本质上是描述性的而非机制性的。解决方案的关键在于构建一个基于机器人学建模框架的、在感官-运动层面运作的机制模型——每个个体通过有限视野内的方位角和表观尺寸线索感知邻近个体,维持不确定的内部状态估计,并通过梯度下降优化期望社会距离来选择动作,无需预设交互力。该模型能够重现极化运动、环形旋转、环状结构及亚群分裂等多种集体行为,且全局敏感性分析表明,行为转变由可测量的生物参数(如视野几何、感官噪声、转向敏捷性和记忆能力)决定,从而将集体行为理解为感官-运动规律相互作用的涌现结果,物种间的差异则源于身体形态和环境的不同。
链接: https://arxiv.org/abs/2605.16522
作者: Vito Mengers,Bao Duc Cao,Oliver Brock
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:
Abstract:Collective behavior in animals has long been modeled through self-propelled particle models, which reproduce striking group-level phenomena through abstract interaction forces. Yet these models are fundamentally descriptive: they leave open the question of how collective behavior is actually produced. Recent empirical work makes this gap concrete: locusts do not align with neighbors, sensory and cognitive mechanisms mediate interaction instead. A mechanistic model must therefore operate at the sensorimotor level, grounded in what individual organisms can actually perceive, estimate, and physically execute. We present such a model based on a modeling framework from robotics, extended here to collective motion. Each agent perceives neighbors through bearing and apparent-size cues within a limited field of view, maintains uncertain internal state estimates, and selects actions through gradient descent on a desired social distance – without any prescribed interaction forces. This simple model produces diverse collective behaviors including polarized motion, milling, ring formations, and subgroup fragmentation. A global sensitivity analysis shows that behavioral transitions are governed by sensorimotor parameters corresponding to measurable biological quantities: field of view geometry, sensory noise, turning agility, and memory. Collective behavior can therefore be understood as the emergent outcome of interacting sensorimotor regularities, and differences across species as the emergent outcome of differences in embodiment and environment.
[MA-24] ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning
【速读】:该论文试图解决的问题是:基于大语言模型(LLM)的智能体虽然能够对单次执行错误进行恢复,但在面对重复性故障时仍会失败,这是因为其底层过程知识(如操作符模式、前置条件和约束)未被修复。现有自演化方法主要通过更新提示词、记忆或模型权重来改进性能,但均未直接修复编码任务执行逻辑的符号结构,且缺乏安全部署所需的治理保障。解决方案的关键在于提出ANNEAL——一种神经符号智能体,它将重复性故障转化为对过程知识图谱的受控符号编辑,而不修改基础模型权重;其核心机制Failure-Driven Knowledge Acquisition (FDKA) 能定位故障操作符,通过约束式LLM生成类型化补丁,并经多维评分、符号护栏和Canary测试验证后提交;所有已接受的修改均具备完整溯源信息与确定性回滚能力。实验表明,ANNEAL是唯一能在四个领域中实现持久结构修复的系统,显著优于ReAct和Reflexion等强基线模型(后者在重复故障场景下仍保持72–100%失败率,而ANNEAL将其降至0%),消融实验证明FDKA对结构修复至关重要。
链接: https://arxiv.org/abs/2605.16309
作者: Safayat Bin Hakim,Keyan Guo,Wenkai Tan,Alvaro Velasquez,Shouhuai Xu,Houbing Herbert Song
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); University at Buffalo (纽约州立大学布法罗分校); University of Colorado Boulder (科罗拉多大学博尔德分校); University of Colorado Colorado Springs (科罗拉多大学斯普林斯分校)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Code Implementation: this https URL
Abstract:LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge–operator schemas, preconditions, and constraints–remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs–strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72-100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.
[MA-25] Consent Chain Degradation in Embodied Multi-Agent Systems: Bridging the Gap Between AI Agent Governance and Robot Ethics ICRA2026
【速读】:该论文试图解决的问题是:在由物理机器人组成的多智能体生态系统中,人类授权(consent)如何在代理委托链中传播、退化甚至失效。现有AI伦理和人机交互(HRI)研究主要关注数字软件代理或一对一的人机互动场景,未能涵盖机器人之间任务委托对人类产生影响的情形。解决方案的关键在于提出“授权链退化”(Consent Chain Degradation, CCD)这一概念框架,用以分析人类授权的特异性、有效性与范围如何随机器人之间的委托传递而逐步削弱,并设计了一个三层次治理架构——具身代理的授权运行时验证框架(Consent Runtime Verification Framework for Embodied Agents, CoRVE),其核心组件包括授权范围建模、委托链追踪与物理不可逆性评估。实证表明,CCD在医疗、家庭和工业机器人场景中均会实际发生,且现行欧盟《人工智能法案》《通用数据保护条例》(GDPR)、《机械指令》及《产品责任指令》均未覆盖CCD的核心维度,存在显著监管空白。
链接: https://arxiv.org/abs/2605.16300
作者: Mehmet Haklidir
机构: TÜBİTAK BİLGEM (土耳其科学与工业研究委员会信息科技研究所)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: Accepted for oral presentation at the 2nd Workshop on Robot Ethics (WoRoBet), ICRA 2026, Vienna, Austria, June 1, 2026. 6 pages, 3 tables, 1 figure
Abstract:Robotic systems are moving from isolated platforms to interconnected multi-agent ecosystems that operate in human environments. This shift raises a governance problem that existing frameworks do not address: how does consent propagate, degrade, and break down across chains of delegation between embodied autonomous agents? The AI ethics community has begun to study consent for digital software agents, and the HRI community has examined consent in dyadic human-robot encounters. Neither body of work covers what happens when physical robots delegate tasks to other robots in ways that affect humans. This paper introduces consent chain degradation (CCD), a conceptual framework for analyzing how the specificity, validity, and scope of human consent erodes as authority passes through multi-robot delegation chains. We propose a three-layer governance architecture, the Consent Runtime Verification Framework for Embodied Agents (CoRVE), which integrates consent scope modeling, delegation chain tracking, and physical irreversibility assessment. Three scenarios in healthcare, domestic, and industrial robotics show how CCD arises in practice, including a worked numerical example. A regulatory gap analysis covering the EU AI Act, the GDPR, the Machinery Regulation, and the Revised Product Liability Directive shows that all four instruments leave core CCD dimensions unaddressed.
[MA-26] Data-driven and distributed governance of building facilities management using decentralized autonomous organization digital twin and large language models
【速读】:该论文旨在解决传统人工智能(AI)与数据驱动的设施管理方法在建筑运营效率提升方面的局限性,特别是由集中式组织结构引发的网络安全脆弱性、情境理解能力不足以及决策过程缺乏关键利益相关者参与的问题。其解决方案的关键在于提出一个融合去中心化自治组织(DAO)、数字孪生(Digital Twin)、大语言模型(LLM)和区块链技术的分布式治理框架,通过DAO实现透明的集体决策,借助数字孪生与物联网(IoT)实现数据驱动的管理,利用LLM提供增强型决策支持,并以区块链保障建筑自动化系统的安全性,从而构建一个可扩展、安全且用户友好的智能建筑管理系统。
链接: https://arxiv.org/abs/2605.16298
作者: Reachsak Ly,Alireza Shojaei,Xinghua Gao,Philip Agee,Abiola Akanmu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 33 pages, 20 figures, 4 tables
Abstract:While traditional AI and data-driven facilities management approaches have improved building operational efficiency, they remain constrained by centralized organizational structures that are vulnerable to cyber attacks, limited contextual understanding, and decision-making processes that exclude key stakeholders from governance. This paper introduces a novel AI- and data-driven distributed governance framework for smart building management that integrates decentralized autonomous organizations (DAOs), digital twins, large language models (LLMs), and blockchain technology. The framework enables transparent collective decision-making through a DAO governance platform, implements data-driven management using IoT and digital twins, incorporates LLM-based virtual assistants for enhanced decision support, and utilizes blockchain for secure building automation. A full-stack decentralized application was developed to facilitate user interaction with these integrated components. The system was evaluated for cost efficiency, scalability, data security, and usability using the System Usability Scale (SUS). Expert interviews were also conducted to assess its practical benefits and implementation challenges.
[MA-27] Efficient Gradient Methods for Distributed Saddle Problems
【速读】:该论文旨在解决分布式鞍点问题(Saddle Problems, SPs)在理论基础方面的不足,特别是在通信成本和计算复杂度上的优化问题。其解决方案的关键在于提出了一种新颖的解耦方法(decoupled method),该方法通过多阶段将问题转化为残差范数的解耦最小化,从而在零尊重框架(zero-respecting framework)内实现了最优通信复杂度。该方法不仅在通信成本上优于现有最佳结果,还突破了外梯度法(Extragradient method)长期存在的Oracle复杂度瓶颈,并通过匹配的下界证明其在梯度跨度算法族中具有通信最优性。此外,论文还将该方法扩展至变分不等式问题(Variational Inequality Problem, VIP),进一步在更广泛的多人非零和博弈场景中达到了新的通信复杂度最优水平。
链接: https://arxiv.org/abs/2605.18604
作者: Ruichen Luo,Anton Rodomanov,Sebastian U. Stich
机构: 未知
类目: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:
Abstract:The distributed setting for Saddle Problems (SPs) has recently emerged as a framework for various modern applications in machine learning and multiagent systems. Despite its relevance, the theoretical foundations of this setting have not yet been thoroughly established. In this paper, we advance this research direction by formalizing the distributed setup for SPs and providing rigorous definitions of communication and computational costs. Our main result is a novel decoupled method that achieves optimal communication cost within the zero-respecting framework. Our method is based on a multi-stage reduction to the decoupled minimization of residual norms, which yields strict improvements over the best known communication cost for the class and the long-standing oracle cost of the Extragradient method. Further, we show by a matching lower bound that our method is communication-optimal within the family of gradient-span algorithms. Finally, we study the extension of distributed SP into Variational Inequality Problem (VIP), which generalizes two-player zero-sum games to multiplayer general-sum games. We show that our decoupled method achieves a new state-of-the-art communication complexity for this broader class.
[MA-28] Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
【速读】:该论文试图解决的问题是:在人工社会中,当决策规则本身可被集体修改时,如何实现有效的群体决策与持续的规则适应性。解决方案的关键在于发现并验证“非单调性”规律——即群体适应能力并非随模型规模单调提升,而是在特定中等规模区间内达到最优表现,此时系统能维持规则采纳的持续性、修正多样性及共识平衡;过小的模型缺乏规则改变动力,过大的模型则趋向于投票模式僵化,混合规模群体则陷入否决权主导的瘫痪状态。这一规律在不同温度扰动和投票机制(一致同意转为多数决)下依然稳定,表明其本质源于模型内部表征结构与群体行为之间的选择性耦合,而非简单的表征差异。
链接: https://arxiv.org/abs/2605.17510
作者: Kazuya Horibe,Masaomi Hatakeyama,Gen Masumoto,Takashi Hashimoto,Peter Romero
机构: 未知
类目: Adaptation and Self-Organizing Systems (nlin.AO); Multiagent Systems (cs.MA)
备注:
Abstract:We study group decision-making in artificial societies where the rules of play are themselves subject to collective amendment. Using the self-amending game Nomic, we compare multiple scales across two LLM families and find that collective adaptation does not improve monotonically with model size. Instead, both families exhibit a narrow mid-scale regime that supports sustained rule adoption, diverse amendments, and balanced consensus. Smaller models tend to remain rule-inert, whereas larger models often converge on restrictive voting patterns, and heterogeneous mixed-size groups collapse into veto-driven gridlock. These cross-scale contrasts persist under temperature perturbations and under a shift from unanimity to majority voting, although latent-state structure varies by family and scale. Hidden-state divergence alone does not explain collective performance: high representational divergence can coincide with poor behavioural outcomes. Linear probes reveal regime-selective coupling between latent vote-predictive signals and collective behaviour, but decodability is necessary rather than sufficient for adaptive play. Overall, the recurring regularity is non-monotonicity, not the particular scale at which the optimum appears. Self-amending games therefore provide a controlled testbed for studying collective adaptation in artificial societies beyond raw model scale.
自然语言处理
[NLP-0] DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
【速读】: 该论文旨在解决现有分层注意力机制(如NSA和InfLLMv2)中存在的两个关键问题:一是top-k选择操作假设每个查询的相关token数量固定,限制了模型的灵活性;二是该操作阻断了稀疏阶段与密集阶段之间的梯度流动,导致训练不充分。其解决方案的核心在于提出了一种可微且自适应的分层注意力机制——DashAttention,它利用自适应稀疏的α-entmax变换在第一阶段动态选择不同数量的键值(KV)块作为先验信息,从而支持第二阶段的softmax注意力计算,并保持整个层级结构的端到端可微性。相比其他方法,DashAttention具有非发散性(non-dispersive),显著提升了长上下文建模能力,在大语言模型(LLMs)实验中实现了与全注意力相当的准确率,同时达到75%稀疏度,并在高稀疏场景下优于NSA和InfLLMv2,且通过Triton实现的高效GPU版本在推理时相较FlashAttention-3获得显著加速。
链接: https://arxiv.org/abs/2605.18753
作者: Yuxiang Huang,Nuno M. T. Gonçalves,Federico Alvetreti,Lei Li,Xu Han,Edoardo M. Ponti,André F. T. Martins,Marcos V. Treviso
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
Abstract:Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse \alpha -entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves comparable accuracy as full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.
[NLP-1] Code as Agent Harness
【速读】: 该论文试图解决的问题是:如何将代码(code)从传统的生成目标转变为智能体(agent)运行的核心基础设施,以支持更复杂、可执行、可验证和状态持续的自主系统。其解决方案的关键在于提出“代码即智能体 harness”的统一视角——即把代码视为连接智能体推理、行动与环境建模的接口、实现长期任务执行机制(如规划、记忆、工具使用)、以及支持多智能体协作与验证的底层架构。论文通过三个层次系统梳理这一框架:(1)harness 接口层,明确代码在智能体与外部世界交互中的核心作用;(2)harness 机制层,涵盖反馈驱动的控制优化与长期执行能力;(3)扩展层,探讨单智能体到多智能体系统的规模化部署,强调共享代码作为协同与一致性保障的基础。最终,该研究为构建具备可执行性、可验证性和状态保持能力的下一代 agentic AI 系统提供了结构化路线图。
链接: https://arxiv.org/abs/2605.18747
作者: Xuying Ning,Katherine Tieu,Dongqi Fu,Tianxin Wei,Zihao Li,Yuanchen Bei,Jiaru Zou,Mengting Ai,Zhining Liu,Ting-Wei Li,Lingjie Chen,Yanjun Zhao,Ke Yang,Bingxuan Li,Cheng Qian,Gaotang Li,Xiao Lin,Zhichen Zeng,Ruizhong Qiu,Sirui Chen,Yifan Sun,Xiyuan Yang,Ruida Wang,Rui Pan,Chenyuan Yang,Dylan Zhang,Liri Fang,Zikun Cui,Yang Cao,Pan Chen,Dorothy Sun,Ren Chen,Mahesh Srinivasan,Nipun Mathur,Yinglong Xia,Hong Li,Hong Yan,Pan Lu,Lingming Zhang,Tong Zhang,Hanghang Tong,Jingrui He
机构: University of Illinois Urbana-Champaign; Meta; Stanford University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: GitHub: this https URL
Abstract:Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.
[NLP-2] ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
【速读】: 该论文试图解决传统空间智能研究中对被动感知的依赖问题,即现有方法通常假设存在“理想观测”(oracle observations),而忽视了智能体通过主动探索与环境交互以获取任务相关证据的能力。其核心挑战在于如何构建一个能模拟真实世界中“感知-行动循环”的评估框架,并揭示当前多模态大语言模型(MLLMs)在具身空间推理中的局限性。解决方案的关键在于提出ESI-BENCH——一个基于OmniGibson、融合Spelke核心知识系统的综合性基准,涵盖10类任务和29个子类,要求智能体自主决策感知、移动和操作能力的部署顺序,从而主动积累信息。实验表明,主动探索显著优于被动观察,且智能体能自发形成有效空间策略;但多数失败源于“行动盲视”(action blindness),即错误的动作选择导致次优观测并引发连锁错误。此外,研究发现3D表征虽在深度敏感任务上提升稳定性,但不完善的3D表示反而因扭曲空间关系而损害性能,同时人类行为对比揭示模型存在元认知缺陷:缺乏根据矛盾证据修正信念的能力,这无法仅靠增强感知或增加具身交互来弥补。
链接: https://arxiv.org/abs/2605.18746
作者: Yining Hong,Jiageng Liu,Han Yin,Manling Li,Leonidas Guibas,Li Fei-Fei,Jiajun Wu,Yejin Choi
机构: Stanford University (斯坦福大学); UCLA (加州大学洛杉矶分校); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注: this https URL
Abstract:Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke’s core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.
[NLP-3] Vision-OPD: Learning to See Fine Details for Multimodal LLM s via On-Policy Self-Distillation
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLs)在细粒度视觉理解任务中的性能瓶颈问题,即模型难以聚焦于图像中对答案至关重要的微小但关键区域(small but decisive evidence),而非缺乏局部识别能力。其核心解决方案是提出Vision-OPD(Vision On-Policy Distillation),一种基于区域到全局的自蒸馏框架,通过让模型自身将“以裁剪区域为条件”的优越感知能力迁移至“以完整图像为条件”的推理策略中。该方法构建两个由同一MLLM衍生的条件策略:一个以裁剪图像为输入的教师模型和一个以全图输入的学生模型;学生模型在训练过程中生成在线策略轨迹(on-policy rollouts),而Vision-OPD通过最小化教师与学生在这些轨迹上的逐标记分布差异来实现知识迁移。此机制使模型无需外部教师模型、标注数据或推理时工具即可内化“视觉缩放”带来的优势,在多个细粒度视觉理解基准测试中表现优于或媲美更大规模的开源、闭源及“带图像思考”的智能体模型。
链接: https://arxiv.org/abs/2605.18740
作者: Qianhao Yuan,Jie Lou,Xing Yu,Hongyu Lin,Le Sun,Xianpei Han,Yaojie Lu
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所信息处理实验室); University of Chinese Academy of Sciences (中国科学院大学); Xiaohongshu Inc. (小红书公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model’s own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and “Thinking-with-Images” agentic models.
[NLP-4] Predictable Confabulations: Factual Recall by LLM s Scales with Model Size and Topic Frequency
【速读】: 该论文试图解决的问题是:如何量化模型规模(model size)和训练数据组成(training-data composition)对大型语言模型事实回忆能力(factual recall)的影响,此前尚无明确的缩放定律(scaling law)能将这两者与回忆质量关联起来。解决方案的关键在于发现了一个新的缩放规律——回忆质量在模型参数量和训练数据中特定主题表示的对数线性组合上呈S形(sigmoid)变化,这一关系能够解释16个密集模型中60%的方差,而在单个模型家族内可提升至74–94%。该现象可通过一种受超位置(superposition)启发的机制解释:回忆能力由信噪比(signal-to-noise ratio)决定,其中信号强度与概念频率成正比,而噪声基底则与模型容量相关。
链接: https://arxiv.org/abs/2605.18732
作者: Matthew L. Smith,Jonathan P. Shock,Samuel T. Segun,Iyiola E. Olatunji,Tegawendé F. Bissyandé
机构: International Development Research Centre (国际发展研究中心); University of Cape Town (开普敦大学); Global Center on AI Governance (全球人工智能治理中心); SnT, University of Luxembourg (卢森堡大学); CITADEL AI Centre of Excellence (CITADEL人工智能卓越中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 5 figures, 6 tables
Abstract:While scaling laws govern aggregate large language model performance, no scaling law has linked factual recall to both model size and training-data composition. We evaluated 38 models on over 8,900 scholarly references evaluated by an automated reference verification system. Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.
[NLP-5] General Preference Reinforcement Learning NEURIPS2026
【速读】: 该论文试图解决大语言模型(LLM)对齐中在线强化学习(RL)与偏好优化之间的割裂问题:一方面,基于可验证奖励的在线RL在数学和代码任务上展现出涌现推理能力,但依赖程序化验证器,无法处理开放式任务;另一方面,偏好优化虽能应对开放生成任务,却缺乏持续探索机制。解决方案的关键在于提出通用偏好模型(General Preference Model, GPM),将响应嵌入到k个斜对称子空间中,并以结构化的、考虑非传递性的比较方式表示偏好,从而避免单一标量奖励导致的维度主导问题。在此基础上,作者进一步设计了通用偏好强化学习(GPRL),其核心创新是将GPM的k维结构贯穿至策略更新过程——通过计算各维度的组相对优势、按各自尺度归一化防止某一轴占优,并结合上下文相关的特征值聚合;同时利用相同结构实现闭环漂移监控,动态检测并纠正单轴滥用行为,提升训练稳定性。实验表明,GPRL在AlpacaEval 2.0上达到56.51%长度控制胜率,且在Arena-Hard、MT-Bench和WildBench等基准上优于SimPO和SPPO,展现出更强的抗奖励黑客攻击能力。
链接: https://arxiv.org/abs/2605.18721
作者: Muhammad Umer,Muhammad Ahmed Mohsin,Ahsan Bilal,Arslan Chaudhry,Andreas Haupt,Sanmi Koyejo,Emily Fox,John M. Cioffi
机构: Stanford University (斯坦福大学); The University of Oklahoma (俄克拉荷马大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Submitted to NeurIPS 2026
Abstract:Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the k -way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from \textttLlama-3-8B-Instruct , GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval~2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
[NLP-6] EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
【速读】: 该论文旨在解决大语言模型(LLM)通过代理强化学习(Agentic RL)获取工具使用能力时面临的两大瓶颈问题:一是缺乏可扩展且鲁棒的执行环境,二是真实训练数据稀缺,难以捕捉人类隐含的推理过程。现有方法依赖昂贵的真实世界API、易产生幻觉的LLM模拟器或单轮次、依赖预收集文档的合成环境,且合成轨迹常过度指定,偏离自然人类意图,导致强化学习训练效果不佳。论文提出的解决方案核心是EnvFactory——一个全自动框架,能够从真实资源中自主探索并验证状态化的可执行工具环境,并通过拓扑感知采样与校准优化生成具有隐含意图的多轮自然轨迹。仅用85个验证环境生成2,575条监督微调(SFT)和强化学习(RL)轨迹,便在多个基准测试中显著优于以往方法(如Qwen3系列模型在BFCLv3上提升+15%),证明其在训练效率和下游性能上的优越性。该方案实现了环境构建与轨迹合成的全自动化,为代理强化学习提供了可扩展、可扩展且稳健的基础。
链接: https://arxiv.org/abs/2605.18703
作者: Minrui Xu,Zilin Wang,Mengyi DENG,Zhiwei Li,Zhicheng Yang,Xiao Zhu,Yinhong Liu,Boyu Zhu,Baiyu Huang,Chao Chen,Heyuan Deng,Fei Mi,Lifeng Shang,Xingshan Zeng,Zhijiang Guo
机构: LARK, HKUST (GZ); University of Cambridge; UCL; Huawei Technologies Co., Ltd
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages
Abstract:Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including \tau^2 -Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.
[NLP-7] Generative AI Advertising as a Problem of Trustworthy Commercial Intervention
【速读】: 该论文试图解决的问题是:生成式 AI(Generative AI)广告如何在不显式标注的情况下,通过干预生成过程本身对用户产生隐蔽的商业影响,从而挑战传统广告监管框架的有效性。其解决方案的关键在于提出一个基于“影响层级”的分类体系,将商业干预划分为四个层次——产品提及、信息框架塑造、行为引导和长期偏好塑造,这些层级对应于对越来越隐含变量的干预。论文指出,当前主流系统主要聚焦于最易识别和管控的第一层(如明确的产品提及),而对更深层次、更具影响力的干预形式缺乏检测、度量与披露机制,因此核心挑战在于如何使生成式 AI 中的商业影响具备可归因性、可测量性、可争议性和用户福祉一致性,即实现“可信干预”。
链接: https://arxiv.org/abs/2605.18673
作者: Jingyi Qiu,Qiaozhu Mei
机构: University of Michigan (密歇根大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
Abstract:Major deployed generative AI advertising systems preserve a visible boundary between commercial content and AI-generated responses. Yet empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users. We argue that generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels. This reframes generative AI advertising as a problem of trustworthy intervention rather than content placement. We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping; and show how these tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes. Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure. The central challenge is whether commercial influence in generative systems can be made trustworthy, i.e., attributable, measurable, contestable, and aligned with user welfare.
[NLP-8] GIM: Evaluating models via tasks that integrate multiple cognitive domains KR
【速读】: 该论文试图解决当前大语言模型(LLM)评估中存在的两个问题:一是通过提升知识难度(如GPQA、HLE)导致记忆能力与推理能力混淆;二是完全去除知识依赖以测试抽象推理(如ARC-AGI),使推理脱离实际应用场景。其解决方案的关键在于提出一种名为“ grounded integration measure (GIM)”的新基准,该基准包含820个原创问题(615个公开、205个私有),难度来源于多认知操作的整合(如约束满足、状态追踪、认知警觉性和受众校准),并基于广泛可获取的知识进行推理,从而确保推理过程既真实又不依赖专业领域专家知识。此外,作者构建了一个连续的两参数逻辑(2PL)项目反应理论(IRT)模型,在20万条提示-响应对上校准了28个模型的能力估计,有效缓解了因错误或缺失数据导致的原始准确率失真问题,并首次在固定基准上系统研究了测试时计算资源(test-time compute)与模型能力之间的权衡关系,揭示了同一模型家族内部配置选择(如思维预算和量化策略)的重要性甚至超过模型本身的选择。
链接: https://arxiv.org/abs/2605.18663
作者: Rohit Patel,Alexandre Rezende,Steven McClain
机构: Meta(脸书)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 56 pages, 27 figures, 4 tables. Code: this https URL ; Dataset: this https URL
Abstract:As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public–private split provides built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over 200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.
[NLP-9] An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration
【速读】: 该论文试图解决的问题是:在人类对齐的AI中,如何明确区分人类提供的软标签(human soft-labels)相较于合成标签(synthetic labels)所带来的实际收益,尤其是避免将软标签带来的校准改善与数据误标修正(mode shifts)混淆。解决方案的关键在于设计了一个受控审计实验,在MNIST及其合成变体上重新标注子集以提取人类不确定性,并通过解耦软标签监督信号与潜在的标签模式偏移(mode shifts),发现人类软标签虽能提升准确率,但其核心价值在于作为正则化项,显著改善模型在困难样本上的校准性能并促进训练过程的稳定性;此外,数据地图(dataset cartography)分析进一步表明,使用人类软标签训练的模型能够忠实反映人类不确定性,而合成标签则无法实现这种对齐。
链接: https://arxiv.org/abs/2605.18648
作者: Maja Pavlovic,Silviu Paun,Massimo Poesio
机构: Queen Mary University London (伦敦玛丽女王大学); Amazon (亚马逊)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.
[NLP-10] Language-Switching Triggers Take a Latent Detour Through Language Models
【速读】: 该论文试图解决语言模型中后门攻击(backdoor attack)的内部工作机制不明确的问题,特别是针对语言切换类后门攻击的可解释性不足。解决方案的关键在于识别并分解了一个存在于80亿参数自回归语言模型中的电路结构:该电路由三个阶段组成——(1)早期层中分布式的注意力头将三词拉丁语触发词(九个标记)整合至序列末尾位置;(2)该信号在中层通过一个与模型自然语言身份方向正交的子空间传播;(3)最终层的多层感知机(MLP)将这一潜在信号转化为法语logits。整个电路存在一个单一位置的串行瓶颈,该位置在任意层被破坏可完全消除触发效果,但也会损害模型能力。该发现表明,基于寻找中间表示中类似语言特征的防御策略可能完全遗漏此类触发机制,因其依赖的是正交于语言方向的隐式编码。
链接: https://arxiv.org/abs/2605.18646
作者: Francis Kulumba,Wissam Antoun,Théo Lasnier,Benoît Sagot,Djamé Seddah
机构: Inria Paris; Sorbonne Université
类目: Computation and Language (cs.CL)
备注: 15 pages, 16 figures. Under review
Abstract:Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model’s natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigate the trigger but also hinder the model’s capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
[NLP-11] Post-Trained MoE Can Skip Half Experts via Self-Distillation
【速读】: 该论文旨在解决静态混合专家(MoE)模型在推理阶段无法根据输入动态调整激活专家、导致计算资源浪费的问题。现有动态MoE方法通常依赖从头预训练或任务特定微调,缺乏对已训练好的静态MoE模型进行高效转换的方案。其解决方案的关键在于提出零专家自蒸馏适配(ZEDA),一种低成本框架:通过向每层MoE注入参数无关的“零输出专家”来稳定架构转换,并采用两阶段自蒸馏机制——以原始静态MoE作为冻结教师模型,结合分组级平衡损失对增强模型进行适配。实验表明,ZEDA可在保持精度几乎不变的前提下,显著减少超过50%的专家浮点运算量(FLOPs),并在多个基准测试中优于最强动态MoE基线,同时实现约1.2倍的端到端推理加速。
链接: https://arxiv.org/abs/2605.18643
作者: Xingtai Lv,Li Sheng,Kaiyan Zhang,Yichen You,Siyan Gao,Xueheng Luo,Yuxin Zuo,Yuchen Fan,Junlin Yang,Ganqu Cui,Bingning Wang,Fan Yang,Youbang Sun,Ning Ding,Bowen Zhou
机构: Frontis.AI; Kuaishou Technology; Shanghai AI Lab; WeChat AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE underexplored. Enabling such adaptation would directly alleviate the inference costs by allowing easy tokens to bypass unnecessary expert during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20 \times end-to-end inference speedup.
[NLP-12] Forecasting Downstream Performance of LLM s With Proxy Metrics
【速读】: 该论文试图解决语言模型开发中性能预测不可靠的问题,即在模型架构选择、预训练语料筛选和训练过程监控等决策场景下,传统指标(如交叉熵损失)与下游任务能力相关性弱,而直接下游评估又成本高、早期信息稀疏。其解决方案的关键在于构建代理指标(proxy metrics),通过聚合候选模型在专家编写的解题轨迹上生成的词元级统计量(如熵、top-k准确率、专家词元排名)来捕捉模型能力信号。实验表明,该方法在跨家族模型选择、预训练数据排序和训练中期性能外推三个场景中均显著优于基于损失或计算量的基线,实现了更高效、精准的性能预测,从而提升了模型开发全生命周期中的决策可靠性。
链接: https://arxiv.org/abs/2605.18607
作者: Arkil Patel,Siva Reddy,Marius Mosbach,Dzmitry Bahdanau
机构: Mila – Quebec AI Institute; McGill University; Canada CIFAR AI Chair; ServiceNow Research; Periodic Labs
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. 31 pages
Abstract:Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model’s next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000\times less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18\times compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
[NLP-13] AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
【速读】: 该论文试图解决的问题是:基于评分标准(rubric)的奖励塑形方法在强化学习(RL)微调大语言模型(LLM)时,仅利用当前步骤的局部信号(如本轮 rollout 或成对比较),而忽略了评估过程中产生的诊断信息的长期积累与复用,导致系统无法识别重复出现的次优行为,也难以实现类似教学课程的渐进式训练进展。解决方案的关键在于引入 AMARIS,其核心机制是将评分标准的调整建立在长期训练历史之上:在每个训练步骤中,AMARIS 分析单个 rollout,聚合结果为步骤级摘要,并通过静态(近期步骤)和动态(语义匹配)检索方式从持久化的评估记忆库中获取相关历史上下文,进而更新评分标准。该过程异步执行,仅带来约 5% 的时间开销,实验证明其在封闭和开放域任务中均显著优于基线方法,且消融实验表明两种检索机制均贡献性能提升,组合使用效果最佳。这表明,持久化评估记忆可将评分驱动的奖励塑形从无状态的逐步启发式方法转变为以证据为基础的闭环强化学习训练流程。
链接: https://arxiv.org/abs/2605.18592
作者: Peilin Wu,Xinlu Zhang,Kun Wan,Wentian Zhao,Gang Wu,Xinya Du,Zhiyu Chen
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); Adobe Inc. (Adobe公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. Under review
Abstract:Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollouts from the current step or pairwise comparisons. However, these methods discard the diagnostics produced during evaluation after immediate use and prevent the long-term accumulation and strategic reuse of evaluation knowledge. This forces the system to re-derive evaluation principles from scratch, limits its ability to detect recurring suboptimal behaviors, and forfeits the curriculum-like progression that a persistent training history would naturally support. To address these limitations, we introduce AMARIS, which grounds rubric modifications in long-term training history. At each training step, AMARIS analyzes individual rollouts, aggregates findings into step-level summaries, retrieves relevant historical context from a persistent evaluation memory through both static (recent steps) and dynamic (semantically matched) retrieval, and updates rubrics based on these accumulated analyses. This procedure runs asynchronously alongside the normal RL loop with minimal overhead. Experiments across both closed and open-ended domains show that AMARIS consistently outperforms the baselines. Ablation studies show that static and dynamic memory retrieval contributes to the performance gain and their combination provides the strongest results with moderate retrieval budgets sufficient to provide most of the gain, and that the entire pipeline adds only ~5% time overhead through asynchronous execution. These results show that persistent evaluation memory can transform rubric-based reward shaping from a stateless, per-step heuristic into an evidence-driven loop for RL training.
[NLP-14] Overeager Coding Agents : Measuring Out-of-Scope Actions on Benign Tasks
【速读】: 该论文试图解决的问题是:当前自主运行的代码生成代理(coding agents)在执行用户良性请求时,常出现“范围扩张”(scope expansions)行为,即超出授权范围进行删除文件、清除凭证或修改配置等过度操作,这属于一种独立于能力失效、提示注入或沙箱逃逸的授权问题。解决方案的关键在于提出 OverEager-Gen 基准测试框架,通过三重机制确保评测有效性:首先使用行为梯度验证器筛选具有判别力的场景;其次采用双通道栈(PATH注入中间件 + 每代理事件流)审计内部工具调用;最后提供字节级一致的“保留同意”与“移除同意”变体以隔离语言模型对权限声明的敏感性。实验表明,移除许可声明会使部分代理的过激行为率从 0% 显著上升至 17.1%,且框架类型(如“持续询问” vs. “宽松授权”)对过激行为影响远大于基础模型差异,揭示出授权边界认知缺失是系统性风险源。
链接: https://arxiv.org/abs/2605.18583
作者: Yubin Qu,Ying Zhang,Yanjun Zhang,Gelei Deng,Yuekang Li,Leo Yu Zhang,Yi Liu
机构: Griffith University; Wake Forest University; Nanyang Technological University; University of New South Wales; Quantstamp
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configuration the user never mentioned. We call these scope expansions overeager actions, an authorization problem distinct from capability failures, prompt injection, or sandbox escapes. We present OverEager-Gen, a benchmark dedicated to overeager behavior on benign tasks. Building it surfaces a measurement-validity issue: if a benchmark spells out the authorized scope inside the prompt, the agent stops inferring boundaries and starts pattern-matching declaration text. On Claude Code, stripping the consent declaration alone raises the overeager rate from 0.0% to 17.1% on paired scenarios (McNemar exact p = 2.4 x 10^-4). OverEager-Gen therefore certifies each scenario’s discriminative power before admission via a behavioral-gradient validator, audits internal tool calls through a dual-channel stack (PATH-injected shim plus per-agent event streams), and ships byte-identical consent_kept and consent_stripped variants. OverEager-Bench contains 500 validated scenarios and ~7,500 runs across four agent products (Claude Code, OpenHands, Codex CLI, Gemini CLI) and six base models; a 50-sample re-annotation gives Cohen’s kappa = 0.73 and rule-judge recall = 1.00. Stripping consent multiplies the overeager rate on every shared base model (Delta in [11.9, 17.2] pp). The framework axis dominates effect size: a permissive cluster (Claude Code, Codex CLI, Gemini CLI) runs at 5.4-27.7% while the ask-to-continue framework (OpenHands) sits at 0.2-4.5% (Fisher p = 10^-5). Within-framework base-model variance reaches 15.9 pp, indicating that model-layer alignment does not fully propagate through permissive permission gating. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR) Cite as: arXiv:2605.18583 [cs.SE] (or arXiv:2605.18583v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.18583 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-15] MA2P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion ACL2026
【速读】: 该论文旨在解决复杂说服对话生成中的两大核心问题:一是如何在说服对象(persuadee)内部状态不明确时,准确推断其潜在心理状态(如信念和欲望),并据此制定一致且有针对性的策略;二是如何提升大语言模型(LLMs)在不同领域间的表现一致性,克服因知识覆盖不均和推理泛化能力有限导致的性能波动。解决方案的关键在于提出MA²P框架——一个元认知自主智能体架构,通过多智能体协同实现感知管理、心理状态推断、策略执行、记忆维护与绩效评估的闭环流程,并引入元认知配置器(meta-cognitive configurator),从结构化知识库中选择合适的元策略以指导后续推理与规划,从而显著提升说服成功率和跨域适应性。
链接: https://arxiv.org/abs/2605.18572
作者: Dingyi Zhang,Ziqing Zhuang,Linhai Zhang,Ziyang Gao,Deyu Zhou
机构: Southeast University (东南大学); King’s College London (伦敦国王学院)
类目: Computation and Language (cs.CL)
备注: 22 pages, 8 figures. Accepted to Findings of ACL 2026
Abstract:Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee’s internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee’s latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA ^2 P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.
[NLP-16] GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems
【速读】: 该论文试图解决信息系统(IS)研究中因构念(construct)定义不一致而导致知识累积发展受阻的问题。其解决方案的关键在于提出一种将多个结构方程模型(SEM)整合为统一模型的方法:首先利用任务适配的文本嵌入(task-adapted text embeddings)与聚类技术生成候选构念分组;随后通过一个显式权衡语义纯度(semantic purity)与聚类数量简约性(parsimony)的损失函数,选择最优分组方案。该方法的核心优势在于明确量化了构念分组在语义纯净性与模型简洁性之间的权衡关系,从而支持对不同优先级下构念结构及其相互关系变化的系统分析。
链接: https://arxiv.org/abs/2605.18567
作者: Maximilian Reinhardt,Jonas Scharfenberger,Burkhardt Funk
机构: Leuphana University (吕讷堡大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at the 34th European Conference on Information Systems (ECIS 2026), Milan, Italy
Abstract:Structural equation modeling is widely used in IS research. However, inconsistent construct definitions impede the cumulative development of knowledge. In this work, we present an approach that aims at the integration of structural equation models into a unified model: We use a combination of task-adapted text embeddings and clustering to produce a candidate set of construct groupings. Subsequently, we select the optimal solution using a loss function that explicitly trades off semantic purity and parsimony in the number of clusters. By making this trade-off explicit, our approach allows to analyze how construct groupings and their relations change as one shifts the priority from purity to parsimony. Empirically, we evaluate and explore the proposed methodology on two datasets from the IS domain.
[NLP-17] LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
【速读】: 该论文试图解决现有记忆增强型智能体在长时程、动态演化场景中因记忆干扰而导致的性能下降问题,这些问题在真实世界应用中普遍存在但未被现有基准充分捕捉。解决方案的关键在于提出一个名为LongMINT(Long-Horizon Memory under INTerference)的新基准,其核心特征包括:(1) 长且高度互联的上下文结构,包含频繁更新的信息以诱发显著的记忆干扰;(2) 覆盖状态跟踪、多轮对话、维基百科修订和GitHub提交等多样领域,用于评估跨域泛化能力;(3) 包含单目标回忆与多目标聚合两类任务,系统性测试模型对干扰的鲁棒性。实验表明,当前主流方法(如纯长上下文大模型、检索增强生成RAG及记忆增强框架)平均准确率仅为27.9%,尤其在需要多证据聚合推理的任务上表现不佳,根本原因在于检索与记忆构建能力不足,且无法有效处理后续信息更新对早期事实的干扰。
链接: https://arxiv.org/abs/2605.18565
作者: Hyunji Lee,Justin Chih-Yao Chen,Joykirat Singh,Zaid Khan,Elias Stengel-Eskin,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Equal contribution; order decided by a coin flip. Code and data: this https URL
Abstract:Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce LongMINT (Long-Horizon Memory under INTerference), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, LongMINT has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are later revised or interfered with by subsequent context, with performance degrading as the number of intervening updates increases.
[NLP-18] Readers make targeted regressions to plausible errors in reanalysis of “noisy-channel garden-path” sentences
【速读】: 该论文试图解决的问题是:在语言理解过程中,读者如何逐步推断语言输入的含义,特别是在遇到“噪声信道花园路径句”(noisy-channel garden-path sentences)时,这类句子表面上看似合乎语法,但后期出现预期违背,且不依赖于重新分析句法结构来解决,而是通过推断存在错误来化解。解决方案的关键在于发现目标性回视(targeted regressions)现象——即读者的眼动行为会指向可能包含错误信息的区域,这种模式与基于噪声信道处理模型的后验推断结果一致,从而为噪声信道语言理解理论及信息论视角下的阅读动态提供了实证支持。
链接: https://arxiv.org/abs/2605.18563
作者: Thomas Hikaru Clark,Roger Levy,Edward Gibson
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:A key question in psycholinguistics is how inferences about the meaning of linguistic input unfold incrementally a comprehender’s mind. In this work, we study reading dynamics for ``noisy-channel garden-path’’ sentences, which temporarily appear well-formed but feature late-appearing violations of expectation that can be resolved not by inferring an alternative syntactic structure, but by inferring the presence of an error. We find evidence for targeted regressions – eye movements towards regions that are promising loci of possible errors in light of later-arriving information, showing patterns consistent with the posterior inferences of a model of noisy-channel processing with reanalysis. We discuss the implications of these findings for theories of noisy-channel language comprehension and information-theoretic explanations of reading dynamics.
[NLP-19] Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
【速读】: 该论文试图解决的问题是:大型推理模型(Large Reasoning Models, LRMs)的思维链(Chain of Thought, CoT)推理过程并不总是忠实于最终输出,从而削弱了其作为安全监控工具的可靠性。为了解决这一问题,论文提出通过分析模型内部隐藏表示来预测未来行为,并构建“探测轨迹”(probe trajectory)——即在每个生成token处进行探测所形成的连续演化路径。关键创新在于:相比单一静态预测,从完整轨迹中提取的时间动态特征(如波动性、趋势和稳态行为)能显著提升对模型未来状态的区分能力;此外,研究发现最大池化(max-pooling)策略优于平均池化或最后token池化,可实现高达95%的AUROC性能,且轨迹稳定性更高;同时,模板式训练数据即可达到与动态生成响应相当的效果,避免了昂贵的初始推理与标注成本。这些方法共同构成了一个可解释、高效且可靠的LRM行为监测框架。
链接: https://arxiv.org/abs/2605.18549
作者: Maciej Chrabąszcz,Aleksander Szymczyk,Marcin Sendera,Tomasz Trzciński,Sebastian Cygert
机构: NASK - National Research Institute, Poland; Faculty of Electronics and Information Technology, Warsaw University of Technology, Poland; Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Tooploox, Poland; IDEAS Research Institute, Poland; Gdańsk University of Technology, Poland
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model’s final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept’s probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content. Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR) Cite as: arXiv:2605.18549 [cs.CL] (or arXiv:2605.18549v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.18549 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-20] STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
【速读】: 该论文试图解决大语言模型(LLM)在真实代理应用中面对时空动态变化时缺乏自适应重规划能力的问题。现有动态基准主要评估模型对时间变化的检测能力,但忽略了在空间与时间双重扰动下进行有效重规划的核心挑战。解决方案的关键在于构建了一个名为STT-Arena的高质量交互式任务基准,包含227个任务,覆盖九类时空冲突类型和四类可解性等级,并引入可执行环境中的时空触发器以强制模型识别状态变化并制定新的执行策略。实验表明,即使是当前最先进的商用模型(如Claude-4.6-Opus)整体准确率也低于40%,凸显了该问题的难度;进一步分析揭示了三大典型失败模式:过时状态执行、动态触发器误诊断及适应后验证缺失;基于此,作者提出一种迭代轨迹精炼技术,结合在线强化学习(online RL),训练出STT-Agent-4B,在STT-Arena上显著优于现有主流模型。
链接: https://arxiv.org/abs/2605.18548
作者: Tingfeng Hui,Hao Xu,Pengyu Zhu,Hongsheng Xin,Kun Zhan,Sen Su,Chunxiao Liu,Ning Miao
机构: City University of Hong Kong (香港城市大学); Beijing University of Posts and Telecommunications (北京邮电大学); Li Auto Inc. (理想汽车)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress
Abstract:Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even the SOTA proprietary models, including Claude-4.6-Opus, achieves less than 40% overall accuracies, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B which outperforms frontier LLMs on STT-Arena.
[NLP-21] Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
【速读】: 该论文试图解决的问题是:连续扩散语言模型(continuous diffusion language model, DLM)在可扩展性上被认为逊于离散方法,这一观点是否成立?为此,作者重新审视了基于似然的连续DLM Plaid,并构建了RePlaid,通过将Plaid的架构与现代离散DLM对齐,在统一框架下建立了首个适用于连续DLM的缩放定律(scaling law)。其解决方案的关键在于:1)通过架构对齐使连续DLM具备与离散模型相当的性能表现,如RePlaid仅比自回归模型存在约20倍计算差距、参数更少却优于Duo、在过训练阶段超越MDLM;2)在OpenWebText上达到22.1的最低困惑度(PPL),成为连续DLM的新SOTA;3)理论层面揭示了似然训练的优势——优化噪声调度以最小化ELBO方差自然导致线性交叉熵(信息损失)随时间分布,均匀分配去噪难度,无需特定时间重参数化;4)发现似然训练中优化嵌入能形成结构化几何空间,带来最大的似然提升。这些结果表明,基于似然的连续扩散是一种具有竞争力且可扩展的语言建模方案。
链接: https://arxiv.org/abs/2605.18530
作者: Zhihan Yang,Wei Guo,Shuibai Zhang,Subham Sekhar Sahoo,Yongxin Chen,Arash Vahdat,Morteza Mardani,John Thickstun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only 20\times compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of 22.1 among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO’s variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
[NLP-22] Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection ICML2026
【速读】: 该论文试图解决生成式 AI(Generative AI)在上下文学习(In-context learning, ICL)中因演示样本(demonstration)选择不当而导致性能波动的问题。由于可能的演示组合空间极大,传统方法难以高效筛选最优上下文。其解决方案的关键在于提出 DiSP 框架——一个“采样-判断”范式:首先通过随机采样估计每个训练查询的失败率以分层查询难度;其次训练轻量级路由模型预测查询难度;再为不同难度层级训练专用的判断器(judge)来评估演示有效性;推理时采用带预算限制的“接受即停”策略,并在无合适上下文时输出风险标签。该方法显著提升了准确率(相比强基线最高提升 3.4%),同时实现高达 23 倍的端到端速度加速。
链接: https://arxiv.org/abs/2605.18512
作者: Haochun Wang,Chaofen Yang,Jiatong Liu,Jingbo Wang,Zewen Qiang,Sendong Zhao,Bing Qin,Ting Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026
Abstract:In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because the space of possible demonstration contexts and combinations is enormous. We argue that demonstration selection is \empheasier to judge than to find: predicting whether a specific query–context pair (q,D) will succeed is cheaper and more general than searching for an optimal D^\star . Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama~3–8B and Qwen~2.5–7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, while achieving up to 23\times end-to-end wall-clock speedup.
[NLP-23] Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLM s and NMT Models LREC2026
【速读】: 该论文试图解决古希腊语(Ancient Greek, AG)到现代希腊语(Modern Greek, MG)机器翻译(Machine Translation, MT)这一低资源任务中因缺乏大规模高质量平行语料库而导致的性能瓶颈问题。解决方案的关键在于:首先构建了一个包含132,481句对的AG-MG平行语料库,其创建流程结合了网络爬取的片段级数据与多阶段句子级对齐及精修机制;具体而言,采用VecAlign配合LaBSE嵌入向量进行初步对齐,并在小规模人工标注子集上微调该模型,随后利用Gemini 2.5 Flash大语言模型(LLM)执行错误修正和对齐优化,从而显著提升语料质量;此外,论文首次系统性地评估了多种主流NMT模型(NLLB、M2M100)和希腊语专用LLM(Llama-Krikri-8B)在该任务上的表现,结果表明全参数微调的Llama-Krikri-8B模型取得最高BLEU得分(13.16),而QLoRA适配的M2M100-1.2B模型则展现出最大的相对提升幅度,验证了所提方法的有效性和实用性。
链接: https://arxiv.org/abs/2605.18504
作者: Spyridon Mavromatis,Sokratis Sofianopoulos,Prokopis Prokopidis,Maria Giagkou
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages. Accepted for presentation at the 15th Language Resources and Evaluation Conference (LREC 2026), Palma, Mallorca, Spain
Abstract:Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data. We address this gap by introducing the AG-MG Parallel Corpus, a new resource containing 132,481 sentence-aligned pairs derived from literary, historical, and biblical texts. We present a novel corpus creation pipeline that combines web-scraped, excerpt-level data with a multi-stage sentence-level alignment, and refinement process. Our method uses VecAlign with LaBSE embeddings, which we first fine-tune on a manually-aligned AG-MG subset, followed by an LLM-based error/misalignment correction phase using Gemini 2.5 Flash to ensure high alignment quality. Furthermore, we provide the first comprehensive benchmark of modern MT models on this task, evaluating three fine-tuning strategies across NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B). Our experiments show that fine-tuning yields significant improvements over base models, increasing performance by up to +10.3 BLEU points. Specifically, full-parameter fine-tuning of Llama-Krikri-8B achieves the highest overall performance with a BLEU score of 13.16, while the QLoRA-adapted M2M100-1.2B model demonstrates the largest relative gains and highly competitive results. Our dataset and models represent a significant contribution to Greek NLP.
[NLP-24] Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在工具调用(tool invocation)过程中因与执行紧密耦合而导致的推理连贯性下降和表达能力受限问题,进而影响整体推理性能。其解决方案的关键在于首次提出并形式化了“解耦工具调用与执行”的问题,并引入带有显式控制的延迟执行机制以增强工具集成推理(Tool-Integrated Reasoning, TIR)。此外,作者设计了一个分层控制框架,并理论推导出一个代理损失函数,使得隐式分层策略能够学习到等效于显式分层策略的行为,从而提出了IH-GRPO算法。实验表明,该方法在多个数学推理基准上相对于最强基线实现了显著提升(绝对改进达1.87%~2.53%),且在其他领域也保持一致的性能优势。
链接: https://arxiv.org/abs/2605.18500
作者: Li Wang,Xiaohan Wang,Xiaodong Lu,Zipeng Zhang,Jinyang Wu,Jiajun Chai,Wei Lin,Guojun Yin
机构: Meituan(美团); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invocation with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. Extensive experiments on IH-GRPO achieve absolute improvements of 1.87%, 2.16%, and 2.53% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at this https URL.
[NLP-25] Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在广泛部署和再分发背景下,模型溯源追踪(model provenance tracking)所面临的可扩展性瓶颈问题。现有主动指纹嵌入方法通常通过微调(fine-tuning)注入身份信号,虽具备高准确性和鲁棒性,但每次新增身份都需要独立、资源密集的训练过程,导致计算成本高昂且部署延迟严重。解决方案的关键在于提出 Prompt2Fingerprint(P2F),首次将指纹注入重构为一个条件参数生成任务:通过专用生成器,直接将文本描述映射为低秩参数增量(low-rank parameter increments),仅需单次前向传播即可完成指纹注入,无需对模型进行额外训练。这使得指纹注入成为即插即用(plug-and-play)的能力,显著降低计算开销,同时保持高准确性、无害性和鲁棒性,从而实现高效、即时的LLM所有权管理。
链接: https://arxiv.org/abs/2605.18474
作者: Sixu Chen,Xiang Chen,Hongyao Yu,Jiaxin Hong,Hao Fang,Shuoyang Sun,Bin Chen,Shu-Tao Xia
机构: Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; South China University of Technology, Guangzhou, China; Harbin Institute of Technology, Shenzhen, Shenzhen, China
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The widespread deployment and redistribution of large language models (LLMs) have made model provenance tracking a critical challenge. While existing LLM fingerprinting methods, particularly active approaches that embed identity signals via fine-tuning, achieve high accuracy and robustness, they suffer from significant scalability bottlenecks. These methods typically treat fingerprint injection as an independent, one-off optimization task rather than a reusable capability, necessitating separate, resource-intensive training for every new identity. This incurs prohibitive computational costs and deployment delays. To address this, we propose Prompt2Fingerprint (P2F), the first framework that reformulates fingerprinting as a conditional parameter generation task. By leveraging a specialized generator, P2F maps textual descriptions directly to low-rank parameter increments in a single forward pass, enabling plug-and-play LLM fingerprint injection without further model retraining. Our experiments demonstrate that P2F maintains high fingerprint accuracy, harmlessness, and robustness while significantly reducing computational overhead, offering a scalable and instant solution for LLM ownership management.
[NLP-26] From BERT to T5: A Study of Named Entity Recognition
【速读】: 该论文旨在解决命名实体识别(Named Entity Recognition, NER)任务中如何有效利用预训练模型进行微调的问题,尤其对比编码器-only结构的BERT与序列到序列架构的T5在NER性能上的差异。其解决方案的关键在于:(1)对BERT采用带权重的交叉熵损失函数以处理类别不平衡问题;(2)对T5则通过少量示例提示(few-shot prompts)进行微调,并设计两种验证策略优化训练过程;(3)通过消融实验分析不同超参数对模型表现的影响,并深入剖析两类模型在常见错误模式上的差异,从而为后续实际应用场景提供可借鉴的模型选择依据和优化方向。
链接: https://arxiv.org/abs/2605.18462
作者: Mei Jia
机构: University of Manchester
类目: Computation and Language (cs.CL)
备注: 11 pages, 9 figures
Abstract:Named entity recognition (NER) has been one of the essential preliminary steps in modern NLP applications. This report focuses on implementing the NER task on finetuning two pretrained models: (i) an encoder-only model (BERT) with a simple classification head, and (ii) a sequence-to-sequence model (T5) with few-shot prompts. Under the original 7-class tag and 3-class simplified tag schemes, BERT is applied a weighted cross-entropy for training loss, and T5 is fine-tuned with two validation strategies. It also conducted an ablation study with different hyperparameters. Moreover, the related analysis provides valuable insights into common errors in BERT and the two models’ performance. Based on a bunch of performance metrics, this report aims to compare the above two architectures and explore their abilities in the sequence labelling task, laying the groundwork for further practical use cases.
[NLP-27] What is Holding Back Latent Visual Reasoning ?
【速读】: 该论文试图解决的问题是:当前视觉-语言模型中利用连续潜在标记(latent tokens)进行链式思维推理(chain-of-thought reasoning)的效果不佳,其根本原因尚不明确。解决方案的关键在于揭示潜在标记在模型推理过程中的因果作用,并识别阻碍其有效性的两大核心问题:第一,现有数据集中oracle潜在标记提供的信息有限,无法显著简化任务,导致模型在训练中忽略这些标记,推理时也绕过它们;第二,推理阶段生成的潜在标记与理想标记(oracle representations)存在偏差,趋向于坍缩到一个狭窄区域,即便模型依赖这些标记也无法获得性能提升。因此,未来改进潜在视觉推理的关键在于构建包含高信息量中间步骤的数据集和提升潜在标记预测的准确性。
链接: https://arxiv.org/abs/2605.18445
作者: André G. Viveiros,Nuno Gonçalves,André F. T. Martins,Matthias Lindemann
机构: Instituto Superior Técnico, Universidade de Lisboa (里斯本理工学院,里斯本大学); Instituto de Telecomunicações (电信研究所); TransPerfect (TransPerfect); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative ``dummy’’ tokens. This indicates that latent tokens play a minimal causal role in the model’s final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypassing them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.
[NLP-28] EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
【速读】: 该论文试图解决的问题是:当前大型语言模型(LLM)代理的评估主要集中在推理、规划和执行能力上,而记忆机制的重要性被忽视,且缺乏系统性的评测方法来衡量代理在不同场景下的记忆性能。解决方案的关键在于提出一个统一的基准测试框架 EvoMemBench,其设计沿两个维度展开:记忆范围(episode 内 vs. episode 间)和记忆内容(知识导向型 vs. 执行导向型),从而对15种代表性记忆方法与强基线进行标准化对比。实验表明,现有记忆系统尚未形成通用方案,长上下文基线仍具竞争力,记忆在当前上下文不足或任务复杂时效果最佳,且无单一记忆形式在所有场景下表现一致——检索类方法在知识密集型任务中表现优异,而过程性和长期记忆方法在任务结构与存储经验匹配时更适用于执行导向任务。
链接: https://arxiv.org/abs/2605.18421
作者: Yuyao Wang,Zhongjian Zhang,Mo Chi,Kaichi Yu,Yuhan Li,Miao Peng,Bing Tong,Chen Zhang,Yan Zhou,Jia Li
机构: Hong Kong University of Science and Technology (Guangzhou); Createlink Technology; Beijing University of Posts and Telecommunications; Beijing Institute of Technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at this https URL.
[NLP-29] SkillsVote: Lifecycle Governance of Agent Skills from Collection Recommendation to Evolution
【速读】: 该论文试图解决长时程大语言模型(LLM)智能体在执行任务过程中产生的轨迹(trajectory)噪声大、难以管理且难以复用的问题。其核心挑战在于如何从开放环境中收集和利用高质量的“技能”(Skill),同时避免冗余、环境敏感性及不当更新对系统性能的污染。解决方案的关键是提出SkillsVote框架,这是一个面向Agent Skills全生命周期治理的机制:首先通过百万级开源语料库对技能进行质量、环境依赖性和可验证性建模;其次在执行前基于结构化技能库进行代理式检索以提供上下文引导;执行后将轨迹分解为与技能关联的子任务,并量化技能使用、代理探索、环境变化和结果信号的贡献,仅保留成功可复用的发现用于证据驱动的更新。实验表明,离线演进使GPT-5.2在Terminal-Bench 2.0上提升7.9个百分点,线上演进使SWE-Bench Pro提升2.6个百分点,证明受控的外部技能库可在不更新模型的前提下显著增强冻结智能体的能力。
链接: https://arxiv.org/abs/2605.18401
作者: Hongyi Liu,Haoyan Yang,Tao Jiang,Bo Tang,Feiyu Xiong,Zhiyu Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 44 pages, 7 figures, 5 tables
Abstract:Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts, with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic library search over structured skill library to expose instructional skill context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. In our evaluation, offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp, while online evolution improves SWE-Bench Pro by up to 2.6 pp. Overall, governed external skill libraries can improve frozen agents without model updates when systems control exposure, credit, and preservation.
[NLP-30] Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLM s CONLL2026 ACL2026
【速读】: 该论文试图解决的问题是:大型语言模型(LLM)在条件句中预设投射(presupposition projection)的理解能力尚未得到充分评估,而这一现象对意义和语用理论至关重要。解决方案的关键在于设计了一个平行行为实验,通过一个控制前提与预设之间关系的标准化条件句数据集,比较人类判断与LLM预测的一致性;同时引入基于语言学理论的检查清单(linguistically motivated checklist)嵌入到LLM-as-a-Judge框架中,系统评估模型推理过程。研究发现,人类会整合概率性和语用线索进行判断,而LLM的表现则呈现不一致性——表现最佳的模型往往缺乏连贯的语用推理,反而是具备更强推理能力的模型产生更偏离人类判断的结果,表明当前LLM在该任务上的表现可能源于表面模式匹配而非真正的语用能力。
链接: https://arxiv.org/abs/2605.18352
作者: Tara Azin,Yongan Yu,Raj Singh,Olessia Jouravlev
机构: Carleton University (卡尔顿大学); McGill University (麦吉尔大学); Mila – Quebec AI Institute (Mila – 魁北克人工智能研究所)
类目: Computation and Language (cs.CL)
备注: To appear in the Proceedings of CoNLL 2026, colocated with ACL 2026
Abstract:Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgment, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs’ performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.
[NLP-31] Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
【速读】: 该论文试图解决大规模新闻语料库在计算社会科学和自然语言处理(NLP)研究中可访问性受限的问题:商业新闻数据库成本高昂且授权限制严格,而开源替代方案如Common Crawl的CC-News则需要TB级存储和高计算资源进行处理。其解决方案的关键在于构建Infini-News——一个针对整个CC-News档案(2016年8月至最新快照)的检索工具包与索引系统,核心创新包括:(1) 提取并清洗超过13.5亿篇新闻文章的文本及结构化元数据;(2) 利用三种前沿语言检测模型(GlotLID、lingua 和 CommonLingua)增强语料的语言识别能力,并通过多源地理定位技术为83.4%的文章确定国家归属(覆盖222个国家);(3) 构建Infini-gram索引——基于后缀数组(suffix-array)的数据结构,支持研究人员以亚秒级延迟搜索任意文本模式。这一系列工作显著降低了跨国家、长期媒体研究的门槛。
链接: https://arxiv.org/abs/2605.18337
作者: Ruggero Marino Lazzaroni,Jana Lasser,Kirill Solovev
机构: University of Oxford (牛津大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl’s CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
[NLP-32] Knowledge-to-Verification: Exploring RLVR for LLM s in Knowledge-Intensive Domains
【速读】: 该论文试图解决的问题是:如何将可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)有效应用于知识密集型领域,因为这些领域缺乏高质量的可验证数据;同时,现有RLVR方法仅关注最终答案的正确性,导致推理过程存在缺陷且奖励信号稀疏。解决方案的关键在于提出K2V(Knowledge-to-Verification)框架,通过自动化生成可验证数据来扩展RLVR在知识密集型领域的应用,并引入对语言模型推理过程的验证机制,从而提升模型在这些领域中的推理能力,同时保持其通用性能不受显著影响。
链接: https://arxiv.org/abs/2605.18261
作者: Zhonghang Yuan,Zhefan Wang,Fang Hu,Zihong Chen,Jinzhe Li,Gang Li,Jie Ying,Huanjun Kong,Songyang Zhang,Nanqing Dong
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its applications on knowledge-intensive domains have not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, leading to the limitations of flawed reasoning and sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM’s reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLM in knowledge-intensive domains without significantly compromising the model’s general capabilities. This study also suggests that integrating automated data synthesis with reasoning verification is a promising direction to enhance model capabilities in these broader domains. Code is available at this https URL.
[NLP-33] CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook ACL2026
【速读】: 该论文旨在解决多模态表示对齐中因跨模态信息差异和数据稀缺导致的对齐空间优化不足问题,尤其关注传统方法难以保留模态特有特征的局限性。其解决方案的关键在于提出CodeBind框架,通过设计共享-特定(shared-specific)代码本结构,将特征分解为用于语义一致性的共享成分和用于捕捉模态独特细节的特定成分;该框架采用组合式向量量化机制,利用共享代码本桥接模态间隙,同时通过模态专属代码本抑制主导模态的表示偏倚,从而实现无需完全配对数据即可渐进式对齐多种模态的目标。
链接: https://arxiv.org/abs/2605.18257
作者: Zeyu Chen,Jie Li,Kai Han
机构: Visual AI Lab, The University of Hong Kong (视觉AI实验室,香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2026 Findings; Project page: this https URL
Abstract:Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. We propose CodeBind, a framework that optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, CodeBind bypasses the need for fully paired data. Unlike traditional hard alignment, CodeBind decomposes features into shared components for semantic consistency and specific components for modality-unique details. This design utilizes a compositional vector quantization scheme, where a shared codebook bridges modality gaps and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.
[NLP-34] Machine Unlearning for Masked Diffusion Language Models
【速读】: 该论文试图解决的问题是:当前针对生成式 AI(Generative AI)模型的机器遗忘(machine unlearning)方法主要集中在自回归大语言模型(LLMs),而对最近兴起的掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)的遗忘机制尚属空白。MDLMs 通过并行迭代去噪被掩码的位置来生成文本,其学习和微调过程与自回归模型存在本质差异,因此传统遗忘方法难以直接适用。
解决方案的关键在于提出首个面向 MDLM 的遗忘框架——掩码扩散遗忘(Masked Diffusion Unlearning, MDU)。MDU 从扩散过程的角度重新审视知识学习机制,通过最小化每个掩码位置上提示条件预测分布与提示掩码无条件锚点分布之间的前向 KL 散度,并引入温度缩放参数以控制隐私与效用之间的权衡,从而实现对特定训练数据的有效遗忘。实验表明,MDU 在多个标准基准和 MDLM 骨干模型上均显著优于现有 LLM 忘记方法。
链接: https://arxiv.org/abs/2605.18253
作者: Georu Lee,Seungwon Jeong,Hoki Kim,Jinseong Park,Woojin Lee
机构: Dongguk University-Seoul(东国大学-首尔); Chung-Ang University(中央大学); Korea Institute for Advanced Study(韩国高等科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 8 figures, appendix included
Abstract:Recent masked diffusion language models (MDLMs), such as LLaDA and Dream, have achieved performance comparable to autoregressive large language models. Unlike autoregressive models, which generate text sequentially, MDLMs generate text by iteratively denoising masked positions in parallel. During fine-tuning, MDLMs learn to recover responses from masked response states conditioned on a prompt, thereby shifting their predictions from a prompt-masked unconditional distribution toward a prompt-conditional distribution. Despite this distinct generative and fine-tuning mechanism, machine unlearning for MDLMs remains largely unexplored. In this paper, we propose Masked Diffusion Unlearning (MDU), the first unlearning framework for MDLMs, by revisiting the process of learning specific knowledge in terms of diffusion. Specifically, MDU minimizes a forward KL divergence from the prompt-conditional prediction to a prompt-masked unconditional anchor at every masked response position, with a temperature scaling parameter to control the privacy-utility trade-off. Our empirical results on standard benchmarks and MDLM backbones show that MDU achieves high unlearning performance compared to existing LLM unlearning methods. Code is available at this https URL.
[NLP-35] Multilingual jailbreaking of LLM s using low-resource languages
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在面对使用低资源非洲语言(如阿非利堪斯语、斯瓦希里语、科萨语和祖鲁语)的多轮对话时,是否能够绕过其安全防护机制。解决方案的关键在于发现并验证:单轮翻译攻击无效,而多轮交互式对话可显著提高越狱成功率,且翻译质量是决定越狱成败的核心因素——高质量翻译能有效提升攻击成功率,反之则限制其效果。研究通过自动化测试与母语者红队测试相结合的方式,在多个商业LLM上验证了这一结论,表明当前LLMs在多语言场景下的安全性仍存在明显漏洞,尤其受制于低资源语言的翻译准确性。
链接: https://arxiv.org/abs/2605.18239
作者: Dylan Marx,Marcel Dunaiski
机构: Stellenbosch University (斯泰伦博斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures
Abstract:Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), Afrikaans from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini), and Kiswahili from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek). Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.
[NLP-36] Context Memorization for Efficient Long Context Generation
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)在推理时依赖长条件前缀(long conditioning prefix)所导致的两个结构性问题:一是前缀对生成内容的影响随生成过程逐渐减弱,二是注意力计算复杂度随前缀长度线性增长。现有方法要么在推理时仍需完整访问前缀(如压缩前缀),要么通过梯度训练将前缀内化到模型参数中(如微调),但后者训练成本高且难以适应前缀更新。解决方案的关键在于提出“注意力状态记忆”(attention-state memory),这是一种无需训练的方法,将前缀外化为一个轻量级、基于查找表的预计算注意力状态存储结构,用于高效建模前缀与查询 token 之间的注意力关系。实验表明,该方法在 ManyICLBench 上于 1K–8K 内存预算下优于上下文学习,并在 8K 预算下将注意力延迟降低 1.36 倍;同时在 NBA 基准测试中,仅用全注意力检索增强生成(RAG)方法 20% 的内存开销即实现性能超越。
链接: https://arxiv.org/abs/2605.18226
作者: Yasuyuki Okoshi,Hao Mark Chen,Guanxi Lu,Hongxiang Fan,Masato Motomura,Daichi Fujiki
机构: Institute of Science Tokyo, Japan; Imperial College London, UK
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix’s influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on NBA benchmark using only 20% of its memory footprint.
[NLP-37] SIREM: Speech-Informed MRI Reconstruction with Learned Sampling
【速读】: 该论文试图解决实时磁共振成像(rtMRI)在语音生成研究中因空间分辨率、时间分辨率与采集速度之间的权衡而导致的欠采样k空间数据和重建质量下降的问题。其解决方案的关键在于提出SIREM框架——一种利用同步语音作为跨模态先验的MRI重建方法,通过将每帧图像建模为音频驱动成分与MRI驱动成分的融合,并引入可学习的空间加权图谱以优化螺旋臂采样策略,从而实现音频引导预测、MRI重建与采样适配的统一多模态建模。该方法在保持高吞吐量的同时显著提升了重建图像的解剖合理性,为多模态语音引导的rtMRI重建提供了首个基准。
链接: https://arxiv.org/abs/2605.18221
作者: Md Hasan,Nyvenn Castro,Daiqi Liu,Lukas Mulzer,Jana Hutter,Jonghye Woo,Moritz Zaiss,Andreas Maier,Paula A. Perez-Toro
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希-亚历山大大学); University Hospital Erlangen (埃尔朗根大学医院); Leibniz Universität Hannover (汉诺威莱布尼茨大学); Harvard Medical School (哈佛医学院); Massachusetts General Hospital (麻省总医院)
类目: ound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
备注:
Abstract:Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at this https URL
[NLP-38] Leverag ing Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction ESWC2026
【速读】: 该论文旨在解决知识图谱中链接预测(link prediction)任务的性能瓶颈问题,尤其是现有序列到序列(Seq2Seq)模型因仅依赖实体和关系的表面文本描述、且将查询实体邻域简化为线性序列而忽略图结构信息所导致的表达能力不足。其解决方案的关键在于提出Graph-Augmented Sequence-to-Sequence (GA-S2S) 框架,该框架通过将T5-small编码器-解码器与关系图注意力网络(Relational Graph Attention Network, RGAT)相结合,同时编码查询实体周围k跳子图的拓扑结构和原始文本特征,并利用RGAT生成的关系感知嵌入增强编码器输出,从而有效捕捉多跳关系模式与文本信息的协同作用。实验表明,GA-S2S在CoDEx数据集上相比基线模型实现了最高达19%的相对准确率提升。
链接: https://arxiv.org/abs/2605.18211
作者: Luu Huu Phuc,Ratan Bahadur Thapa,Mojtaba Nayyeri,Jingcheng Wu,Evgeny Kharlamov,Steffen Staab
机构: Analytic Computing, KI, University of Stuttgart (斯图加特大学); Bosch Center for Artificial Intelligence (博世人工智能中心); WAIS, University of Southampton (南安普顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure, 2 tables. Preprint of a paper accepted at the 5th Workshop on LLM-Integrated Knowledge Graph Generation from Text (TEXT2KG), co-located with ESWC 2026, May 10–14, 2026, Dubrovnik, Croatia
Abstract:We introduce Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve link prediction in knowledge graphs. While existing Seq2Seq models rely solely on surface-level textual descriptions of entities and relations and at best, flatten the neighborhoods of a query entity into a single linear sequence, thereby discarding the inherent graph structure, GA-S2S jointly encodes both textual features and the full k -hop subgraph topology surrounding the query entity. By integrating raw encoder outputs with RGAT’s relation-aware embeddings, our model captures and leverages richer multi-hop relational patterns and textual information. Our preliminary experiments on the CoDEx dataset demonstrate that GA-S2S outperforms competitive Seq2Seq-based baseline models, achieving up to a 19% relative gain in link prediction accuracy.
[NLP-39] Scalable Environments Drive Generalizable Agents
【速读】: 该论文试图解决的问题是:当前通用智能体(generalizable agents)在面对训练分布之外的多样化任务和未知环境时,表现出适应性不足的问题。现有方法主要通过增加轨迹或任务数量来实现“规模扩展”(scaling),但忽略了环境中可执行规则集(executable rule-set)的变化,导致智能体在接口、动态特性、观测方式或反馈信号发生改变时变得脆弱。解决方案的关键在于引入“环境缩放”(environment scaling)——即系统性地扩展智能体所交互的可执行规则集分布,而不仅仅是扩大任务或轨迹范围。作者提出一个统一的分类法,区分轨迹缩放、任务缩放与环境缩放,并进一步对比了基于程序生成器(programmatic generators)与生成式世界模型(generative world models)的环境构建范式,强调将环境缩放与状态感知的学习机制(如跨环境自适应的更新规则)相结合,以推动具备鲁棒性的通用智能体的发展。
链接: https://arxiv.org/abs/2605.18181
作者: Jiayi Zhang,Fanqi Kong,Guibin Zhang,Maojia Song,Zhaoyang Yu,Jianhao Ruan,Jinyu Xiang,Bang Liu,Chenglin Wu,Yuyu Luo
机构: HKUST(GZ); DeepWisdom; PKU; NUS; SUTD; UdeM Mila
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.
[NLP-40] RACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)在推理过程中产生的幻觉(hallucination)问题,尤其关注现有方法对中间层事实性证据动态变化的忽视。传统干预策略通常采用固定形式,如对比不同层、沿单一可信度方向调整或依赖外部证据,但这些方法无法适应跨层候选生成轨迹的复杂性——某些情况下真相信息存在于早期层却被后期抑制,而另一些情况下多方向竞争持续存在,单一标量修正操作不足以应对。解决方案的关键在于提出一种名为TRACE(Trajectory Correction from Cross-layer Evidence)的确定性、无需训练的推理时修正算法:它通过分析输入在LLM前向传播中各层的候选轨迹,自动选择最优修正层和对应的操作(包括标量反转、早期状态恢复或候选空间修正),完全基于模型内部证据实现自适应纠错。实验表明,TRACE在15个模型、8个模型家族和3个事实性基准上均取得一致提升,平均MC1得分提高+12.26,MC2-style得分提高+8.65,且无任何退化,验证了其通用性和有效性。
链接: https://arxiv.org/abs/2605.18163
作者: Tej Sanibh Ranade
机构: Independent Researcher
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 25 pages, 8 figures, 4 tables
Abstract:Hallucination correction is not a one-direction problem. We show that intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy. Yet hallucination reduction is usually instantiated through one fixed intervention form: contrast one layer against another, steer along a truthfulness direction, or defer to external evidence. This framing is structurally incomplete. Cross-layer factual evidence does not evolve uniformly: in some failures truthful support is present internally and later suppressed, whereas in others candidate competition remains genuinely multi-directional across depth, so no single signed scalar family is generally sufficient. We introduce Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (TRACE), a deterministic, training-free algorithm which corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input’s cross-layer candidate trajectory inside the LLM’s own forward pass. Under one frozen hyperparameter setting, TRACE selects among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence. Evaluated as a single universal algorithm across 15 models, 8 model families, and 3 factuality benchmarks, TRACE improves every evaluation cell, yielding mean gains of +12.26 MC1 points and +8.65 MC2-style points with no regressions, with gains reaching +47.20 MC1 and +43.38 MC2-style points. The method uses no labels, retrieval, pretraining, finetuning, or per-model calibration.
[NLP-41] FOL2NS: Generating Natural Sentences from First-Order Logic
【速读】: 该论文试图解决的形式语言到自然语言的转换问题,即如何将一阶逻辑(First-Order Logic, FOL)公式准确且自然地转化为人类可读的句子表达,尤其针对具有复杂嵌套结构和不同量词深度(Quantifier Depth, QD)的场景。现有语料库难以覆盖此类结构,导致生成结果在语义精确性和自然度方面存在不足。解决方案的关键在于提出一种神经符号框架FOL2NS,通过规则驱动模块与微调语言模型的结合,增强生成样本的多样性与覆盖范围,从而提升对复杂结构的建模能力。实验表明,FOL2NS能够稳定生成语法正确、语义合理的模板和流畅语句,但在结构复杂性增加时仍面临语义精准表达和自然生成的挑战。
链接: https://arxiv.org/abs/2605.18155
作者: Mei Jia
机构: University of Manchester (曼彻斯特大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 8 figures
Abstract:Translating formal language into natural language is a foundational challenge in NLP, driving various downstream applications in semantic parsing, theorem validation, and question answering. In this study, we introduce First-Order Logic to Natural Sentence (FOL2NS), a neurosymbolic framework designed to generate synthetic FOL formulas and convert them into natural human expressions. It handles deeply nested structures with varying quantifier depths (QD), which are rarely captured by existing corpora. By combining rule-driven modules with fine-tuned language models, FOL2NS enhances the diversity and coverage of the generated samples. In our experiments, we systematically evaluate the framework’s capabilities through both character-level analysis and overall performance metrics. Experimental results show that FOL2NS can reliably produce well-formed templates and fluent statements, but it faces challenges in achieving precise semantic representations and natural generation as structural complexity increases.
[NLP-42] POE: Interpretable Prompt Optimization via Explanations
【速读】: 该论文试图解决的问题是:当前提示优化(prompt optimization)通常被视为一个离散搜索问题,旨在找到高性能且鲁棒的指令,但这种方法缺乏透明性,无法解释为何特定的提示修改能够带来性能提升。相比之下,人类在标注任务中通过精心设计的标注指南(annotation guidelines)来提高一致性,这种做法具有可解释性。为弥合这一差距,论文提出了一种名为iPOE(interpretable Prompt Optimization via Explanations)的新方法,其关键在于利用自动产生的标注决策解释(either from LLMs or humans)生成指导性规则,并通过一系列操作(如删除、添加、重排序和合并)对这些规则进行优化,从而构建出包含明确指导语的提示。这不仅提升了模型性能(实验显示比无指导提示和随机选择指导提示分别提升最高达31%和35%),还使整个优化过程对非专业用户也具备可理解性和可操作性,尤其适用于需要专业知识的复杂领域。
链接: https://arxiv.org/abs/2605.18113
作者: Jiahui Li,Sean Papay,Roman Klinger
机构: University of Bamberg, Germany(巴伐利亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Prompt optimization has often been framed as a discrete search problem to find high-performing and robust instructions for an LLM. However, the search result might not make it transparent why and where specific prompt changes lead to performance gains. This is in contrast to how humans are instructed for annotation tasks. Here, researchers carefully design annotation guidelines, leading to enhanced annotation consistency. Our paper aims at joining these two approaches and introduces iPOE, a novel interpretable prompt optimization strategy via explanations. We guide the prompt optimization process by automatically created guidelines from explanations of annotation decisions (either automatically generated or from humans). This set of guidelines is furthermore optimized by as series of operations, including removing, adding, shuffling, and merging. The resulting prompt includes guidelines that instruct the annotation, making the decision process of the LLM and the optimization transparent. It therefore supports also laypeople in the area of prompt optimization, particularly in challenging domains requiring expertise. In our experiments on four datasets, we find that iPOE can improves over prompts without guidelines and with random selected guidelines by up to 31% and 35% , respectively. Moreover, LLM explanations can replace human explanations in the proposed method.
[NLP-43] How Good LLM s Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking ALT AAAI
【速读】: 该论文试图解决的问题是:缺乏针对孟加拉语(Bangla)的医学视觉问答(MedVQA)基准数据集,导致低资源语言在医疗多模态任务中的评估与模型优化受限。解决方案的关键在于构建并发布首个面向孟加拉语的医学视觉问答数据集 BanglaMedVQA,该数据集包含临床验证的图像-问题-答案对,并通过系统性评估当前主流基础模型(如 Gemini、GPT-4.1 mini 和 Gemma-3)在该数据集上的表现,揭示了现有模型在低资源语言环境下进行精细医学推理能力的严重不足,从而凸显了开发高质量评估方法和针对性改进策略的紧迫性。
链接: https://arxiv.org/abs/2605.18111
作者: Rafid Ahmed,Intesar Tahmid,Mir Sazzat Hossain,Tasnimul Hossain Tomal,Md Fahim,Md Farhad Alam Bhuiyan
机构: 1. University of Dhaka (达卡大学); 2. Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures, 5 tables, Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:1-14, 2026
Abstract:Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation method.
[NLP-44] How Loud Rumbles Hit Newsstands: A Data Analysis of Coverag e and Spatial Bias in German News about Landslides Around the World
【速读】: 该论文试图解决的问题是如何系统性地从德国报纸中提取、地理定位并验证全球滑坡事件的新闻报道,以构建高质量的灾害数据库,并揭示媒体对国际灾害关注的不平等现象。解决方案的关键在于:首先,对近6万篇关于5500个滑坡事件的新闻文章进行大规模筛选与地理编码;其次,将这些数据与外部衡量各国滑坡易发性的指标进行对比分析,从而识别出如南欧和西欧地区存在显著高报现象,为后续研究媒体关注度差异提供了可量化的实证基础。
链接: https://arxiv.org/abs/2605.18105
作者: Brielen Madureira,Andreas Niekler,Marc Keuschnigg,Mariana Madruga de Brito
机构: LeipzigLab – Climate Discourse, Leipzig University, Germany; Helmholtz Centre for Environmental Research, Germany; Computational Humanities, Leipzig University, Germany; Institute of Sociology, Leipzig University, Germany; Institute for Analytical Sociology, Linköping University, Sweden
类目: Computation and Language (cs.CL)
备注: Work in progress
Abstract:Landslides often hit newsstands due to their destructive and potentially fatal effects. News are a valuable source of information for creating or enriching disaster databases and for expediting media-based studies of the dynamics of media attention. To accomplish that, news datasets must be filtered, geolocated and validated. This paper focuses on how landslides around the world are reported in German newspapers. We analyse almost 60k news articles about 5.5k news events in a 25-year period, compare it with external measures of countries’ susceptibility to landslides and provide insights, e.g.~the overreporting of Southern and Western Europe, to foment further studies on inequalities in media attention to international disasters.
[NLP-45] A Data-Efficient Path to Multilingual LLM s: Language Expansion via Post-training PARAMΔ Integration into Upcycled MoE
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在扩展至新语言时面临的高昂成本问题,尤其是持续预训练(Continued Pre-Training, CPT)和数据密集型对齐(alignment)阶段带来的资源消耗与性能权衡。现有无数据合并技术虽能绕过对齐步骤,但存在关键矛盾:为缓解参数冲突以保留原语言能力会导致新语言习得能力下降,反之亦然。论文提出的解决方案核心在于将稠密模型升级为专家混合(Mixture-of-Experts, MoE)架构,通过分配不同专家模块给不同语言来实现语言特异性表征;随后通过嫁接一个MoE扩展的参数增量(\Delta_\textpost)到CPT增强的基础模型上,直接转移对齐能力,从而跳过复杂的对齐过程。实验证明,该方法在FLOPs或参数量相近的基线中表现更优,不仅显著提升新增语言性能,还能有效保留原始语言能力,并展现出跨多种模型和后训练参数增量的良好适应性。
链接: https://arxiv.org/abs/2605.18083
作者: Hao Zhou,Tianhao Li,Zhijun Wang,Shuaijie She,Linjuan Wu,Hao-Ran Wei,Baosong Yang,Jiajun Chen,Shujian Huang
机构: Nanjing University (南京大学); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Expanding Large Language Models~(LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training~(CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice-versa. To resolve this conflict, we introduce \method, which upcycles a dense model into a Mixture-of-Experts~(MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting a MoE-expanded parameter delta~( \Delta_\textpost ) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate \method’s superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and Post-training deltas.
[NLP-46] he Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought ICML2026
【速读】: 该论文试图解决现有关于Transformer表达能力的理论结果与实际模型之间存在的脱节问题,即多数研究依赖于hardmax注意力机制、高精度计算等不切实际的假设,而无法反映真实场景中使用softmax注意力和量化激活/注意力权重的Transformer。其解决方案的关键在于构建一种基于ternary(三值)激活和分离注意力得分的hardmax Transformer,该结构能够模拟图灵机(Turing machines),并通过Chain-of-Thought(CoT)推理路径实现可转换为标准softmax Transformer的等效构造,从而避免了先前方法所需的极端参数规模或高精度要求。进一步地,作者将此技术扩展至最近提出的“摘要式CoT”范式,证明其在空间复杂度上更高效——模型规模仅随空间边界对数增长,而非时间边界。实验部分在Sudoku推理任务上验证了理论预测,结果显示该框架对可学习性的预测优于以往高精度理论结果。
链接: https://arxiv.org/abs/2605.18079
作者: Moritz Brösamle,Stephan Eckstein
机构: 未知
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL)
备注: Accepted to ICML 2026
Abstract:Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at this https URL.
[NLP-47] KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
【速读】: 该论文旨在解决长上下文大语言模型(Large Language Models, LLMs)在推理过程中因键值(Key-Value, KV)缓存占用大量显存而导致的性能瓶颈问题。现有缓存卸载系统虽将完整KV缓存存储于主机内存并按需调取关键条目,但受限于稀疏性提升的物理极限,随着上下文长度和批处理规模增长,KV数据传输量急剧上升,成为解码延迟的主要来源。KVDrive提出了一种跨GPU显存、主机DRAM与SSD的多层级KV缓存管理方案,其核心创新在于从系统层面协同优化缓存放置策略、流水线调度以及跨层级数据协调:首先,根据注意力机制行为动态调整缓存布局以最大化复用并减少冗余移动;其次,重构解码流水线以重叠I/O与CPU/GPU计算阶段,消除异构资源间的等待空闲;最后,统一各内存层级间的数据流动,实现远超GPU和DRAM容量限制的可扩展长上下文推理。实验表明,KVDrive在主流LLM长上下文基准测试中相较当前最优方案实现了最高1.74倍的吞吐量提升且保持精度不变。
链接: https://arxiv.org/abs/2605.18071
作者: Jian Lin,Jiazhi Mi,Zicong Hong,Haodong Wang,Qianli Liu,Haodyue Zhang,Peng Li,Song Guo
机构: Hong Kong University of Science and Technology (香港科技大学); Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective - jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O- and CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74x higher throughput compared to state-of-the-art works while preserving accuracy.
[NLP-48] PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence
【速读】: 该论文旨在解决边缘设备上部署个性化大语言模型(LLM)代理时,如何实现用户间基于代理专长的对等(P2P)协作问题。现有系统难以应对动态变化的代理池及其可替换能力带来的匹配与负载均衡挑战。解决方案的关键在于提出两个核心机制:一是基于原型的可扩展查询-代理对评分机制,用于在高 churn 环境下精准识别合适代理;二是多代理互操作贝叶斯博弈模型,以在远程代理负载变化过快无法实时观测的情况下,平衡本地需求与全局效率。实验表明,所提出的 PPAI 系统显著扩展了可执行任务范围,在保持负载均衡的同时,平均任务准确率提升达 7.96%,延迟降低 16.34%。
链接: https://arxiv.org/abs/2605.18067
作者: Zile Wang,Qianli Liu,Kaibin Guo,Haodong Wang,Jian Lin,Zicong Hong,Song Guo
机构: Hong Kong Polytechnic University (香港理工大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Deploying large language model (LLM) on edge device enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer-to-peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent’s expertise to remote agents more suited for the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, the ever-changing pool of agents and their interchangeable capacity introduce new challenges when it comes to matching queries to agents and balancing loads, compared with existing P2P systems. Therefore, we propose a scalable query-agent pair scoring mechanism based on prototypes to identify suitable agents within a P2P network with churn. Moreover, we propose a multi-agent interoperability Bayesian game to balance local demand and global efficiency, when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that could be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.
[NLP-49] Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction
【速读】: 该论文旨在解决在共享全局缓存容量限制下的键值缓存(KV cache)驱逐策略所面临的性能崩溃问题,特别是针对纯Transformer模型在提示边界(prompt-boundary)处因缺乏结构保护而导致的生成质量急剧下降(F1 ≤ 0.064)。其解决方案的关键在于引入边界保护机制——即在每个提示边界预留10%的缓存空间,从而有效恢复模型在低缓存容量(如C=256)下的性能至参考天花板质量的69–90%。研究进一步表明,这种保护机制显著削弱了不同驱逐策略(如LRU、Ada-KV、QUEST等)之间的评分差异,使简化版的隔离评分策略在K=32时与LRU相当(Δ=0.02),而在K=8时优于LRU(F1提升0.011–0.021)。此外,忠实实现的Ada-KV/QUEST还能额外带来约0.03–0.04的F1增益;在长序列(如NIAH-32K)场景下,边界保护同样具有一致有效性,而每头分配(per-head allocation)则可提供进一步小幅收益。总体而言,边界保护是性能恢复的核心因素,评分机制的作用次之。
链接: https://arxiv.org/abs/2605.18053
作者: Gabriel Garcia
机构: Independent Researcher
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Performance (cs.PF)
备注: 38 pages, 6 figures, 25 tables (includes one longtable). Code and figure regeneration scripts: this https URL
Abstract:We study KV cache eviction under a shared globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, they collapse to near-zero quality on six pure-transformer models (F1 \leq 0.064). Reserving 10% of cache at each boundary recovers 69–90% of the C=2,048 reference-ceiling quality on seven LongBench models at C=256 (13% retention); a ten-model panel spans 68–98%. An attention-mass pilot (Qwen2.5-3B, N=30 ) suggests why: the position-0 sink holds \sim75% of prefix mass, while other boundary tokens sit near \sim0.41\times uniform expectation, so attention scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at K=32 ( \Delta=0.02 ); at K=8 , attention policies pairwise converge yet beat LRU by 0.011–0.021 F1 across C=256 and C=512 . Faithful Ada-KV/QUEST add \sim0.03 – 0.04 F1 on Mistral-7B and Phi-3.5 beyond simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs.\ prefill, C\in\512,2048\ ) shows near-identical protection lifts (ratio 0.99–1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches full-cache ceiling on Gemma-3-4B at 6.3% retention only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.
[NLP-50] Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling ACL2026
【速读】: 该论文试图解决的问题是:当前语言模型(LMs)在处理修辞角色标注(Rhetorical Role Labeling, RRL)任务时,虽然平均性能表现良好,但在困难样本上预测置信度低、可靠性差的问题。现有方法通常隐式处理不确定性,并将标签视为离散标识符,忽略了标签名称中蕴含的语义信息。解决方案的关键在于提出RISE(Reranking Inference with Semantic Embeddings)——一个推理阶段的语义重排序框架,它通过对比学习获得标签表示,自动识别低置信度预测并基于标签语义重新排序输出,无需修改或重新训练底层模型。实验表明,RISE在8个领域特定RRL数据集上对困难样本的宏F1值平均提升9.15点,显著改善了模型在边界案例上的鲁棒性。
链接: https://arxiv.org/abs/2605.18007
作者: Anas Belfathi,Nicolas Hernandez,Laura Monceaux,Warren Bonnard,Richard Dufour
机构: Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France; University of Lorraine, France
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 (Main Conference)
Abstract:Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, where prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, revealing a moderate agreement with Cohen’s kappa = 0.40.
[NLP-51] Bridging the Gap: Converting Read Text to Conversational Dialogue
【速读】: 该论文试图解决的问题是:如何将朗读语音(read speech)转换为具有自然语调和韵律的对话式语音(conversational speech),同时在保证语音自然度和可懂度的前提下,降低计算开销以适应实时应用场景。传统方法难以保留对话中必要的韵律变化,导致生成语音缺乏真实感,限制了其在虚拟助手、客户服务和语言学习工具等领域的应用。解决方案的关键在于提出一种名为“基于对话上下文的韵律调整”(Prosodic Adjustment with Conversational Context, PACC)的新方法,该方法利用深度神经网络分析并调整语调、重音和节奏等韵律特征,并结合高保真生成对抗网络(High-Fidelity Generative Adversarial Networks, HiFi-GAN)进行语音合成,从而显著提升转换后的语音自然度与模型准确性。实验结果表明,PACC在MOS(Mean Opinion Score)评估中优于现有方法,且具备良好的扩展性,适用于多种语音转换任务。
链接: https://arxiv.org/abs/2605.18001
作者: Parshav Singla,Agnik Banerjee,Aaditya Arora,Shruti Aggarwal,Anil Kumar Verma,Vikram C M,Raj Prakash Gohil,Gopal Kumar Agarwal
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures. Published in ICICC 2025, Springer Lecture Notes in Networks and Systems
Abstract:In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing computational overhead for real-time applications. Traditional read speech often lacks the nuanced prosodic variation essential for natural conversational interactions, posing challenges for applications in virtual assistants, customer service, and language learning tools. This paper introduces a novel approach, Prosodic Adjustment with Conversational Context (PACC), aimed at converting read speech into natural conversational speech used in various modern applications. PACC utilizes advanced deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm. Unlike conventional methods, our approach uses High-Fidelity Generative Adversarial Networks (HiFi-GAN) for speech synthesis. Our experimental results demonstrate significant improvements in speech conversion, enhancing naturalness and achieving better model accuracy with additional training on speech datasets. This research establishes new benchmarks in speech conversion tasks and Mean Opinion Score (MOS) evaluation for testing model accuracy, and we show that our approach can be successfully extended to other speech conversion applications.
[NLP-52] Predictive Prefetching for Retrieval-Augmented Generation ICML2026
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)模型在实际应用中因同步检索导致的高延迟问题。现有异步检索方法依赖启发式协调机制,且假设生成过程中信息需求稳定,这在多领域复杂场景下难以成立。论文提出了一种先进的异步检索框架,其核心创新在于通过三个组件——检索预测器(retrieval predictor)、上下文监控器(context monitor)和查询生成器(query generator)——实现与生成动态中语义前兆(semantic precursors)对齐的预测性预取(predictive prefetching)。该框架能够提前数个token识别出检索触发时机及所需信息,从而显著降低延迟:实验表明,相比同步RAG基线,该方法可实现高达43.5%的端到端延迟减少和62.4%的时间到首个token(time-to-first-token)提升,同时保持相当的答案质量。
链接: https://arxiv.org/abs/2605.17989
作者: Wuyang Zhang,Shichao Pei
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by Forty-third International Conference on Machine Learning ICML 2026
Abstract:Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.
[NLP-53] AutoVecCoder: Teaching LLM s to Generate Explicitly Vectorized Code
【速读】: 该论文试图解决的问题是:如何利用大语言模型(LLM)实现高效的显式向量化(explicit vectorization),以克服传统自动向量化(auto-vectorization)因保守静态分析而导致的性能瓶颈。当前LLM在生成代码时难以处理低级硬件指令(如SIMD指令),主要受限于高质量代码语料库稀缺和向量指令严格的语义约束。解决方案的关键在于提出AutoVecCoder框架,其核心创新包括:(1) VecPrompt——一种自动化数据合成管道,用于注入领域特定的向量指令知识;(2) VecRL——一种强化学习框架,使生成代码与执行效率对齐。通过该框架训练的AutoVecCoder-8B模型在SimdBench的SSE和AVX子集上达到最先进性能,并在某些场景下超越标准-O3优化,从而有效实现了高效、可落地的显式向量化。
链接: https://arxiv.org/abs/2605.17978
作者: Shangzhan Li,Xinyu Yin,Xuanyu Jin,Ye He,Yuxin Zhou,Yuxuan Li,Xu Han,Wanxiang Che,Qi Shi,Ting Liu,Maosong Sun
机构: Harbin Institute of Technology (哈尔滨工业大学); Xiamen University (厦门大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.
[NLP-54] BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting KDD2026
【速读】: 该论文试图解决量化回测(quantitative backtesting)在自动化过程中面临的高技术门槛和可扩展性不足的问题。当前,尽管大型语言模型(Large Language Models, LLMs)具备通过代码生成、工具调用和代理规划实现该复杂跨学科流程自动化的潜力,但缺乏一个大规模基准数据集严重制约了该方向的发展。为此,作者提出了BacktestBench——首个面向自动化量化回测的大规模基准,其基于超过600万条真实市场记录构建,包含18,246个精心标注的问答对,覆盖指标计算、标的筛选、策略选择与参数确认四类任务。解决方案的关键在于提出AutoBacktest这一多智能体基线框架,它通过Summarizer提取语义因子、Retriever生成验证过的SQL查询、Coder实现Python回测逻辑,从而将自然语言交易策略转化为可复现的回测结果。实验评估23个主流LLM并辅以消融分析,揭示了端到端性能的关键影响因素,强调了基于事实的验证机制和标准化指标表示的重要性。
链接: https://arxiv.org/abs/2605.17937
作者: Zhensheng Wang,Wenmian Yang,Qingtai Wu,Lequan Ma,Yiquan Zhang,Weijia Jia
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by KDD 2026 (Datasets and Benchmarks Track)
Abstract:Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.
[NLP-55] Universal Adversarial Triggers
【速读】: 该论文旨在解决当前通用对抗攻击(universal adversarial attacks)中生成的触发词(trigger sequence)缺乏语法正确性和自然性的问题,从而使得攻击更隐蔽且更具实际威胁。解决方案的关键在于提出一种结合词性过滤(parts-of-speech filtering)与基于困惑度(perplexity)的损失函数的新方法,以生成语义合理、接近自然语言的触发词。实验表明,在SST情感分析数据集上,该方法可生成具有高攻击效果的自然触发词,使模型准确率降至0.04至0.12;同时通过使用这些触发词进行对抗训练,模型鲁棒性显著提升,准确率从0.12提高到0.48,验证了该方法在增强模型防御能力方面的有效性。
链接: https://arxiv.org/abs/2605.17936
作者: Benedict Florance Arockiaraj,Alexander Feng,Jianxiong Cai,Xiaoyu Cheng
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent works have illustrated that modern NLP models trained for diverse tasks ranging from sentiment analysis to language generation succumb to universal adversarial attacks, a class of input-agnostic attacks where a common trigger sequence is used to attack the model. Although these attacks are successful, the triggers generated by such attacks are ungrammatical and unnatural. Our work proposes a novel technique combining parts-of-speech filtering and perplexity based loss function to generate sensible triggers that are closer to natural phrases. For the task of sentiment analysis on the SST dataset, the method produces sensible triggers that achieve accuracies as low as 0.04 and 0.12 for flipping positive to negative predictions and vice-versa. To build robust models, we also perform adversarial training using the generated triggers that increases the accuracy of the model from 0.12 to 0.48. We aim to illustrate that adversarial attacks can be made difficult to detect by generating sensible triggers, and to facilitate robust model development through relevant defenses.
[NLP-56] Prompt Compression in Diffusion Large Language Models : Evaluating LLM Lingua-2 on LLaDA
【速读】: 该论文试图解决的问题是:当前主流的提示压缩(prompt compression)方法主要针对自回归大语言模型(autoregressive large language models)设计,但其在扩散式大语言模型(diffusion large language models, DLLMs)中的有效性尚未明确。研究发现,尽管压缩后的提示在语义上保持相似性,但在扩散模型中仍可能导致下游任务性能显著下降,尤其在数学推理等任务中表现不稳定。解决方案的关键在于揭示了提示压缩在扩散模型中存在“语义保留 ≠ 下游行为稳定”的现象,并通过实验证明压缩失败的主要原因是关键信息的遗漏而非语义偏移(information omission rather than semantic drift)。具体而言,BERTScore结果显示召回率始终低于精确率,表明压缩过程更易丢失对扩散去噪过程至关重要的推理信息。因此,论文强调需开发面向扩散模型特性的压缩策略(diffusion-aware compression strategies),以实现高效且稳定的推理性能。
链接: https://arxiv.org/abs/2605.17932
作者: Sterling Huang,Abigayle Brown,Jiyoo Noh,Jiakang Xu,Wantong Huo,Kaung Myat Kyaw,Jonathan Chan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2 \times compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.
[NLP-57] A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration
【速读】: 该论文试图解决的问题是:在行星探索等极端通信受限、缺乏全局定位且需最小人工干预的环境中,如何将高阶任务知识(如任务目标、操作约束和环境变化)从自然语言(NL)形式转化为结构化的、机器可解释的逻辑表示,以支持自主机器人代理的推理能力。解决方案的关键在于构建一个面向行星探索领域的自然语言到一阶逻辑(First-Order Logic, FOL)的初步基准数据集,该数据集基于NASA行星数据系统(PDS)中2003至2013年间真实任务文档,手动标注了包含时序结构、代理角色和操作依赖关系的FOL表达式,并提供了结构化的谓词词汇表与类型化常量,从而为语言理解与形式推理交叉研究提供了一个基于实际、安全关键任务数据的实验基础。
链接: https://arxiv.org/abs/2605.17911
作者: Hayden Moore,Suman Saha,Mahfuza Farooque
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA’s Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: this https URL Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.17911 [cs.CL] (or arXiv:2605.17911v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.17911 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-58] Multi-agent AI systems outperform human teams in creativity
【速读】: 该论文试图解决的问题是:如何提升人工智能系统在创造性任务中的表现,尤其是在与人类团队对比时,探索多智能体大语言模型(multi-agent LLM teams)是否具备超越人类的创造力,并揭示其背后的生成机制。解决方案的关键在于:通过构建多智能体LLM团队,在六类多样化问题解决任务中显著优于单智能体和人类团队(Cohen’s d=1.50),且优势主要来源于更高的新颖性(novelty)而非有用性(usefulness)的下降;进一步地,研究发现LLM团队的创造性源于高效探索(高语义扩散、路径更短),而人类团队则依赖于流畅的对话结构(高局部一致性、频繁话题转换),并识别出模型选择和讨论结构是影响LLM对话动态的两个独立设计变量,共同解释了26.8%的方差,为系统性增强多智能体系统的创造力提供了可操作的设计框架。
链接: https://arxiv.org/abs/2605.17885
作者: Tiancheng Hu,Yixuan Jiang,Haotian Li,José Hernández-Orallo,Xing Xie,Nigel Collier,David Stillwell,Luning Sun
机构: University of Cambridge (剑桥大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen’s d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.
[NLP-59] HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
【速读】: 该论文试图解决长时程大语言模型(LLM)智能体在强化学习训练中因稀疏结果奖励导致的策略优化困难问题,即奖励信号仅能指示任务最终成败,而无法明确指出哪些中间动作对结果产生影响或如何修正。现有方法通过逐轮生成奖励或文本提示,或采用反馈条件下的自蒸馏(feedback-conditioned self-distillation)来缓解此问题,但存在效率低下和监督不精准的问题——例如,在大量成功或中性动作上重复生成反馈会浪费计算资源,且固定或错位的反馈时机难以聚焦于真正导致失败的动作。论文提出HINT-SD框架,其核心创新在于利用完整轨迹的“事后视角”(full-trajectory hindsight)精准识别与失败相关的动作片段,并仅在这些目标动作区间应用反馈条件蒸馏,从而实现更高效、更有效的长时程智能体训练。实验表明,该方法相比密集逐轮反馈基线提升高达18.80%,同时每训练步骤耗时降低2.26倍,验证了“选择何时蒸馏”是提升训练效果与效率的关键因素。
链接: https://arxiv.org/abs/2605.17873
作者: Woongyeng Yeo,Yumin Choi,Taekyung Ki,Sung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26 \times lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
[NLP-60] PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions LREC2026
【速读】: 该论文旨在解决现代自动语音识别(ASR)系统在面对真实世界语音变异性时性能下降的问题,特别是由口音差异、非结构化口语表达及领域特定语言所引起的挑战。其解决方案的关键在于构建了一个名为PAREDA(PAper REading DAtaset)的多口音语音数据集,该数据集包含具有澳大利亚、印度英语和中文英语口音的讲者围绕自然语言处理(NLP)学术论文进行的讨论,涵盖自发性独白(论文摘要)与问答对话两种语境,从而引入丰富的技术术语和会话现象。实验表明,在零样本设置下,现有最优ASR模型表现较差,验证了数据集的挑战性;而通过在PAREDA上微调模型可显著降低词错误率(WER),证明该数据集能有效捕捉现有语料库中缺失的语言特征,为构建更鲁棒且包容性强的专用场景ASR系统提供了重要资源。
链接: https://arxiv.org/abs/2605.17860
作者: Sicheng Jin,Dipankar Srirag,Aditya Joshi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted and presented at SPEAKABLE 2026 workshop at LREC 2026
Abstract:While modern Automatic Speech Recognition (ASR) systems achieve high accuracy on benchmark corpora, their performance often degrades when there is real-world variability. This work focuses on variability arising due to accented, spontaneous, and domain-specific speech. In particular, we introduce PAper REading DAtaset (PAREDA), a first-of-its-kind multi-accent speech dataset consisting of discussions on academic Natural Language Processing (NLP) papers between speakers with Australian, Indian-English, and Chinese English accents. Each session elicits a spontaneous monologue (a summary of a paper’s abstract) and a non-monologue (a question-and-answer session between participants), resulting in a corpus rich with technical jargon and conversational phenomena. We evaluate the performance of SOTA ASR models on PAREDA, analysing the impact of accent mixing and increased speech rate. Our results show that, in the zero-shot setting, models perform worse, confirming the dataset’s challenging nature. However, fine-tuning on PAREDA significantly reduces the Word Error Rate (WER), demonstrating that our dataset captures linguistic characteristics often missing from existing corpora. PAREDA serves as a valuable new resource for building and evaluating more robust and inclusive ASR systems for specialised, real-world applications.
[NLP-61] Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)预训练从计算密集型向数据密集型转变过程中,可用人类生成文本(organic text)远不足以满足模型规模扩展需求的问题。其核心挑战在于:即便在数据受限的情况下,模型也未充分学习已有有机数据中的知识。解决方案的关键是提出 SynPro 框架,通过两个无外部信息引入的操作——重述(rephrasing)和重新格式化(reformat)——以多样化形式呈现相同源数据,从而促进模型更深入地学习。这两个生成器采用强化学习优化,奖励机制包括质量、忠实性(faithfulness)和数据影响(data influence),并能在预训练达到瓶颈时持续更新,聚焦于模型尚未吸收的内容。实验表明,在仅使用 Chinchilla 最优token数10%的数据下(400M和1.1B参数模型),SynPro 可实现相当于重复训练3.7–5.2倍的有效token利用率,甚至超越非数据受限的基准模型,同时避免分布坍缩,验证了在数据边界条件下提升模型学习效率的有效性。
链接: https://arxiv.org/abs/2605.17849
作者: Zichun Yu,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学); Xlue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at this https URL.
[NLP-62] Remembering More Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
【速读】: 该论文试图解决的问题是:当前对具备记忆能力的大语言模型(LLM)代理的安全评估主要聚焦于单任务内的安全性(如对抗性提示注入或记忆污染),而忽略了在长期部署中,代理因累积多个独立任务的记忆内容而导致后续行为发生不可预测偏移的潜在风险——即“时间维度上的记忆污染”(temporal memory contamination)。解决方案的关键在于提出一种新的评估协议——触发-探测协议(trigger-probe protocol),通过在不同前缀长度下对只读记忆快照进行固定探针测试,并引入“NullMemory”反事实基线来识别由记忆引发的安全违规。实验表明,无论何种记忆架构或代理类型(包括OpenClaw类AI代理),随着记忆暴露时长增加,内存诱导违规率均呈现稳定上升趋势;且顺序随机化实验验证了该效应主要源于累积内容而非交互顺序。此外,研究还发现记忆风险可在生成前从检索状态中被高召回率诊断监控器检测到,从而强调将记忆安全视为需纵向评估的时序属性,而非静态快照可捕捉的单一状态属性。
链接: https://arxiv.org/abs/2605.17830
作者: Ahmad Al-Tawaha,Shangding Gu,Peizhi Niu,Ruoxi Jia,Ming Jin
机构: Virginia Tech; University of California, Berkeley; University of Illinois Urbana-Champaign; Google(谷歌)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non-stationarity, we introduce a trigger-probe protocol that evaluates a fixed probe set against read-only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory-induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw-like AI agents, such as OpenClaw, using the platform’s native memory mechanism. Memory-enabled agents consistently exceed the NullMemory baseline, and memory-induced violation rates show a robust upward trend with exposure length on both agent classes. Order-randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory-induced risk is detectable from retrieval state before generation, which we confirm with a high-recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single-state property that can be captured by a snapshot.
[NLP-63] SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?
【速读】: 该论文试图解决的问题是:当前为单用户对话设计的记忆系统在多人群组社交场景中表现失效,无法有效处理群体共享历史、区分群体规范与个体例外以及成员离群后的正确归属等关键挑战。解决方案的关键在于提出SocialMemBench——一个涵盖五类社会群体原型(亲密朋友、家庭、休闲社群、兴趣社区、熟人网络)和三类规模层级(4–30人)的合成数据集,包含430个角色和7,355轮对话,生成1,031个问答对,用于系统性测试记忆架构的九种能力维度,并识别出五种典型失败模式(如单一流混叠、时间状态覆盖、大规模实体合并错误等)。实验表明,即使使用完整对话上下文的先进模型(如Gemini 2.5 Flash),其准确率也仅为0.721,远低于盲评推理模型的0.98,且主流开源记忆框架(Mem0、LangMem等)性能集中在0.12–0.18区间,显著落后于基线参考模型,揭示了现有记忆系统在多人群组场景中的显著能力缺口。
链接: https://arxiv.org/abs/2605.17789
作者: Olukunle Owolabi
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and correctly attribute even after member departure. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes Subject-Mem and SMG provide evidence on two, three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.
[NLP-64] Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLM s at Million-Note Scale
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成临床文本时缺乏系统性评估的问题,尤其关注其在真实医疗场景中的信息保真度、任务适用性和事实准确性。解决方案的关键在于构建一个涵盖内在特性(intrinsic)、外在效用(extrinsic)和事实正确性(factuality)的多维度评估框架,并基于百万级MIMIC数据库重构的临床笔记进行实证分析。研究发现,尽管LLM生成的病历在粗粒度任务中能保留核心临床信息和预测能力,但在细粒度任务(如ICD编码)中会丢失关键细节;通过分块重写(chunk-based rephrasing)可显著缓解这一问题,但可能因上下文不完整导致事实精度下降;进一步的错误分析表明,合成错误主要源于对临床语境的误读、时间顺序混乱、测量误差及虚构陈述。最终,研究验证了任务无关的合成数据在罕见ICD编码任务中仍具有有效的增量训练价值。
链接: https://arxiv.org/abs/2605.17775
作者: Jinghui Liu,Sarvesh Soni,Anthony Nguyen
机构: Australian e-Health Research Centre, CSIRO (澳大利亚电子健康研究中心,CSIRO); National Library of Medicine, National Institutes of Health (美国国家医学图书馆,美国国立卫生研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect – such as similarity or utility comparisons – even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for task like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes – despite their task-agnostic nature – can effectively augment task-specific training for rare ICD codes.
[NLP-65] Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning
【速读】: 该论文试图解决的问题是:在基于大语言模型(Large Language Models, LLMs)的智能体系统中,当前工具调用(tool-use)流程需要在每个提示(prompt)中包含完整的工具描述(tool schema),导致显著的令牌(token)开销,限制了小型模型的实际应用。解决方案的关键在于通过参数高效微调(parameter-efficient fine-tuning),将工具使用知识内化到小型语言模型的权重中,从而在推理阶段无需显式提供工具描述即可实现结构化的规划能力。作者采用AssetOpsBench作为基准,对Gemma 4E4B和Qwen3-4B模型进行8-bit QLoRA微调,训练数据涵盖工具知识、问题到计划的映射以及执行轨迹,共约1700个样本。实验表明,在无工具描述的推理环境下,微调后的模型相比接收完整工具描述的基线模型,输入长度减少82.6%,同时提升结构化规划质量和LLM评分;其中最佳Gemma模型达到AT-F1为0.65、整体评分3.88,优于基线的0.47和2.88;而Qwen3-4B在保持高评分(3.78)的同时内存占用减少62%、推理速度提升2.5倍,尽管存在一定的灾难性遗忘现象。消融实验进一步揭示LoRA秩(rank)控制着性能与知识保留之间的权衡,r=32时规划质量最优,更低秩则更利于保持通用知识。结果表明,对于固定工具集,QLoRA微调能够将工具知识从提示上下文迁移至模型权重,大幅降低推理开销并维持或提升工具规划质量。
链接: https://arxiv.org/abs/2605.17774
作者: Yuval Shemla,Ayal Yakobe,Tanmay Agarwal
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62% less memory and running 2.5 \times faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality–retention trade-off, with r=32 maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.
[NLP-66] Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂数学与逻辑任务中表现优异时,其内部推理机制难以通过词元级行为分析进行有效理解的问题,以及依赖昂贵外部验证器的强化学习(Reinforcement Learning, RL)优化方法稳定性差的局限性。解决方案的关键在于提出了一种名为“熵-梯度逆向”(Entropy-Gradient Inversion)的新理论发现:即词元熵与logit梯度之间存在稳健的负相关关系,这构成了LRM推理能力的几何指纹特征;并基于此设计了“相关性正则化组策略优化”(Correlation-Regularized Group Policy Optimization, CorR-PO),将该逆向签名嵌入RL奖励正则化中,从而实现更稳定、高效的推理优化。实验表明,CorR-PO在多个模型规模和推理基准上均显著优于现有最优方法,且更强的熵-梯度逆向程度直接对应更高的推理性能。
链接: https://arxiv.org/abs/2605.17770
作者: Junyao Yang,Chen Qian,Kun Wang,Linfeng Zhang,Quanshi Zhang,Yong Liu,Dongrui Liu
机构: National University of Singapore; Renmin University of China; Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; Nanyang Technological University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages, 5 figures, 9 tables
Abstract:The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive fast thinking'' text generation to systematic, step-by-step slow thinking’’ reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces \textitthe fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define \textbfEntropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose \textbfCorrelation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.
[NLP-67] Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction Especially for Rare Codes
【速读】: 该论文试图解决临床编码(Clinical coding)中因国际疾病分类(ICD)版本迭代导致的模型版本依赖问题,以及罕见代码(long-tail problem)性能不足限制实际部署的问题。其解决方案的关键在于:通过融合不同ICD版本(如ICD-9与ICD-10)的标注数据来训练版本无关的编码模型,具体采用改进的按标签注意力(label-wise attention)机制,在仅增加少量参数的情况下,显著提升了罕见代码的微F1分数(提升27%),同时在常见代码上也改善了宏观指标,验证了跨版本训练在提高模型泛化性和实用性方面的可行性。
链接: https://arxiv.org/abs/2605.17755
作者: Jinghui Liu,Anthony Nguyen
机构: Australian e-Health Research Centre, CSIRO (澳大利亚电子健康研究中心,CSIRO)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
[NLP-68] From Documents to Segments: A Contextual Reformulation for Topic Assignment ACL2026
【速读】: 该论文试图解决传统主题建模方法在处理多主题文档时存在的问题,即由于将单一主题分配给整个文档而导致的主题污染(topic contamination),从而影响对特定主题文档的识别与分析。其解决方案的关键在于提出一种基于段落的主题分配方法(segment-based topic allocation, SBTA),将主题分配从文档级别细化到短而连贯的文本片段(segment)级别,每个片段仅表达一个明确的主题。这种方法通过在细粒度层级上建模主题结构,显著提升了主题的清晰度和可解释性,并支持对多主题文档更精准的分析。为验证效果,作者构建了SemEval-STM数据集并设计了段落级别的词入侵任务(segment-level extension of the word intrusion task),实现了对主题一致性在实际分配粒度上的系统评估,实验表明SBTA在聚类质量和可解释性方面均优于现有方法。
链接: https://arxiv.org/abs/2605.17714
作者: Hoonsang Yoon,Takyoung Kim,Wonkee Lee,Ilmin Cho,Dilek Hakkani-Tür,Stanley Jungkyu Choi
机构: LG AI Research; University of Illinois Urbana-Champaign
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2026
Abstract:Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct a SemEval-STM, a new dataset inspired by aspect-based sentiment analysis. Documents are first decomposed into topical segments using large language models (LLMs), followed by human refinement to ensure segment quality. We also propose a segment-level extension of the word intrusion task, enabling human evaluation of topical coherence at the granularity where topics are actually assigned. Across multiple models and evaluation metrics, we show that SBTA improves clustering quality and interpretability. Overall, this work provides a practical, scalable framework for fine-grained topic analysis in heterogeneous text corpora where documents naturally span multiple topics. URL: this https URL
[NLP-69] Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation
【速读】: 该论文旨在解决多语言自动语音识别(ASR)系统在尼日利亚本土语言上性能显著落后于英语和法语等高资源语言的问题。其核心挑战包括数据稀缺、拼写不一致、声调符号、方言差异、频繁的代码混用以及本地化命名实体等问题。解决方案的关键在于提出一种两阶段知识蒸馏框架:首先,利用已有单语模型进行条件化的学生-教师知识蒸馏,并结合鲁棒的语言特定N-gram语言模型;其次,通过伪标签数据迭代自提升进一步优化模型精度。该方法平均相对词错误率(WER)较单语基线降低29%,并在Common Voice和Fleurs等主流基准上优于当前最先进的多语言模型。作者还发布了SBPN(Sometin Beta Pass Notin),一个覆盖约鲁巴语、豪萨语、伊博语、尼日利亚皮钦语和尼日利亚英语的基础多语言ASR模型,包含两个版本(SBPN-Base: 120M参数,SBPN-Large: 600M参数),以促进对非洲地区语音与文化多样性研究的进一步发展。
链接: https://arxiv.org/abs/2605.17710
作者: Sewade Ogun
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 25 pages
Abstract:Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present unique modelling hurdles, including acute data scarcity, inconsistent orthography, tonal diacritics, diverse accents, frequent code-switching, and localized named entities. To address these challenges, we developed a multilingual ASR framework utilizing a two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self improvement using pseudo-labelled data to further refine accuracy. Our method significantly bridges the performance gap, achieving on average a relative Word Error Rate (WER) reduction of 29 % over monolingual baselines. Our models also outperform state-of-the-art multilingual models across major benchmarks, including Common Voice and Fleurs. We introduce Sometin Beta Pass Notin (SBPN), a foundational multilingual ASR model covering Yorùbá, Hausa, Igbo, Nigerian Pidgin, and Nigerian English. SBPN is released in two sizes: SBPN-Base (120 M parameters) and SBPN-Large (600 M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region.
[NLP-70] Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations? ACL2026
【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)在被赋予不同社会地位角色时,是否会表现出与人类相似的权力相关社交认知效应,如语言协调、代词使用、权威偏倚及对不当请求的顺从行为。解决方案的关键在于通过模拟多轮、权力不对称的对话情境(例如校长-教师、法官-律师等职业角色),系统性地测量四个指标:(i)语言协调性、(ii)代词使用模式、(iii)说服成功率以及(iv)对不安全请求的合规程度。研究发现,LLMs确实表现出权力相关的社会认知效应,尽管存在细微差异和可变性,这揭示了其交互行为既可能带来有益影响,也可能引发潜在风险。
链接: https://arxiv.org/abs/2605.17694
作者: Anvesh Rao Vijjini,Sagar Manjunath,Snigdha Chaturvedi
机构: UNC Chapel Hill
类目: Computation and Language (cs.CL)
备注: ACL 2026 (main)
Abstract:Power differences shape human communication through well documented socio cognitive effects, including language coordination, pronoun usage, authority bias, and harmful compliance. We examine whether large language models (LLMs) exhibit similar behaviors when assigned high or low status personas. Using personas from diverse professions, we simulate multi turn, power asymmetric dialogues (e.g., principal teacher, justice lawyer) and measure (i) linguistic coordination, (ii) pronoun usage, (iii) persuasion success, and (iv) compliance with unsafe requests. Our results show that LLMs show key socio cognitive effects of power, albeit with nuances and variability, linking simulated interactions to both desirable and unsafe behaviors.
[NLP-71] Validate Your Authority: Benchmarking LLM s on Multi-Label Precedent Treatment Classification EMNLP
【速读】: 该论文旨在解决法律判例中负面治疗(negative treatment)自动分类这一关键但复杂的自然语言处理(Natural Language Processing, NLP)任务中存在的误分类风险问题。传统以准确率(accuracy)为主的评估方式无法充分反映错误分类在实际法律场景中的严重后果,因此论文提出了一种更鲁棒的评估框架,其关键创新在于引入了“平均严重性误差”(Average Severity Error)这一新指标,用于量化分类错误的实际影响。实验表明,不同大语言模型(Large Language Models, LLMs)在不同粒度的任务上表现各异:Google的Gemini 2.5 Flash在高层级分类任务中表现最优(准确率为79.1%),而OpenAI的GPT-5-mini在细粒度分类任务中领先(准确率为67.7%)。该研究不仅为该法律推理任务建立了基准,还提供了高质量、专家标注的语境丰富数据集和面向实践需求的评估方法。
链接: https://arxiv.org/abs/2605.17691
作者: M. Mikail Demir,M. Abdullah Canbaz
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP
Abstract:Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google’s Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI’s GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.
[NLP-72] Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在生成长链思维(Chain of Thought, CoT)过程中存在“过度思考”问题,即模型在答案已稳定后仍继续冗余推理,导致token浪费和延迟增加。现有推理时早停方法主要依赖答案层面的信号(如置信度或试答一致性),但这些信号仅反映答案准备就绪程度,而非推理收敛状态,易引发过早退出,降低最终答案准确率并破坏推理链语义完整性。解决方案的关键在于引入推理层级语义冗余作为补充信号:当连续推理步骤不再带来新进展、仅重复已有结论时,表明推理轨迹已收敛。基于此,作者提出PUMA框架——一个轻量级冗余检测器与答案级验证机制相结合的即插即用方案,通过识别语义冗余候选退出点并验证其安全性,实现冗余推理的去除,同时保持答案准确性和推理前缀的连贯性。实验表明,PUMA在五种LRM和五个高难度推理基准上平均减少26.2% token消耗,且不牺牲准确率和推理质量,并在代码生成、零样本视觉语言推理及停止策略内化等任务中验证了该信号的鲁棒性、可迁移性和可学习性。
链接: https://arxiv.org/abs/2605.17672
作者: Dehai Min,Giovanni Vaccarino,Huiyi Chen,Yongliang Wu,Gal Yona,Lu Cheng
机构: University of Illinois Chicago (芝加哥大学); Google Research (谷歌); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Politecnico di Milano (米兰理工大学)
类目: Computation and Language (cs.CL)
备注: under review
Abstract:Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at \urlthis https URL.
[NLP-73] Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech LREC2026
【速读】: 该论文试图解决语音摘要(speech summarization)任务中缺乏可靠基准数据集的问题,尤其是在人工标注过程中可能引入系统性错误和偏见的情况下。解决方案的关键在于通过对比十种不同的标注流程(涉及音频、文本或两者结合输入,以及是否采用自编辑或同伴编辑),验证迭代式同伴编辑(iterative peer-editing)在提升音频基摘要质量方面的有效性。研究发现,仅基于音频的摘要比基于文本的摘要信息量更少且压缩程度更高,但通过迭代式同伴编辑可显著改善这一问题,使音频摘要的信息量与文本摘要及大型语言模型(LLM)生成的摘要相当。这表明,在无文本转录可用的情况下,采用迭代同伴编辑的人工标注方法仍能构建高质量、兼顾词汇与韵律信息的基准数据集。
链接: https://arxiv.org/abs/2605.17652
作者: Kaavya Chaparala,Thomas Thebaud,Jesús Villalba López,Laureano Moro-Velazquez,Peter Viechnicki,Najim Dehak
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted in LREC 2026
Abstract:There are not enough established benchmarks for the task fo speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in setting where transcripts are unavailable.
[NLP-74] Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents
【速读】: 该论文试图解决长时程大语言模型(LLM)代理在跨会话交互中依赖持久记忆时,现有记忆系统因仅基于语义相似性或广义历史包含进行检索而导致的可靠性问题——即 retrieved memories 可能看似相关但实际无关、过时或具有误导性。解决方案的关键在于提出因果记忆干预(Causal Memory Intervention, CMI),其核心思想是通过因果推断估计候选记忆在受控干预下对模型输出的影响,从而选择能够提升任务表现的记忆,同时抑制不稳定、无关或有害的记忆。实验表明,CMI 在回答质量与抗误导性之间实现了更优平衡,验证了基于因果有用性而非单纯相关性的记忆选择机制对于构建可靠长期记忆系统的重要性。
链接: https://arxiv.org/abs/2605.17641
作者: Saksham Sahai Srivastava
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 3 tables
Abstract:Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model’s answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at this https URL.
[NLP-75] AI Agents May Always Fall for Prompt Injections
【速读】: 该论文试图解决当前AI代理中prompt injection(提示注入)漏洞的防御失效问题,尤其是现有“数据-指令分离”防御范式在应对通过上下文操纵的攻击时无能为力,并且会损害代理在特定情境下的合理行为。其解决方案的关键在于引入Contextual Integrity(CI,上下文完整性)这一隐私理论框架,将提示注入重新建模为信息流是否符合上下文规范的问题。作者通过构建三类典型攻击场景——(1) 误导信息流、(2) 操纵上下文规范、(3) 混合多流信息——揭示了当前防御机制的根本局限性,并提出一个不可能性结论:攻击者总能构造出使非法信息流看似合法的上下文,而防御者若收紧规范则可能误伤合法行为。因此,论文主张转向基于CI的系统性评估与对齐设计,以应对未来自主代理面临的更复杂、情境敏感的攻击面。
链接: https://arxiv.org/abs/2605.17634
作者: Sahar Abdelnabi,Eugene Bagdasarian
机构: ELLIS Institute Tübingen; MPI-IS Tübingen AI Center; University of Massachusetts Amherst
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Prompt injection is the most critical vulnerability in deployed AI agents. Despite recent progress, we show that the prevailing defense paradigm (data-instruction separation) both fails to detect attacks that operate through contextual manipulation and degrades contextually appropriate behavior. We then recast prompt injection via the lens of Contextual Integrity (CI), a privacy theory that judges information flow compliance with contextual norms. This explains types of attacks that current defenses attempt to patch and predict advanced ones future agents will face. We develop unique benign and attack scenarios that force an agent to violate the norms by (1) misrepresenting the flow, (2) manipulating norms, or (3) mixing multiple flows. This reframing suggests an impossibility result: an adversary can always construct a context under which a blocked flow appears legitimate, or a defender who tightens norms will block genuinely legitimate flows. Our findings suggest that current research addresses a shrinking fraction of future attack surfaces. Instead, through CI, we offer a principled framework for evaluating context-sensitive failures, and designing CI-aware alignment for the frontier autonomous agents.
[NLP-76] SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening
【速读】: 该论文试图解决在线视频平台和生成式AI内容快速增长背景下,如何实现高效且准确的视频内容安全审查(video guardrails)问题。现有方法普遍采用统一的大规模视觉-语言模型处理所有输入,导致计算资源浪费和推理成本过高。其解决方案的关键在于提出SafeLens框架,该框架引入“快慢并行”(fast-and-slow inference)架构,根据输入内容复杂度动态分配计算资源;同时通过影响引导过滤(influence-guided filtering)构建高质量小规模数据集(仅保留原始SafeWatch数据集的2.4%),并利用结构化思维链(Chain-of-Thought)增强测试时推理能力,从而在保持高精度的同时显著降低推理开销,证明了高效设计优于单纯扩大模型或数据规模。
链接: https://arxiv.org/abs/2605.17610
作者: Shahriar Kabir Nahin,Hadi Askari,Muhao Chen,Anshuman Chhabra
机构: University of South Florida (南佛罗里达大学); University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch Dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design serves to be more effective than scaling data or model size alone.
[NLP-77] Mixture of Experts for Low-Resource LLM s
【速读】: 该论文试图解决的问题是:在混合专家(Mixture-of-Experts, MoE)架构中,低资源语言(如希伯来语)的专家路由行为缺乏系统性理解,尤其是在模型深层出现的路由坍缩(routing collapse)现象。解决方案的关键在于:通过持续预训练(Continual Pre-Training, CPT)使用平衡的双语数据,显著提升低资源语言的路由熵(usage entropy),使token更均匀地分布到共享的、语言无关的专家上,从而缓解路由不均衡问题;同时发现,这种路由改进与下游任务性能提升存在正相关关系,表明路由熵和专家专业化程度可作为衡量MoE系统多语言能力的可解释指标。
链接: https://arxiv.org/abs/2605.17598
作者: Ori Bar Joseph,Smadar Arvatz,Noam Kayzer,Dan Revital,Sarel Weinberger
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models – a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B) – using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit \emphdeep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property. Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.
[NLP-78] How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
【速读】: 该论文试图解决的问题是:当前基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习训练框架在大型语言模型(LLM)中依赖低延迟、近在线策略(near-on-policy)的数据收集机制,导致系统开销较大,训练效率低下。其解决方案的关键在于提出 Mu-GRPO 框架,通过将训练划分为少量(如四次)大规模的生成-优化阶段,显著增加数据滚动延迟(rollout staleness),同时大幅降低生成与优化之间的切换开销;为稳定高延迟数据下的学习过程,Mu-GRPO 结合了松弛剪裁(relaxed clipping)以保留有用的历史梯度信息,以及负优势否决机制(negative-advantage veto)以剔除负面优势响应中的不稳定后触发后缀更新,从而在多个数学推理基准上实现了与标准 GRPO 相当或更优的性能,并带来约 2 倍的训练时间加速,显著改善了 LLM 强化学习中的性能-效率权衡。
链接: https://arxiv.org/abs/2605.17570
作者: Minghao Tian,Yunfei Xie,Chen Wei
机构: Rice University (莱斯大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
[NLP-79] Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models
【速读】: 该论文试图解决的问题是:当前基于棋类数据微调的语言模型(如ChessGPT、C1-4B)在基准测试中表现优异,但其是否真正理解了国际象棋规则或具备专业对弈能力仍存争议;现有研究常将高分数归因于模型的深层认知能力,而忽视了模式匹配(pattern-matching)可能带来的误导性解释。解决方案的关键在于提出KinGPT——一个仅用(局面, 最优着法)对训练的25M参数字符级语言模型,并引入LLM-Modulo框架(即“验证器内循环”),通过外部验证机制显著提升模型生成合法走法和准确走法的能力,从而证明:在结构化领域任务中,通用大语言模型(LLM)结合外部验证器的组合策略,比直接在合成数据上训练更灵活且成本更低,同时能实现接近专用模型的效果。
链接: https://arxiv.org/abs/2605.17565
作者: Ethan Tang
机构: Arizona State University (亚利桑那州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 2 figures, 4 tables, 3 equations
Abstract:Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, who exceeds 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and 4B-parameter C1-4B over a 20-theme puzzle benchmark. We examine several claims made in existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B’s best move accuracy from 1.2% to 21.2% and move generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to gains achieved from ChessGPT’s fine-tuning on chess-specific web corpora at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.
[NLP-80] Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
【速读】: 该论文试图解决训练工具调用代理(tool-calling agents)时缺乏大规模带可验证标签轨迹数据的问题。现有方法要么合成与真实API行为偏离的环境,要么生成无真实结果的任务难以验证。其解决方案的关键在于提出FireFly数据生成流水线,通过反转传统合成流程实现标签正确性保障:首先让强语言模型(LLM)在真实API上沿图引导的有向无环图(DAG)结构探索,再从观察到的结果反向合成任务,从而保证标签的构造正确性。为应对真实工具空间规模(约1000个工具),作者构建了成对工具图并采样子DAG以聚焦于语义一致的工作流;同时引入基于检索增强的模拟器缓存所有探索结果,在训练和评估阶段回放,实现完全离线、可复现的强化学习(RL)。该方案最终生成5,144个验证任务,覆盖240个服务器和993个工具,使用GRPO算法在4B参数模型上训练后,在测试集上达到与Claude Sonnet 4.6相当的效果,并在Tau2-Bench、MCPMark和MCP-Atlas等多个基准上取得提升。
链接: https://arxiv.org/abs/2605.17558
作者: Yuxuan Lu,Ziyi Wang,Yingzhou Lu,Yisi Sang,Jiri Gesi,Xianfeng Tang,Yimeng Zhang,Zhenwei Dai,Hui Liu,Hanqing Lu,Chen Luo,Qi He,Benoit Dumoulin,Jing Huang,Dakuo Wang
机构: Northeastern University (东北大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for verification. We present FireFly, a pipeline for generating verified tool-call data from real-world MCP servers. Our key insight is to invert the standard synthesis pipeline: rather than generating tasks and hoping they are solvable, we first let a strong LLM explore real APIs along graph-guided DAG structures, then synthesize tasks backward from observed outcomes, guaranteeing label correctness by construction. To handle the scale of real-world tool spaces ( \sim 1,000 tools), we build a pairwise tool graph and sample sub-DAGs to focus exploration on semantically coherent workflows. To address environment drift in live APIs, we construct a retrieval-augmented simulator that caches all exploration results and replays them during training and evaluation, enabling fully offline and reproducible RL. Applying this pipeline yields 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained with GRPO on FireFly matches Claude Sonnet 4.6 on our held-out test set and shows improvements on multiple tool-calling benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.
[NLP-81] CasualSynth: Generating Structurally Sound Synthetic Data
【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)生成的合成数据虽然在语义上逼真,但无法保证其输出符合目标领域的因果机制。解决方案的关键在于提出CausalSynth框架,该框架通过将因果结构生成与语义实现解耦,确保合成数据既具备因果有效性又具有语言丰富性。其核心创新包括:1)基于结构因果模型(SCM)和祖先采样生成满足全局马尔可夫性质的因果骨架;2)利用LLM作为约束性“实现器”将骨架映射为高维观测(如临床记录);3)引入迭代一致性验证模块,通过确定性提取检测结构违规并反馈修正,形成闭环优化过程。此外,论文识别出“语义后门”问题——即LLM倾向于用预训练先验覆盖人为施加的因果事实,并证明其迭代机制相比标准拒绝采样能显著降低由此引发的选择偏差。
链接: https://arxiv.org/abs/2605.17528
作者: Zehua Cheng,Wei Dai,Jiahao Sun,Thomas Lukasiewicz
机构: University of Oxford (牛津大学); TU Wien (维也纳工业大学); FLock.io (FLock.io)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages
Abstract:Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM) - a tuple of structural equations defined over a directed acyclic graph (DAG) generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained \emphrealizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem the systematic tendency of LLMs to override imposed causal facts with pre-training priors – and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal \alpha=0.05 level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
[NLP-82] Residual Semantic Decomposition of Word Embeddings
【速读】: 该论文试图解决的问题是:如何在保持词向量(word embeddings)中语义关系结构的同时,实现对词义的可解释性分解。现有方法往往难以兼顾嵌入重建精度与语义关系的保留,导致分解结果缺乏语义清晰性和可解释性。解决方案的关键在于提出残差语义分解(Residual Semantic Decomposition, RSD),这是一种神经加法式分解方法,通过递归二分方式提取局部语义轴(local semantic axis),同时利用残差部分揭示未被当前轴吸收的信息。RSD不仅提升了嵌入重构质量,还通过人工指定的成对上下文诊断验证了其对歧义词语义区分的有效性,尽管熵分析表明静态GloVe嵌入中的歧义词并非均匀分布于高熵边界点,因此作者将残差邻域视为定性诊断工具而非精确的词义预测基准。
链接: https://arxiv.org/abs/2605.17482
作者: Seungmin Jin
机构: HSE University (高等经济大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Short paper; includes appendix. Code and data are not included in the arXiv source package
Abstract:We introduce Residual Semantic Decomposition (RSD), a neural additive decomposition of word embeddings that balances embedding reconstruction with relational structure preservation. RSD supports recursive binary decomposition: each K=2 fit extracts a local semantic axis, while residuals expose information not absorbed by that axis. In manually specified paired-context diagnostics over ambiguous words, RSD separates supplied context anchors above shuffled-label controls, but entropy diagnostics show that ambiguous targets are not uniformly high-entropy boundary points in static GloVe. We therefore treat residual neighborhoods as qualitative diagnostics rather than benchmark sense predictions.
[NLP-83] Hybrid Feature Combinations with CNN for Bangla Fake News Classification
【速读】: 该论文旨在解决孟加拉语(Bangla)虚假新闻在互联网和社交媒体平台上的传播问题,这类虚假信息严重威胁真实媒体的可信度。针对现有研究在检测孟加拉语假新闻方面仍存在不足的问题,本文提出通过特征选择方法优化模型性能,具体探索了语义特征、统计特征与字符级特征及其组合在BanFakeNews-2.0数据集上对卷积神经网络(CNN)模型检测效果的影响。研究的关键发现是:将多种特征融合使用时,相较于单一特征,能显著提升召回率(recall)和F1分数(F1-score),从而增强模型对孟加拉语虚假新闻的识别能力。
链接: https://arxiv.org/abs/2605.17481
作者: Md Gulzar Hussain,Babe Sultana,Md Rinku Ali
机构: 未知
类目: Computation and Language (cs.CL)
备注: Already accepted and presented in the 3rd International Conference on Big Data, IoT and Machine Learning (BIM 2025)
Abstract:Nowadays, people in Bangladesh frequently rely on the internet and social media for daily news instead of traditional newspapers. However, the spread of false Bangla news through these platforms poses risks and challenges to the credibility of authentic media. Although several studies have been conducted on detecting Bangla fake news, there is still significant room for improvement in this area. To assist people, this research explores the effectiveness of feature selection approaches in identifying appropriate features, such as semantic, statistical, and character-level features, or their combinations, on the BanFakeNews-2.0 dataset for detecting Bangla fake news using a CNN model. In this paper, key findings reveal that combining multiple features significantly improves recall and F1-scores compared to using individual features alone. The code for this research can be availed here, this https URL_FNews\this http URL.
[NLP-84] VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems
【速读】: 该论文旨在解决大规模语言模型驱动的多智能体系统(LLM-MAS)中因不可靠智能体导致的系统级可靠性问题,尤其是现有方法在故障归因时无法捕捉全局失败模式(如跨步骤不一致性和跨智能体协调错误)且搜索空间过大的局限性。其解决方案的关键在于提出VerifyMAS——一个基于假设验证的故障归因框架:通过构建结构化的失败假设并针对完整交互轨迹进行验证,将归因任务分解为轨迹级错误验证与细粒度智能体定位两个步骤,从而实现“以错误为导向”的归因策略,显著降低搜索复杂度的同时捕获全局失败特征。此外,论文还设计了基于误差分类体系的假设数据构造方法,并微调专用LLM验证器模型用于轨迹级验证和智能体定位,在Aegis-Bench和WhoWhen基准上验证了其对多种骨干模型(包括开源Qwen和API调用GPT模型)的泛化提升能力,且保持长轨迹推理效率。
链接: https://arxiv.org/abs/2605.17467
作者: Hezhe Qiao,Hanghang Tong,Ee-Peng Lim,Bing Liu,Guansong Pang
机构: Singapore Management University (新加坡管理大学); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Illinois at Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注: 22 pages
Abstract:Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and WhoWhen show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.
[NLP-85] rust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
【速读】: 该论文试图解决的问题是:当前基于工具的大型语言模型(LLM)代理在依赖外部工具做重要决策时,现有安全评估基准和防御机制普遍假设工具反馈一旦被选择就是可信的,而忽略了“认知中毒”(cognitive poisoning)这一新型攻击模式——即恶意工具在探索阶段表现合理、通过看似无害的反馈积累信任,仅在特定隐藏状态条件满足时才触发有害行为。解决方案的关键在于提出一种轨迹感知的风险评分框架 VISTA-Guard,其核心思想是将多步工具交互抽象为编码信任形成动态的结构化环境变量,并基于此轨迹条件表示对最终执行动作进行风险评分。实验表明,传统提示中心启发式方法、标量特征融合及零样本判别器在此场景下失效,而VISTA-Guard在域内任务中达到84.2的GuardedJoint得分,在平衡的域外迁移测试中仍保持56.9的性能,显著优于仅优化安全或效用单一维度的方法,验证了将信任形成过程建模为交互轨迹动态是黑箱工具生态系统中代理安全防御的核心。
链接: https://arxiv.org/abs/2605.17453
作者: Lecheng Yan,Ruizhe Li,Xicheng Han,Wenxi Li,Binwu Wang,Longyue Wang,Chenyang Lyu,Guanhua Chen
机构: Southern University of Science and Technology (南方科技大学); University of Science and Technology of China (中国科学技术大学); University of Birmingham (伯明翰大学); Zhejiang University (浙江大学); East China Normal University (华东师范大学); Alibaba Group (阿里巴巴集团)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Tool-using LLM agents increasingly rely on external tools to make consequential decisions, yet most existing agent-security benchmarks and defenses implicitly assume that tool feedback is trustworthy once a tool has been selected. We study a different failure mode, cognitive poisoning, in which a malicious tool behaves plausibly during exploration, accumulates trust through benign-looking feedback, and becomes harmful only when hidden state conditions align with the final executable action. To study this setting, we construct TRUST-Bench, a task-conditioned benchmark of 1,970 hidden-trigger tool-compromise episodes with matched safe controls, introduce an asymmetric penalty metric, GuardedJoint, to better reflect real deployment risk, and present VISTA-Guard, a backbone-agnostic framework for final-action risk scoring. The core idea is to abstract multi-step tool interaction into structured environment variables that encode trust-formation dynamics and then score the risk of the final executable action from this trajectory-conditioned representation. Experiments show that prompt-centric heuristics, scalarized features, and zero-shot judges fail in this regime, whereas trajectory-aware final-action scoring yields strong in-domain discrimination and remains effective under balanced out-of-distribution transfer. Under GuardedJoint, VISTA-Guard reaches 84.2 in-domain and 56.9 on balanced out-of-distribution evaluation, while methods that optimize only one side of the safety–utility tradeoff collapse to zero. These findings support a broader view of agent security in black-box tool ecosystems: the decisive defense target is not local prompt text or tool descriptors alone, but the way trust is formed across the interaction trajectory and committed through the final action.
[NLP-86] ContraFix: Agent ic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)代理在自动化漏洞修复(Automated Vulnerability Repair, AVR)中因语义理解错误而导致的修复失效问题,即代理常选择与漏洞根本原因不匹配的修复方向。解决方案的关键在于提出ContraFix框架,其通过耦合差分运行时证据与可复用的修复技能来实现精准定位和高效修复:首先,Mutator模块构造PoC变体以跨越故障边界;Analyzer模块在故障区域插入状态探针并提取崩溃与非崩溃执行间的差异,形成修复规范;Patcher模块将规范转化为验证后的源码补丁;每次成功修复后,系统更新包含修复规范和变异策略的双轨技能库,并通过三层策略检索复用,从而避免重复诊断。
链接: https://arxiv.org/abs/2605.17450
作者: Simiao Liu,Fang Liu,Li Zhang,Yang Liu,Yinghao Zhu
机构: Beihang University (北京航空航天大学); The University of Hong Kong (香港大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom-oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances. On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state-of-the-art performance on both benchmarks while costing less than one-third of the strongest comparable baseline.
[NLP-87] Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback
【速读】: 该论文试图解决当前生成式CAD(计算机辅助设计)模型在工业设计流程中存在的重要缺陷:现有方法将CAD生成拆分为独立的零件合成与装配步骤,缺乏工程师迭代优化的能力,且未考虑工程实际所需的物理验证(如有限元分析,FEA)。解决方案的关键在于提出一个更贴近工业实践的任务范式——从自由文本工程说明直接生成完整的多部件STEP文件,并通过FEA进行结构有效性验证;同时引入两种新型监督信号:一种基于文本的蓝图schema用于指导生成逻辑,以及21视角图像渲染器辅助模型视觉校验。实验表明,这些改进显著提升了几何重建质量(如Box-IoU指标提升),并推动生成结果向兼具视觉合理性与工程可执行性的方向发展。
链接: https://arxiv.org/abs/2605.17448
作者: Guijin Son,Jehyun Park,Seyeon Park,Sunghee Ahn,Youngjae Yu
机构: Seoul National University(首尔国立大学); OneLineAI; Sungkyunkwan University(成均馆大学); VF Space
类目: Graphics (cs.GR); Computation and Language (cs.CL)
备注: Work in progress
Abstract:Computer-aided design (CAD) is the backbone of modern industrial design, yet learned CAD generators still fall short of real engineering pipelines: they neither iterate like engineers nor evaluate what engineering requires. Prior work has treated CAD generation as two disjoint steps, part synthesis and assembly, where the former is graded by proximity to a gold reference and the latter, when handled at all, is reduced to a separate constraint solving step. In this work, we introduce a more industry-native task formulation that requires a model to produce a fully assembled multi-part STEP file from a free-form engineering brief, which is then validated via finite element analysis (FEA). FEA validation reveals that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents do not produce a single strict-passing artifact in the main first-attempt sweep, with the best configuration meeting only about 20% of typed requirements on average. Moreover, we introduce two additional supervision signals, a novel text-only blueprint schema and a 21-view image renderer that aids the agent’s visual inspection, that better align the generation loop with how engineers iterate in practice. On S2O and Fusion360, the same feedback tools improve geometric reconstruction, with GPT-5.5/xhigh rising from 0.444 to 0.592 Box-IoU on S2O and from 0.397 to 0.505 on Fusion360. Together these signals move CAD programs toward artifacts that are not only visually plausible but also checked against physical and structural requirements.
[NLP-88] FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理密集文档图像时因大量视觉标记(visual tokens)导致的推理成本过高问题。现有剪枝方法依赖于物理删除(如在预填充阶段永久丢弃视觉标记),但在光学字符识别(OCR)任务中会因不可逆的信息丢失而导致精度灾难性下降,因为每个视觉标记可能对应一个字符或结构元素。论文提出的关键解决方案是基于“动态视觉注视”(Dynamic Visual Fixation)现象:尽管文档图像整体上看似密集且难以剪枝,但模型在每个解码步骤中仅关注一个小区域,并随时间逐步移动——类似于人类阅读时逐词注视的过程。为此,作者设计了无需训练的FastOCR框架,包含两个互补模块:Focal-Guided Pruning通过识别关键层并动态选择每步中最相关的视觉标记来减少注意力计算;Cross-Step Fixation Reuse则利用注视区域的渐进移动特性,用前一步的结果作为当前步骤的热启动(warm-start)。该方案不永久移除任何标记,而是动态调整注意力分配,从而避免信息损失。实验表明,FastOCR可作为即插即用加速模块,在五种不同规模和架构的VLM上保持一致性能,例如在Qwen2.5-VL上仅需5%的视觉标记即可保留98%原始准确率,同时将注意力延迟降低3.0倍。
链接: https://arxiv.org/abs/2605.17447
作者: Zihan Tang,Leqi Shen,Hui Chen,Ao Wang,Ben Wan,Yan Feng,Ke Zhang,Sicheng Zhao,Tongxuan Liu,Guiguang Ding
机构: Tsinghua University (清华大学); JD.com
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model’s attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model’s accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0 \times .
[NLP-89] MemRepair: Hierarchical Memory for Agent ic Repository-Level Vulnerability Repair
【速读】: 该论文试图解决的问题是:当前基于大语言模型(Large Language Model, LLM)的自动化漏洞修复(Automated Vulnerability Repair, AVR)系统在处理复杂、多文件的修复任务时,因缺乏持续记忆机制而难以复用历史修复经验或从失败中学习,导致修复效果受限。解决方案的关键在于提出MemRepair——一个基于记忆增强的代理框架,通过引入三个互补的记忆层(History-Fix、Security-Pattern和Refinement-Trajectory)与动态反馈驱动的迭代优化循环,使代理能够检索特定仓库的修复惯例、应用可复用的安全模式,并利用“失败到成功”的轨迹修正语义无效的补丁,从而显著提升跨语言和多文件场景下的修复可靠性与成功率。
链接: https://arxiv.org/abs/2605.17444
作者: Simiao Liu,Li Zhang,Fang Liu,Xiaoli Lian,Yang Liu,Yinghao Zhu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Modern software ecosystems face a rapidly growing number of disclosed vulnerabilities, increasing the need for automated repair techniques that can operate reliably at repository scale. Although Large Language Model (LLM)-based agents have recently shown promise for automated vulnerability repair (AVR), most existing systems still treat repair as a single generation step over the currently visible code context. As a result, they lack a persistent mechanism for reusing prior fixes or learning from failed validation attempts, which limits their effectiveness on complex, multi-file repair tasks. We present MemRepair, a memory-augmented agentic framework that formulates vulnerability repair as an iterative, experience-driven process. MemRepair combines three complementary memory layers, i.e., History-Fix, Security-Pattern, and Refinement-Trajectory memories, with a dynamic feedback-driven refinement loop. This design allows the agent to retrieve repository-specific repair conventions, apply reusable security defenses, and exploit prior “failure-to-success” trajectories to revise semantically invalid patches based on runtime evidence. We evaluate MemRepair on three representative repository-level vulnerability repair benchmarks: SEC-Bench, PatchEval (Python, Go, JavaScript), and the C++ subset of Multi-SWE-bench. MemRepair achieves state-of-the-art resolution rates of 58.0%, 58.2%, and 30.58%, respectively, outperforming strong general-purpose agents such as OpenHands and SWE-agent, as well as the specialized AVR tool InfCode-C++, while maintaining competitive repair cost. These results show that persistent, hierarchical repair memory can substantially improve the reliability of agentic vulnerability repair across diverse languages and repository settings.
[NLP-90] Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades
【速读】: 该论文试图解决的问题是:在韩语口语问答(SQA)任务中,自动语音识别(ASR)错误如何通过ASR-大语言模型(LLM)级联结构传播,并导致下游语义层面的失败,而这类失败无法被传统的ASR评估指标充分捕捉。解决方案的关键在于:首先,发现不同性能的LLM在级联系统中对ASR错误的相对下游退化程度一致,表明级联性能下降主要由ASR阶段的信息损失决定;其次,识别出单字级韩语ASR错误是一种独特的语义失效通道,即使转录差异极小,也可能导致正确答案完全从下游预测中消失;最后,通过对比实验验证,直接使用音频输入的大语言模型(Audio Language Model, ALM)在噪声环境下优于基于ASR-LLM的流水线架构,证明了采用直接音频输入可有效缓解因转录引入的信息损失问题。
链接: https://arxiv.org/abs/2605.17443
作者: Donghyuk Jung,Youngwon Choi
机构: Korea Institute of Culture Technology (韩国文化技术研究院); Maum AI Inc. (Maum AI公司)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Preprint. Submitted to APSIPA ASC 2026
Abstract:We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a distinct semantic-failure channel, where the gold answer becomes entirely absent from the downstream prediction despite only a minimal transcription difference. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.
[NLP-91] Medical Context Distorts Decisions in Clinical Vision Language Models
【速读】: 该论文试图解决的问题是:当前视觉-语言模型(Vision-Language Models, VLMs)在临床决策支持中的可靠性尚未得到充分验证,尤其是在需要整合医学记录中图像与文本上下文的现实场景下,其性能表现存在不确定性。解决方案的关键在于系统性地识别和量化三种失效模式:(1)对文本模态的过度依赖,即使图像中有明确证据;(2)对无关临床病史的虚假依赖;(3)对语义等价输入的提示敏感性。研究通过在MIMIC-CXR数据集上操纵图像-文本对齐、临床病史内容及提示格式,发现多数VLM的决策受文本主导,且极易被无关报告干扰或因微小提示变化而反转正确判断,从而强调了在临床部署前必须引入显式防护机制和压力测试的重要性。
链接: https://arxiv.org/abs/2605.17436
作者: David Restrepo,Ira Ktena,Maria Vakalopoulou,Stergios Christodoulidis,Enzo Ferrante
机构: CentraleSupélec - Université Paris-Saclay (巴黎萨克雷大学); IHU PRISM (PRISM研究所); National Institute in Precision Oncology (精准肿瘤学国家研究所); Gustave Roussy (古斯塔夫·鲁西癌症中心); INSERM (法国国家健康与医学研究院); Ellison Institute of Technology (EIT) (埃利森技术研究所); CONICET (阿根廷国家科学技术研究委员会); Universidad de Buenos Aires (布宜诺斯艾利斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.
[NLP-92] BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering
【速读】: 该论文试图解决的问题是:在生物医学问答任务中,从检索到的文献中提取的信息存在相关性、质量和对候选答案的支持度不一致的问题,而现有的检索增强型大语言模型(Retrieval-Augmented Large Language Model, RAG-LM)方法通常将文献作为扁平文本输入模型,导致证据可靠性与剩余不确定性难以显式建模。解决方案的关键在于提出 BELIEF 框架——一种结构化的证据建模与不确定性感知融合机制。BELIEF 将检索到的文档转化为包含临床属性、来源质量、问题相关性、支持强度及对应假设的证据对象(evidence objects),并基于此构建两条互补推理路径:符号路径利用 Dempster–Shafer 理论生成可靠性加权的基本概率分配,并进行不确定性感知的符号证据融合以估计信念和残余不确定性;神经路径则利用相同结构化证据进行 LLM 的语义推理,同时通过一个可靠性感知仲裁模块根据信念强度、不确定性、证据可靠性和语义一致性协调符号与神经输出。实验表明,BELIEF 在多个数据集和模型设置下显著优于基线方法,其核心优势在于显式建模证据结构、路径分歧与决策不确定性,从而提升对检索证据的利用率。
链接: https://arxiv.org/abs/2605.17435
作者: Chang Zong,Hao Ning,Siliang Tang,Jie Huang,Jian Wan
机构: Zhejiang University of Science and Technology (浙江科技学院); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures
Abstract:Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster–Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone–dataset–metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.
[NLP-93] MiniGPT : Rebuilding GPT from First Principles
【速读】: 该论文试图解决的问题是如何以简洁、可复现的方式从零实现一个基于GPT架构的自回归语言模型,从而帮助研究人员和学习者深入理解Transformer模型的核心组件及其训练流程。解决方案的关键在于构建一个完整的、端到端的PyTorch实现(MiniGPT),其特点包括:独立编写且集中于单个笔记本中的模型与训练代码;完整实现token嵌入、位置编码、因果多头自注意力机制、预LayerNorm结构、残差连接、前馈MLP层以及基于教师强制(teacher forcing)的下一个词预测训练;同时在Tiny Shakespeare字符级数据集上进行评估,验证了不同参数规模模型的训练收敛性和生成质量,展示了从原始文本到可生成具有结构化风格文本的完整路径。
链接: https://arxiv.org/abs/2605.17398
作者: Jibin Joseph
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 2 figures
Abstract:This paper presents MiniGPT, a compact from-scratch implementation of GPT-style autoregressive language modeling in PyTorch. The aim is to rebuild the core GPT pipeline from first principles after studying the design of nanoGPT by Andrej Karpathy, while keeping the model and training code independently written in a single notebook. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation. This paper evaluates the implementation on Tiny Shakespeare dataset using character-level tokenization. A baseline 0.83M-parameter model reaches a validation loss of 1.7236 after 3000 training iterations. A stronger 10.77M-parameter configuration, using a larger context length and improved training settings, reaches a best validation loss of 1.4780 and generates text with recognizable Shakespeare-style dialogue structure. MiniGPT does not introduce a new language-model architecture. Instead, it documents a clear and reproducible implementation path from raw text to trained character-level generation, including design choices, training behavior, generation quality, and practical limitations.
[NLP-94] QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI
【速读】: 该论文试图解决生成式 AI(Generative AI)在开放性、创造性及面向人类的任务中,现有评估方法存在根本性局限的问题。传统自动指标依赖表面统计相似性,难以反映人类对质量的感知;纯人工评估虽可靠但成本高、主观性强且难以扩展;而基于大语言模型(LLM)的评估方法虽提升了可扩展性,却常缺乏人类定义的评价原则支撑,导致偏差和不一致。论文提出的关键解决方案是Quantifying Qualitative Judgment(QQJ),其核心在于将“质量定义”与“执行评估”分离:通过专家设计的多维评分量表锚定评价标准,并利用少量高质量标注数据校准LLM评估器,使其推理过程与专家判断对齐。这一设计实现了跨任务和模态的一致性、可解释性和可扩展性评估,实验表明QQJ显著优于传统自动指标和无约束LLM评估,在人类判断一致性、重复评估稳定性以及识别幻觉和意图错位等关键失败模式方面表现更优,从而为现代生成式AI系统提供了可靠、可操作且以人为本的评估基础。
链接: https://arxiv.org/abs/2605.17382
作者: Marjan Veysi,Pirooz Shamsinejadbabaki,Mohammad Zare,Mohammad Sabouri
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
备注:
Abstract:The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.
[NLP-95] Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization ACL
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在特定领域(如法律和医学)文本摘要任务中因通用预训练语料库导致的分词效率低下问题,特别是由于词汇表不匹配引发的大量未登录词(Out-of-Vocabulary, OOV)现象。解决方案的关键在于提出一种目标导向的、参数高效的领域适配方法:通过在预训练分词器中引入领域特定词汇,并选择性替换低训练质量或无法被模型访问的token,从而在控制参数增长的同时提升模型对专业术语的表达能力。实验表明,该方法显著改善了生成摘要与参考摘要之间的语义相似度,并增强了摘要的连贯性、相关性和忠实性,同时相比持续预训练减少35–55%训练时间,相比仅扩展词汇的方法降低高达37%的参数量。
链接: https://arxiv.org/abs/2605.17379
作者: Gunjan Balde,Soumyadeep Roy,Mainack Mondal,Niloy Ganguly
机构: IIT Kharagpur (印度理工学院克哈格普尔分校); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages. Accepted in the 64th Annual Meeting of the Association for Computational Linguistics [ACL (Main) 2026] as a long paper
Abstract:Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries which typically has higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by 35-55% over continual pretraining and reduce parameter counts up to 37% w.r.t expansion-only methods. We make the codebase publicly available at this https URL.
[NLP-96] Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains
【速读】: 该论文旨在解决多智能体大语言模型(Multi-Agent LLM)在跨域复杂推理任务中因每次查询都从头搜索协作拓扑结构而导致的在线搜索开销大、推理时token消耗高以及可扩展性差的问题。其解决方案的关键在于提出一种名为TopoPrior的框架,通过离线学习可迁移的拓扑先验(topology prior),并在在线推理时基于查询条件生成初始协作图以供下游优化。该方法将部分拓扑搜索成本从每次查询的在线优化转移到离线先验学习阶段,从而实现计算资源的摊销,并保持与现有拓扑演化方法的兼容性。技术上,TopoPrior包含两个核心组件:一是基于条件变分图框架的可迁移拓扑先验学习模块,用于在潜在空间中捕捉跨域的通用结构规律;二是查询条件化的潜在适应模块,引入对抗对齐机制以减少不必要的领域差异并保留与查询相关的结构变化。实验表明,TopoPrior在多个跨域推理基准上显著提升了多种异构拓扑演化基线的性能,同时大幅降低在线推理时的token使用量,仅需少量额外训练参数,验证了可迁移拓扑初始化是提升多智能体LLM协作效率的有效且轻量的方法。
链接: https://arxiv.org/abs/2605.17359
作者: Taolin Zhang,Zijie Zhou,Jiuheng Wan,Tingyuan Hu,Chengyu Wang,Xiaofeng He,Richang Hong
机构: Hefei University of Technology (合肥工业大学); China University of Petroleum (北京) (中国石油大学(北京)); East China Normal University (华东师范大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query-conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.
[NLP-97] HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction
【速读】: 该论文试图解决的问题是:现有文本驱动的人格预测方法大多依赖于浅层或单一层次的文本表示,忽略了语言固有的多层次结构(如文档级全局特征、句子级局部语义和词级细粒度信息),从而限制了人格推断的准确性与人类认知的一致性。解决方案的关键在于提出HyperPersona框架,其核心创新是通过超图(hypergraph)结构显式建模文本的多层级组织——将文档和句子作为超边、词汇作为节点,实现对全局、局部及词法依赖关系的联合建模;并进一步引入基于Transformer的图编码器,学习跨层级的语言交互,生成具有上下文敏感性和结构基础的特征表示,从而显著提升人格预测性能。
链接: https://arxiv.org/abs/2605.17355
作者: Sina Heydari,Majid Ramezani
机构: Institude for Advanced Studies in Basic Sciences (IASBS)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. Submitted to Artificial Intelligence (Elsevier)
Abstract:As a modern commodity, language has become a vast repository of socially and psychologically significant traits and concepts, reflecting the ways people encode pattern of thoughts, behaviors, and emotions into words. Text-based Automatic Personality Prediction (APP), seeks to infer personality from linguistic behavior, offering a scalable alternative to traditional psychometric assessments. Although text is inherently hierarchical, with the document-level capturing global features, the sentence-level encoding local semantics, and the word-level providing fine-grained lexical information, most existing approaches rely on shallow, sequential, or single-level representations that ignore the multi-level structure of written language. To address this, we propose HyperPersona, a framework that explicitly models the hierarchical organization of text (document, sentence, and word) through hypergraph structure, where a document and its sentences are represented as hyperedges, and the words are represented as nodes, enabling joint modeling of global, local, and lexical dependencies of text. Followed by a transformer-based graph encoder that learns interactions within and across these linguistic layers, yielding context-sensitive and structurally grounded feature representations for personality prediction. Experiments on the Big Five personality dimensions show that, while relying solely on text, HyperPersona effectively integrates multi-level linguistic cues, achieving superior performance compared to state-of-the-art baselines. These findings underscore the critical role of textual hierarchy in advancing human-like personality inference from natural language.
[NLP-98] AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型问答任务中生成事实不一致回答的问题,核心挑战在于幻觉现象以及模型对长尾知识缺口的覆盖能力不足。解决方案的关键在于提出AMATA框架——一种自适应多智能体轨迹对齐机制,通过动态整合外部知识来提升回答的可解释性和事实准确性。其创新性体现在两个方面:一是引入轨迹内偏好学习(Intra-Trajectory Preference Learning),基于目标导向的偏好选择关键智能体以优化推理路径;二是提出跨智能体依赖学习(Inter-Agent Dependency Learning),利用新型依赖感知的直接偏好优化方法建模智能体间的工具依赖关系。实验证明,AMATA在五个主流知识密集型问答基准上均显著优于基线模型和现有知识增强系统,并在减少token消耗方面展现出高效性。
链接: https://arxiv.org/abs/2605.17352
作者: Taolin Zhang,Dongyang Li,Chen Chen,Qizhou Chen,Jiuheng Wan,Xiaofeng He,Chengyu Wang,Richang Hong
机构: Hefei University of Technology (合肥工业大学); Shanghai University of Electric Power (上海电力大学); East China Normal University (华东师范大学); Guangdong University of Finance and Economics (广东财经大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.
[NLP-99] aming "Zombie Agents : A Markov State-Aware Framework for Resilient Multi-Agent Evolution
【速读】: 该论文试图解决的问题是:当前基于大语言模型(LLM)的多智能体系统在协作过程中,常通过激进的图结构演化(如节点或边的硬剪枝)来提升效率,但这种方法容易因瞬时问题(如幻觉或临时知识缺口)过早淘汰具有潜力的智能体,忽视了这些“僵尸”智能体(zombie agents)在未来可能恢复并作出贡献的可能性。解决方案的关键在于提出AgentRevive框架,其核心创新是引入马尔可夫状态感知机制,实现智能体协作的软状态迁移:一是通过状态感知策略学习,将智能体分为“活跃”、“待机”和“终止”三类,并基于记忆信息选择性传播消息,同时利用风险估计器优化状态转移以降低不可靠节点的影响;二是通过状态感知的边优化策略,在保留待机状态智能体的同时永久移除终止状态节点,从而在保证任务性能的前提下显著减少Token消耗,实现更鲁棒且高效的多智能体演化。
链接: https://arxiv.org/abs/2605.17348
作者: Taolin Zhang,Pukun Zhao,Qizhou Chen,Jiuheng Wan,Chen Chen,Xiaofeng He,Chengyu Wang,Richang Hong
机构: Hefei University of Technology (合肥工业大学); Guangdong University of Finance and Economics (广东财经大学); Alibaba Group (阿里巴巴集团); East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for zombie'' agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into Active’‘, Standby'', and Terminated’’ states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing Terminated'' nodes and retaining Standby’’ nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.
[NLP-100] ransitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment ICML2026
【速读】: 该论文试图解决标准强化学习人类反馈(RLHF)方法依赖传递性标量奖励信号而无法捕捉人类偏好中循环特性的问题。现有模型如通用偏好模型(GPM)虽尝试建模循环偏好,但其隐式表述将层次结构与循环性纠缠在一起,导致无法保证最优解的存在。解决方案的关键在于提出混合奖励-循环(HRC)模型,通过博弈论分解方法显式地将偏好解耦为正交的传递性(标量)和循环性(向量)分量;同时引入动态自对弈偏好优化(DSPPO),将对齐过程建模为时变博弈,逐步引导策略收敛至纳什均衡。实验表明,HRC在合成数据和RewardBench 2上的表现优于BT与GPM基线,尤其在“平局”领域展现出更强鲁棒性,且在AlpacaEval 2.0、Arena-Hard-v0.1和MT-Bench等下游任务中显著优于SPPO基线,验证了其有效性与泛化能力。
链接: https://arxiv.org/abs/2605.17342
作者: Yucong Huang,Xiucheng Li,Kaiqi Zhao,Jing Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC’s structural superiority in mixed transitive–cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model’s robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at this https URL.
[NLP-101] Weak-to-Strong Elicitation via Mismatched Wrong Drafts
【速读】: 该论文试图解决的问题是:如何通过更高效的方式提升强学习模型(如Mathstral-7B)在数学推理任务上的性能,尤其是在传统基于策略强化学习(on-policy RL fine-tuning,例如GRPO)难以达到最优效果时。解决方案的关键在于引入来自较弱但领域训练更充分的模型(Qwen2.5-Math-1.5B)所产生的“错误草稿”(mathematically wrong drafts),并将这些不匹配当前问题的草稿注入到强模型的GRPO训练过程中——这种“错位混合”策略反而显著优于标准的匹配或正确草稿方式。实验表明,这种不匹配的错误样本能有效激发强模型的能力,在MATH-500和AIME 2025/2026等测试集上均取得领先结果,且整个训练流程仅需单卡、无需监督微调(SFT)、奖励模型或合成数据,实现了高效率与高性能的结合。
链接: https://arxiv.org/abs/2605.17314
作者: Wei Deng
机构: Mistral AI (Mistral AI); Yang et al. [2024]
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model – mismatched to the current problem – into a stronger learner’s GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3–5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields +1.62 pp on MATH-500 (greedy pass@1) over the matched-wrong variant ( n=10 seeds, p=0.0015 , Welch’s t ). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@ k . On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@ k above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from k=1 to k=1024 across 2 seeds ( +14.2 pp on 2025 and +9.0 pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe – trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop – reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).
[NLP-102] CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)自纠错能力缺乏系统性和理论保障的问题,即当前方法多依赖于通用提示(如“请重新考虑你的答案”),无法进行结构化错误诊断或保证收敛性。其解决方案的关键在于将自纠错过程建模为一个闭环控制系统(closed-loop control system),引入三模态误差检测器(结合自一致性、置信度表述和逻辑链验证)、类型导向的纠错控制器(根据错误类别生成针对性修复指令)以及基于控制论稳定性准则的收敛判断机制,从而实现可量化、可控制的纠错过程。该框架在自建的CyberCorrect-Bench基准上验证了有效性,最终准确率达到79.8%,较现有最优方法提升6.2个百分点,同时通过收敛控制机制将过度纠错(overshoot)减少41%。
链接: https://arxiv.org/abs/2605.17305
作者: Yuning Wu,Yingmin Liu,Yang Shu
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 1 figure, submitted to IEEE SMC 2026
Abstract:Large language model (LLM) self-correction – the ability to detect and fix errors in generated outputs – remains largely ad hoc, relying on generic prompts such as “please reconsider your answer” without systematic error analysis or convergence guarantees. We propose CyberCorrect, a framework that formalizes LLM self-correction as a closed-loop control system grounded in cybernetic theory. The framework models the LLM generator as the plant and introduces a tri-modal Error Detector (combining self-consistency, verbalized confidence, and logic-chain verification) as the sensor. A type-directed Correction Controller generates targeted repair instructions based on diagnosed error categories, while a Convergence Judge determines iteration termination using stability criteria adapted from control theory. We further introduce three control-theoretic evaluation metrics – convergence rate, overshoot rate, and oscillation rate – that capture correction dynamics beyond final accuracy. Experiments on our constructed CyberCorrect-Bench (440 reasoning tasks with annotated error types and correction paths) show that CyberCorrect achieves 79.8% final accuracy, improving upon the best existing self-correction method by 6.2 percentage points, while reducing overshoot (erroneous over-correction) by 41% through its convergence control mechanism.
[NLP-103] Compress the Context Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression
【速读】: 该论文试图解决大语言模型(LLM)在长对话中因上下文长度限制而面临的核心问题:现有上下文管理方法(如截断、摘要或记忆系统)通常仅关注token数量的压缩,却忽视了对话中关键语义承诺(semantic commitments)的保留与可验证性。解决方案的关键在于提出一种名为Context Codec的承诺层级框架,将对话状态建模为具有明确类型、来源溯源、等价关系、冲突检测、置信度和证据跨度的语义原子(semantic atoms),并分离出提取、归一化、表示、渲染和验证五个独立模块。该框架引入了Critical Atom Recall、Commitment Density等量化指标,定义了语义压缩错误分类体系,并设计了Context Compression Language (CCL) 用于高效且可审计的编码方式,从而实现“压缩对话,保留承诺”的目标,使上下文压缩过程具备可测量、可验证性和安全性保障。
链接: https://arxiv.org/abs/2605.17304
作者: Natalia Trukhina,Vadim Vashkelis
机构: Embedded Intelligence Lab (EMILAB)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:LLM context is not just tokens; it is a set of commitments. Long-running conversations accumulate goals, constraints, decisions, preferences, tool results, retrieved evidence, artifacts, and safety boundaries that future responses must preserve. Existing context-management methods reduce length through truncation, retrieval, summarization, memory systems, or token-level prompt compression, but they rarely specify which semantic commitments must survive compression or how their preservation should be measured. We propose Context Codec, a commitment-level framework for compressing prompts and chat histories. Context Codec represents dialogue state as typed, source-grounded semantic atoms with canonical identity, equivalence, conflict, confidence, risk, and evidence spans. It separates five concerns - extraction, normalization, representation, rendering, and verification - and introduces metrics for Critical Atom Recall, Weighted Atom Recall, Commitment Density, and round-trip recoverability. It also defines a taxonomy of semantic compression errors, a concrete normalization procedure, conservative fallback rules for low-confidence and safety-critical atoms, and Context Compression Language (CCL), an ASCII-first compact rendering of canonical JSON atoms. In a small diagnostic study, CCL-Core occupies a useful middle ground between structured prose and JSON: more explicit and auditable than prose, usually more compact than JSON, and less risky than heavily minified notation. The result is not a claim that shorthand solves compression, but a framework for making context compression verifiable: compress the conversation, keep the commitments. Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) Cite as: arXiv:2605.17304 [cs.LG] (or arXiv:2605.17304v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17304 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-104] ConflictRAG : Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation
【速读】: 该论文试图解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中一个关键问题:即假设检索到的文档之间具有内在一致性,而这一假设在实际应用中常常失效,导致生成答案出现知识冲突。解决方案的关键在于提出 ConflictRAG 框架,其核心创新包括:(1) 一种两阶段冲突检测模块,结合轻量级嵌入式 MLP 分类器与选择性大语言模型(Large Language Model, LLM)精炼机制,在降低 62% API 调用成本的同时保持 90.8% 的检测准确率;(2) 基于熵权 TOPSIS 的数据驱动源可信度评估方法,相较人工启发式规则提升 7.1% 的选择准确性;(3) 引入冲突感知的 RAG 评分(Conflict-Aware RAG Score, CARS),用于诊断和量化系统处理知识冲突的能力。实验表明,该框架在三个基准测试上实现 88.7% 的冲突检测 F1 分数,并在正确性上持续优于最强基线 5.3–6.1%,且具备跨不同基础大模型的良好迁移能力。
链接: https://arxiv.org/abs/2605.17301
作者: Chenyu Wang,Yingmin Liu,Yang Shu
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures, submitted to IEEE SMC 2026
Abstract:Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents – an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3–6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.
[NLP-105] DISA: Offline Importance Sampling for Distribution-Matching LLM -RL
【速读】: 该论文试图解决的是强化学习(RL)在生成式推理任务中因标准奖励最大化方法导致的解空间模式坍缩问题,即模型倾向于收敛到单一高奖励路径,而忽视其他有效解路径。解决方案的关键在于提出一种名为DISA(Decoupled Importance-Sampled Anchoring)的新框架,其核心创新是将分布匹配(distribution-matching)所需的轨迹空间分区函数(partition function)估计从强化学习循环中解耦出来:DISA通过离线采样提案轨迹并利用重要性采样估算分区函数,随后冻结该估计值再进行策略优化。这种解耦机制不仅保持了分布匹配的目标,还实现了数据、梯度、损失和诊断层面的完全分离,从而显著提升了策略学习的稳定性与多样性,在数学和代码基准测试中均优于现有方法,且策略层级多样性明显优于奖励最大化基线。
链接: https://arxiv.org/abs/2605.17295
作者: Shaobo Wang,Yujie Chen,Yafeng Sun,Wenjie Qiu,Zhihui Xie,Sihang Li,Yucheng Li,Huiqiang Jiang,Xingzhang Ren,Xuming Hu,Dayiheng Liu,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Qwen Team, Alibaba Group (通义实验室,阿里巴巴集团); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州) ); The University of Science and Technology of China (中国科学技术大学); Nanjing University (南京大学); The University of Hong Kong (香港大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 21 pages, 7 figures, 7 tables. Abstract shortened to respect the arXiv limit of 1920 characters. Please see the PDF for the full abstract
Abstract:Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms rewardmaximization baselines GRPO and GSPO on math averages, and exceeds LoRASFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
[NLP-106] OProver: A Unified Framework for Agent ic Formal Theorem Proving
【速读】: 该论文试图解决的问题是:当前形式化定理证明(formal theorem proving)中,代理式推理(agentic proving)通常仅在推理阶段使用,而未被整合进训练过程,导致模型难以从失败的证明尝试中学习和改进。解决方案的关键在于提出 OProver,一个统一的框架,将代理式推理深度嵌入到 Lean 4 的训练流程中:通过迭代式的预训练与后训练机制,利用编译器验证的证明(compiler-verified proofs)和 Lean 编译器反馈来修正失败的证明路径;同时构建大规模知识库 OProofs(包含 1.77M 个 Lean 命题、6.86M 条编译器验证的证明及带上下文的修复轨迹),并在每轮迭代中将新生成的可验证证明加入检索记忆体,并以修复轨迹作为监督微调(SFT)数据、未解决难题用于强化学习(RL)。这一设计使模型能够持续从失败中学习,显著提升整体证明成功率。
链接: https://arxiv.org/abs/2605.17283
作者: David Ma,Kaijing Ma,Shawn Guo,Yunfeng Shi,Enduo Zhao,Jiajun Shi,Zhaoxiang Zhang,Gavin Cheung,Jiaheng Liu,Zili Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%) more top placements than any prior open-weight whole-proof prover.
[NLP-107] FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers
【速读】: 该论文试图解决的问题是:当前激活调控(activation steering)方法普遍假设语言模型的激活空间为欧几里得空间(Euclidean),但这一假设在实际中并不成立,导致调控效果受限。解决方案的关键在于提出一种基于模型自身输出行为所诱导的非欧几何结构——即通过软最大层(softmax layer)的费舍尔信息度量(Fisher information metric)沿后续层雅可比矩阵(Jacobian)拉回得到的度量,并由此推导出一个闭式最优调控方向(closed-form optimal direction)。该方法称为FishBack,其核心创新在于无需进行流形拟合或数据驱动的几何估计,即可在每一步迭代中找到最小失真方向,从而显著提升调控精度。实验表明,在GPT-2上,FishBack在多个语法概念和层上均优于传统欧氏基线方法,且性能差距可通过单一谱诊断指标(隐含度量代价与费舍尔最优代价之比)定量预测。
链接: https://arxiv.org/abs/2605.17231
作者: Sihan Wang,Jiayi Zhao
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint. 20 pages, 9 figures, 5 tables
Abstract:Activation steering methods modify intermediate representations of language models to control output behavior, but universally assume the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model’s own output behavior – the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers – deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2–17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that identifies the minimum-distortion direction for any target concept, yielding a closed-form optimal direction at each point that can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a layer-wise recursive decomposition, which reveals that existing methods – CAA, ActAdd, ITI, and others – each implicitly adopt a particular approximate metric, and that their performance gaps are quantitatively predicted by a single spectral diagnostic: the ratio of their implicit metric’s cost to the Fisher-optimal cost. On GPT-2, iterative pullback steering consistently outperforms all Euclidean baselines across three verb-morphology concepts and four layers, with off-target KL reductions of 1.3\times – 2.5\times relative to Euclidean gradient ascent and 1.5\times relative to CAA at matched concept probability.
[NLP-108] Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making
【速读】: 该论文试图解决的问题是:前沿大语言模型(LLMs)在处理包含污名化语言(stigmatizing language, SL)的临床文本时,是否会继承并传播人类偏见,从而影响临床决策的公平性和鲁棒性。解决方案的关键在于系统性评估九个前沿LLM在四种存在污名化的医学条件下对不同强度和类型(怀疑、责备、恶意中伤)的SL的响应,并测试链式思维(Chain-of-Thought, CoT)推理和模型自我去偏等标准提示缓解策略的有效性。研究发现,所有模型均表现出显著偏见,且对语言框架高度敏感——仅一句SL即可改变模型输出,呈现明确的剂量-反应关系;而现有缓解策略效果有限,模型难以显式识别SL但仍受其隐式影响,揭示了当前LLMs在临床自然语言处理中存在严重的公平性与鲁棒性漏洞,亟需建立严格的算法防护机制以防止健康不平等的自动化加剧。
链接: https://arxiv.org/abs/2605.17228
作者: Jen-tse Huang,Didi Zhou,Faith Kamau,Amy Oh,Anne R. Links,Mark Dredze,Mary Catherine Beach,Somnath Saha
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages
Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as clinical decision support and medical documentation. However, the robustness of these models against subtle linguistic variations, specifically stigmatizing language (SL) commonly found in human-authored clinical notes, remains critically under-explored. In this work, we investigate whether frontier LLMs inherit and propagate this human bias when processing clinical text. We systematically evaluate nine frontier LLMs across four stigmatized medical conditions, utilizing clinical vignettes injected with varying intensities and phenotypes of SL (doubt, blame, and maligning). Our results demonstrate that all evaluated models exhibit substantial bias, with clinical decision-making significantly skewed towards less aggressive patient management. Notably, we observe a high sensitivity to linguistic framing, where a single SL sentence is sufficient to alter model outputs, revealing a clear dose-response relationship. Furthermore, we evaluate standard prompt-based mitigation strategies, including Chain-of-Thought (CoT) reasoning and model self-debiasing. These approaches show limited efficacy; models struggle to explicitly identify SL while remaining implicitly influenced by it. Our findings expose a critical vulnerability in current LLMs regarding fairness and robustness in clinical NLP, underscoring the need for rigorous algorithmic guardrails to prevent the automation of health disparities.
[NLP-109] ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在处理化学反应图示时存在的显著能力差距问题,具体表现为两个关键瓶颈:一是“视觉缺陷”(Visual Deficit),即通用视觉编码器难以准确解析密集分子图的拓扑连接关系;二是“语义断层”(Semantic Disconnect),即标准线性字符串表示(如SMILES)无法有效激活模型的潜在化学推理能力。解决方案的关键在于提出化学视觉激活框架(Chemical Visual Activation, ChemVA),其核心包括两个创新机制:首先通过混合粒度检测的“视觉锚定”(Visual Anchor)机制,将功能基团与视觉特征精准对齐;其次采用语义对齐策略,将视觉特征转化为实体名称以最大化LLMs中的知识激活。实验表明,该方法在新构建的OCRD-Bench数据集上实现了92.0%的结构识别准确率,并在9种不同LLMs上平均提升约20个百分点,使开源模型在复杂化学推理任务中达到与专有SOTA系统相当的性能。
链接: https://arxiv.org/abs/2605.17214
作者: Mingyang Rao,Kehua Feng,Zhihui Zhu,Jiangzhen Fu,Hao Yu,Keyan Ding,Huajun Chen
机构: Zhejiang University (浙江大学); Fudan University (复旦大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model’s latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.
[NLP-110] LLM s for automatic annotation of Mandarin narrative transcripts
【速读】: 该论文旨在解决非英语口语语料中复杂话语层面标注(如叙事宏观结构)的劳动密集型与耗时问题,尤其关注大语言模型(LLMs)在跨年龄群体普通话口语叙事中的适用性。其解决方案的关键在于:利用多语言叙事评估工具(MAIN)作为测试基准,系统比较四种LLMs与受训人类标注者在儿童、青年和老年人口语叙事中的标注一致性,结果表明最优模型在叙事语法元素层级结构标注上达到与人类标注者相当的可靠性(κ = 0.794),同时将标注时间减少65%;但本地部署的轻量级模型表现显著较差,且模型对需细微语义区分的宏结构类别及词汇变异性高、语义模糊性强的青年成人叙事存在识别困难。这说明LLMs可在非英语口语语料中有效辅助话语级标注,但仍需人类监督处理语义复杂的任务。
链接: https://arxiv.org/abs/2605.17205
作者: Qingwen Zhao,Hongao Zhu,Yunqi He,Rui Wang,Aijun Huang,Hai Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 28 pages, 9 tables
Abstract:Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure-the hierarchical organization of story grammar elements-in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open sourced for future use.
[NLP-111] PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media ACL2026
【速读】: 该论文试图解决的问题是:如何在多元化的社交媒体社区中实现有效的AI内容审核,这些社区由用户自治、各自制定规范,导致同一内容在不同社区中可能被视为违规或合规。解决方案的关键在于构建一个名为PluRule的多模态、多语言基准测试集,用于评估AI模型在识别1,989个Reddit社区中的13,371条规则违反行为的能力,其任务形式模拟了人类审核员的实际操作——给定评论及其上下文,判断具体违反哪一条规则。研究发现,即使是最先进的视觉-语言模型(如GPT-5.2)在该任务上表现也仅略优于基线,且模型规模和上下文长度的增加带来的收益有限,这表明对多元化社区的内容审核仍是当前语言模型面临的一项根本性挑战。
链接: https://arxiv.org/abs/2605.17187
作者: Zoher Kachwala,Bao Tran Truong,Rasika Muralidharan,Haewoon Kwak,Jisun An,Filippo Menczer
机构: Indiana University (印第安纳大学); TUD Dresden University of Technology (德累斯顿工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to ACL 2026 Main Conference
Abstract:Social media are shifting towards pluralism – community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.
[NLP-112] Why Do Safety Guardrails Degrade Across Languages?
【速读】: 该论文试图解决大语言模型在非英语语言中安全性下降的问题,特别是现有评估方法(如Jailbreak Success Rate, JSR)将多种安全驱动因素混杂在一起,导致难以识别具体的安全失效原因。其解决方案的关键在于提出一种潜在变量模型——多组项目反应理论(Multi-Group Item Response Theory, Multi-Group IRT)框架,能够解耦四个核心安全驱动因素:跨语言的通用安全鲁棒性(θ)、提示内在难度(β)、全球语言处理难度(γ)以及提示特定的跨语言安全差距(τ)。通过在MultiJail数据集上对61种模型配置在10种不同资源水平的语言中进行评估(共190万条样本),研究发现安全性主要呈单维结构,且低资源语言并非普遍更脆弱,反而某些模型在英语中比在低资源语言中更易被攻破;此外,高安全差距(τ)的提示集中于物理伤害类别(如盗窃、武器),且与低资源语言显著相关。该框架在预测安全拒绝行为时达到AUC=0.940,优于简单基线,揭示了聚合指标所掩盖的概念-语言脆弱性,从而支持更公平的跨语言安全评估和针对性的数据构建改进。
链接: https://arxiv.org/abs/2605.17173
作者: Max Zhang,Ameen Patel,Sang T. Truong,Sanmi Koyejo
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ( \theta ), intrinsic prompt hardness ( \beta ), global language processing difficulty ( \gamma ), and a prompt-specific cross-lingual safety gap ( \tau ). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high- \tau prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with \tau , severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to \tau . In predictive validation, the IRT framework achieves \mathrmAUC = 0.940 , outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
[NLP-113] OpenJarvis: Personal AI On Personal Devices
【速读】: 该论文试图解决的问题是:当前个人AI系统(如OpenClaw和Hermes Agent)高度依赖云端前沿大模型,导致敏感本地数据频繁上传,且在用本地模型替换云端模型时性能显著下降(如将Claude Opus 4.6替换为Qwen3.5-9B后准确率下降25–39个百分点)。现有架构将提示词、工具描述、记忆配置等耦合在特定云模型周围,仅能调整提示词,而最先进的提示优化器也无法完全弥合本地与云端模型之间的性能差距。
解决方案的关键在于提出一种分解式个人AI架构OpenJarvis,其核心思想是将系统拆解为五个可独立优化的原语(primitives):智能(Intelligence)、引擎(Engine)、代理(Agents)、工具(Tools)、记忆(Memory)和学习(Learning),并以类型化规范(typed spec)形式表示。通过引入LLM引导的规范搜索(LLM-guided spec search)机制,允许云端前沿模型在搜索阶段提出改进方案,仅保留不降低性能的修改,最终生成可在设备端运行的完整规范,从而实现本地部署下的高精度、低延迟和低成本推理。实验表明,该方法使本地模型在8个基准测试中4项达到或超过云端基线,并平均误差仅比最佳云端结果高出3.2个百分点,同时API成本降低约800倍,端到端延迟减少4倍。
链接: https://arxiv.org/abs/2605.17172
作者: Jon Saad-Falcon,Avanika Narayan,Robby Manihani,Tanvir Bhathal,Herumb Shandilya,Hakki Orhun Akengin,Gabriel Bo,Andrew Park,Matthew Hart,Caia Costello,Chuan Li,Christopher Ré,Azalia Mirhoseini
机构: Stanford Artificial Intelligence Laboratory (SAIL); Stanford HAI; Google DeepMind; Google Research; Google Cloud; Anthropic; Meta; IBM; Microsoft; Intel; NVIDIA; TSMC; ARM; Hitachi; BASF; Accenture; Ericsson; Qualcomm; Analog Devices; Salesforce; Total; Laude Institute; Prime Intellect; HAI-GCP Cloud Credits for Research program; Stanford Data Science Initiative (SDSI); Stanford Marlowe Computing Platform; NSF Graduate Research Fellowship Program; Knights-Hennessy Scholarship; Stanford Graduate Fellowship; JP Morgan AI/ML Fellowship; Stanford EDGE Fellowship; GEM Fellowship; Stanford DAWN project; Stanford SEAMS project
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code: this https URL Website: this https URL
Abstract:Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud-hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5-9B drops accuracy by 25-39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state-of-the-art prompt optimizers close just 5 pp of the local-cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local-cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools Memory, and Learning. Each primitive is an independently editable field, making the stack end-to-end optimizable and measurable against accuracy, cost, and latency. Towards closing the local-cloud gap without surrendering local-model properties, OpenJarvis introduces LLM-guided spec search, a local-cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non-regressing edits are accepted, and the resulting spec runs entirely on-device at inference time. With LLM-guided spec search, on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end-to-end latency by 4x.
[NLP-114] Multilingual and Multimodal LLM s in the Wild: Building for Low-Resource Languages
【速读】: 该论文试图解决多模态大语言模型(Multimodal LLMs)在低资源语言场景下,因数据和计算预算有限而难以实现跨语言、跨模态(视觉、语音、文本)对齐的问题。解决方案的关键在于:提出低成本的数据创建与整理方法、基于适配器(adapter)的三模态对齐架构、面向文化的评估机制以超越英语中心范式,并提供可微调的小型多语言视觉语言模型(VLM)及语音-文本-大语言模型(speech-text-LLM)的端到端流水线实操资源,从而推动多语言多模态AI在资源受限环境下的落地应用。
链接: https://arxiv.org/abs/2605.17152
作者: Firoj Alam,Shammur Absar Chowdhury,Enamul Hoque Prince
机构: Qatar Computing Research Institute, HBKU (卡塔尔计算研究研究所,HBKU); York University (约克大学)
类目: Computation and Language (cs.CL)
备注: Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Low-resources-language
Abstract:Multimodal LLMs are evolving from vision-language to tri-modality that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. The tutorial offers an overview of this emerging research area for multilingual multimodality across text, speech, and vision under limited data/compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), speech-text LLMs. We cover low-cost data creation/curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English and hands on resources for fine-tuning a compact multilingual VLM and wiring a speech-text-LLM pipeline. The content will be delivered as an interactive half-day tutorial, designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.
[NLP-115] UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
【速读】: 该论文旨在解决脑肿瘤诊断中依赖放射科医生对多序列、三维磁共振成像(MRI)进行人工解读所导致的认知负荷大、效率低且难以规模化的问题。当前视觉语言模型(VLMs)虽具备半自动化辅助解读的潜力,但在神经肿瘤学领域尚未得到充分应用,主要受限于缺乏专门用于评估其性能的临床相关基准。为此,作者提出了UCSF-PDGM-VQA数据集,包含473例胶质瘤相关MRI研究中的2,387个问答对,并为六种先进VLMs及一种大语言模型建立了性能基线。关键发现是:现有模型在处理多序列3D MRI时存在显著缺陷,表现为视觉特征被抑制、过度依赖语言先验信息,从而引发模态坍塌(modality collapse),暴露出当前模型在临床环境下的可靠性与安全性不足,亟需开发更鲁棒、面向特定医学领域的VLMs。
链接: https://arxiv.org/abs/2605.17140
作者: Shiv Ghosh,Junayd Lateef,Chih-Hua(Catherine)Liu,Yannan Yu,Andreas M. Rauschecker,Madhumita Sushil
机构: University of California, Berkeley (加州大学伯克利分校); University of California, San Francisco (加州大学旧金山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 6 tables
Abstract:Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark – the UCSF-PDGM-VQA dataset – consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.
[NLP-116] he Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
【速读】: 该论文试图解决的问题是:如何识别语言模型在推理过程中“何时”开始承诺欺骗行为,而非仅将最终输出标记为诚实或欺骗(即把欺骗视为结果属性而非过程特征)。传统方法忽略了模型推理轨迹中的动态变化,导致无法定位欺骗行为的起点。解决方案的关键在于提出反事实定位(counterfactual localization)——对每个推理片段(sentence prefix)固定前缀并重采样后续内容,估算产生欺骗结果的概率,从而精确识别出模型从诚实转向欺骗的决策转折点。作者构建了五个具有明确战略激励机制的环境(如策略性虚张声势、迷宫引导等),使欺骗行为自然涌现且标签由环境状态决定,避免主观判断。实验结果显示,句子级别的欺骗承诺点可被人类评估验证,并发现基于注意力机制的转移特征具有跨环境泛化能力,而词法线索则表现不佳;进一步地,仅需选择少量(<10%)注意力头即可在未见环境中因果抑制欺骗承诺,表明欺骗承诺反映的是可复用的推理动态变化而非表面形式。该研究释放了一个包含约146万句的标注语料库,为深入理解语言模型推理中的承诺机制提供了新工具。
链接: https://arxiv.org/abs/2605.17113
作者: Scott Merrill,Shashank Srivastava
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 41 pages, 25 figures
Abstract:Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model’s reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes \sim 1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.
[NLP-117] HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
【速读】: 该论文试图解决大规模语言模型(LLM)部署中因模型池异构性导致的资源利用率低和成本控制难的问题,尤其是现有路由机制依赖二元强弱判断、参数与具体模型绑定,导致模型变更时需重新训练。解决方案的关键在于提出HyDRA(Hybrid Dynamic Routing Architecture),其核心创新是通过一个基于ModernBERT编码器的多头分类器(K=4独立sigmoid头)对每个查询精细预测推理、代码生成、调试和工具使用等多维能力需求,并采用缺口匹配(shortfall matching)算法从配置定义的模型能力 profile 中选择满足要求且成本最低的模型。该框架完全解耦模型目录,新增或移除模型仅需配置调整而无需重训练,实测在生产环境中推理延迟仅为86ms(CPU),并在多个基准测试中实现显著的成本节约(最高达72.5%)且保持高质量输出,同时首次在LLM路由领域验证了跨脚本族(CJK、欧洲语系等)的语言无关性路由能力。
链接: https://arxiv.org/abs/2605.17106
作者: Aashna Garg,Siddharth Singha Roy,Jinu Jang,Federico Brancasi,Shengyu Fu
机构: Microsoft
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages, preprint v1. Production-telemetry tables and per-language breakdown deferred to v2
Abstract:Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog – adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA’s tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot’s VS Code Chat auto-mode and – to our knowledge for the first time in the LLM routing literature – demonstrates language-invariant routing across CJK, European, and other script families.
[NLP-118] SEMA-RAG : A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning
【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)在医疗问答中因采用单轮静态检索范式而与临床推理的多阶段特性不匹配的问题。其核心缺陷在于:问题到检索查询的转换缺乏临床语义理解,且检索过程缺少迭代反馈机制,难以构建可靠的证据链。解决方案的关键在于对原有单一推理链条进行任务解耦,引入动态多轮探索机制,并设计一个由三个专业代理组成的自进化多智能体框架SEMA-RAG:解释代理(Interpreter Agent)负责临床模式解析,探索代理(Explorer Agent)驱动基于充分性反馈的自演化检索,仲裁代理(Arbiter Agent)则执行证据评判与答案选择。实验表明,SEMA-RAG在五个基准和五种大语言模型(LLM)骨干架构上平均提升基线准确率6.46个百分点。
链接: https://arxiv.org/abs/2605.17101
作者: Yongfeng Huang,Ruiying Chen,James Cheng
机构: The Chinese University of Hong Kong (香港中文大学); Wuhan University of Technology (武汉理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.
[NLP-119] HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation
【速读】: 该论文试图解决的问题是:在将视觉语言模型(Vision-Language Model, VLM)蒸馏到高效混合架构(如3:1 Mamba-2/注意力机制混合结构)时,尽管整体性能保持稳定,但在特定任务(如光学字符识别和文档理解)上出现显著性能下降的问题。其关键解决方案在于识别出性能损失主要源于对高信息密度区域(如文本、边缘等局部细节)的建模不足——这些区域在图像中占比小但对任务至关重要。作者提出一种无需额外参数或推理成本的轻量干预方法HEED(High-density Efficient Distillation),通过使用patch自不相似性作为位置重要性的训练自由代理,将均匀的残差对齐替换为密度加权的残差对齐,从而显著提升细粒度文本理解能力。实验表明,HEED在OCR-Bench v2上提升8.7点,在10个基准平均上提升5.13点,且在不同教师模型与混合架构下均有效,最终实现4.12倍吞吐量提升和68%内存节省的同时达到教师模型水平性能。
链接: https://arxiv.org/abs/2605.17093
作者: Yihao Liang,Niraj K. Jha
机构: Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6 \times larger residual drift than the bottom 10% lowest-density patches and 3.5 \times larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with a 4.12 \times throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.
[NLP-120] ACIL: Auto Chain of Thoughts for In-Context Learning
【速读】: 该论文试图解决的问题是:标准的上下文学习(In-Context Learning, ICL)在需要多步推理的任务上表现不佳,因为其提供的示例通常仅包含输入-输出对,缺乏显式的中间推理步骤,导致模型难以捕捉复杂任务的逻辑链条。解决方案的关键在于提出一种自动链式思维(Automatic Chain-of-Thought, Auto-CoT)框架,通过自动为输入-输出示例生成结构化的推理链(reasoning chains),将这些增强后的推理示例融入提示(prompt)上下文,并采用系统性筛选机制去除无关或低质量示例,从而引导模型进行更可靠、准确的推理。实验表明,该方法显著提升了ICL在多个推理任务上的性能。
链接: https://arxiv.org/abs/2605.17088
作者: Rui Chu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.
[NLP-121] Scale Determines Whether Language Models Organize Representation Geometry for Prediction
【速读】: 该论文试图解决的问题是:语言模型中表征空间的几何结构如何影响其预测能力,以及这种几何结构是否具有功能性组织——即是否专门服务于预测任务。解决方案的关键在于提出一种新的度量方法 Subspace PGA(Subspace Procrustes Analysis),用于检验某一层的表示距离结构是否比随机子空间更接近未嵌入矩阵 $ W_U $ 的读出子空间(readout subspace)。研究发现,中间层的几何结构显著地被组织用于预测(峰值 z 值为 9–24),但这一特性随模型规模而异:小模型(维度 $ d \leq 1024 )在训练后期逐渐丧失该结构,即使损失持续下降;而大模型( d \geq 2048 $)则能保持该结构。作者进一步指出,这种差异源于容量权衡:少数主导方向偏离 $ W_U $ 的读出子空间,掩盖而非破坏了潜在的预测结构,移除这些方向后可恢复对齐。这一现象无法通过谱特征或损失曲线捕捉,表明模型规模不仅决定性能,还决定其表征几何的组织方式以支持预测。
链接: https://arxiv.org/abs/2605.17084
作者: Weilun Xu
机构: École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer’s distance structure aligns with the readout subspace of the unembedding matrix W_U more than with random subspaces of equal size. Across seven Pythia models (70M–6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak z = 9 – 24 ), but the degree is scale-dependent: small models ( d \leq 1024 ) progressively lose it at late layers during training – even as loss keeps improving – while large models ( d \geq 2048 ) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from W_U 's readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.
[NLP-122] Can LLM s Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
【速读】: 该论文试图解决的问题是:当前大语言模型(LLM)在作为“数字消费者”模拟公众意见、预测试营销决策和预测受众反应时,缺乏对真实消费者在公共话语中表现出的具体反应模式的重建能力。现有评估方法多依赖整体偏好评分,难以准确衡量模型对具体反应点的覆盖情况。解决方案的关键在于提出ConsumerSimBench基准,该基准基于1,553个真实的中文社交媒体话题和23,122个经过规则审核的原子级反应标准(涵盖四类反应类型),将任务分解为可审计的“是/否”判断,显著提升了评估一致性(三评委一致率从65.8%提升至92.1%,且与人类多数标签的一致率达98.4%)。实证表明,即使是最强模型Gemini-3.1-Pro也仅能覆盖47.8%的真实反应标准,凸显了技术基准表现与社会语境下消费者直觉之间的巨大差距,同时验证了结构化推理提示效果有限,而生成-反思多智能体流水线可小幅提升性能,强调了未来需从“预测真实公众反应”的角度重构消费模拟任务。
链接: https://arxiv.org/abs/2605.17079
作者: Tianyu Wang,Jiajun Li,Jianghao Lin
机构: Shanghai Jiao Tong University (上海交通大学); Noumena AI (诺曼科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:LLMs are increasingly used as ``digital consumers’’ to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate–reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
[NLP-123] RAG A: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation
【速读】: 该论文旨在解决现有基于大语言模型(LLM)的知识图谱(KG)构建方法在跨片段语义关系捕捉、实体消歧以及构建过程可解释性方面的结构性缺陷,这些问题严重影响了知识图谱的质量、检索精度及在高风险领域部署时的信任度。其解决方案的关键在于提出RAGA(Reading And Graph-building Agent)框架,该框架通过嵌入“读取-搜索-验证-构建”认知约束的ReAct工具循环,实现对知识图谱全生命周期的原子级操作支持;同时引入KG-向量同步机制以实现符号与向量混合检索,并通过证据锚定验证机制确保每个知识条目均可追溯至原始文本,从而提升可审计性和可信度。初步实验表明,RAGA融合检索优于零样本基线,且知识图谱集成显著提升了答案与证据质量。
链接: https://arxiv.org/abs/2605.17072
作者: Chengrui Han,Zesheng Cheng
机构: Qingdao University (青岛大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Existing LLM-driven knowledge graph (KG) construction methods predominantly employ stateless batch processing pipelines, exhibiting structural deficiencies in cross-chunk semantic relation capture, entity disambiguation, and construction process interpretability. These limitations undermine KG quality, retrieval precision, and deployment trust in high-stakes domains. We propose RAGA (Reading And Graph-building Agent), an LLM-based autonomous KG construction and retrieval fusion framework. RAGA provides an atomic toolset supporting full KG lifecycle CRUD operations and embeds a Read-Search-Verify-Construct cognitive constraint into a ReAct tool loop. A KG-vector synchronization mechanism enables hybrid symbolic-vector retrieval, while evidence-anchored verification links every knowledge entry to its source text for auditable provenance. Preliminary experiments on a subset of the QASPER scientific QA dataset indicate that RAGA’s fusion retrieval outperforms zero-shot baselines, with KG integration providing measurable gains in both answer and evidence quality. The framework design and experimental baseline serve as a reference for agent-driven autonomous KG construction. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2605.17072 [cs.AI] (or arXiv:2605.17072v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.17072 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-124] 1GC-7RC: One Graphic Card – Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
【速读】: 该论文试图解决的问题是:当前缺乏一个标准化的基准测试(benchmark)来评估自主AI编码代理在多样领域中从零开始设计、实现和训练机器学习模型的能力。解决方案的关键在于提出 1GC-7RC(Single Graphic Card: Seven Research Challenges),这是一个包含七个跨领域的机器学习任务的基准测试,涵盖语言建模、图像分类、语义分割、图学习、表格预测、时间序列预测和文本分类。每个任务提供锁定的数据预处理与评估脚本,代理仅可修改训练代码,不可访问预训练权重(语义分割任务除外)、无互联网访问权限,并需在单张GPU上于规定时间内(40–120分钟)完成任务。该设计通过严格的约束条件揭示了不同代理在隐含机器学习知识、规划能力和时间预算管理上的显著差异,同时具备模块化结构以支持未来扩展与多代理研究。
链接: https://arxiv.org/abs/2605.17046
作者: Robin-Nico Kampa,Fabian Deuser,Anna Bößendörfer,Konrad Habel,Norbert Oswald
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce 1GC-7RC (Single Graphic Card: Seven Research Challenges), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at this https URL to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.
[NLP-125] D2Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning ICML2026
【速读】: 该论文试图解决强化学习(Reinforcement Learning, RL)在增强大语言模型(Large Language Models, LLMs)推理能力过程中面临的两个核心问题:有效数据稀缺性(Effective Data Scarcity)和难度动态漂移(Dynamic Difficulty Shifts),即中等难度训练样本稀少且随着模型性能提升逐渐变得简单。现有方法虽通过生成训练样本来缓解稀缺问题,但存在无锚点生成(anchor-free generation)、忽略模型与数据的协同进化(co-evolution)以及难度不匹配(difficulty mismatch)等缺陷。解决方案的关键在于提出 D² Evo——一种双难度感知的自进化强化学习框架,在每轮迭代中基于当前求解器(Solver)的能力挖掘中等难度锚点,训练提问者(Questioner)生成难度适配、多样化的题目,并联合优化两者以实现推理能力的渐进式提升。实验表明,D² Evo 在数学推理基准上仅用不到 2K 真实数学样本即超越现有方法,并在通用推理任务中展现出强泛化能力。
链接: https://arxiv.org/abs/2605.17037
作者: Ru Zhang,Renda Li,Ziyu Ma,Weijie Qiu,Chongyang Tao,Yong Wang,Xiangxiang Chu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ICML 2026. First two authors contributed equally
Abstract:Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D ^2 Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver’s capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D ^2 Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.
[NLP-126] PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts NEURIPS2026
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在医疗、法律和科学等高风险场景中因幻觉(hallucination)导致错误输出的问题,旨在评估现有 hallucination 检测方法的真实有效性。解决方案的关键在于识别并消除基准数据集中存在的构造伪影(benchmark construction artifacts),特别是那些将真实答案直接嵌入输入提示(prompt)的语料库。作者通过大规模实验发现,许多先前报告的高检测性能实际上源于这些人为设计的漏洞,而非模型本身具备真正的检测能力;在控制这些伪影后,大多数基线方法表现接近随机水平,仅有两种监督探针——SAPLMA 和 DRIFT(一种基于层间隐藏状态转移的监督探针)——在受控条件下仍保持稳定性能,表明未来研究应聚焦于对高层隐藏状态的监督建模以提升检测可靠性。
链接: https://arxiv.org/abs/2605.17028
作者: Khizar Hussain,Murat Kantarcioglu
机构: Virginia Tech (弗吉尼亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint to Neurips 2026 submission
Abstract:Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call \textscTxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce \textbfDRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field’s reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.
[NLP-127] he IsalProgram Programming Language
【速读】: 该论文试图解决的问题是:如何设计一种具有形式化可证明性质的编程语言,使其既具备简洁性和理论完备性,又能支持高效程序合成与探索。解决方案的关键在于提出 IsalProgram —— 一种类汇编语言,其核心创新在于:(1) 作为形式语言理论中的正则语言(regular language),其语法可由有限自动机接受;(2) 所有来自指令字母表的有限字符串均为语法合法程序,极大简化了生成空间;(3) 不依赖内存地址或变量名,仅通过三个数据指针操作一个循环双向链表(CDLL)实现数据管理,控制流由两个代码指针驱动。这一设计使得程序空间具有良好的结构特性,便于基于编辑距离(如 Levenshtein 距离)进行度量探索,并为神经程序合成、计算能力分析和复杂性研究提供了坚实基础。
链接: https://arxiv.org/abs/2605.17008
作者: Ezequiel López-Rubio
机构: University of Málaga (马拉加大学); ITIS Software (ITIS软件)
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We introduce IsalProgram (Instruction Set and Language for Programming), a novel assembly-like programming language with three distinctive theoretical properties: (1) it is a regular language in the sense of formal language theory, meaning its programs are accepted by a finite automaton; (2) every finite string over the instruction alphabet is a syntactically valid program; and (3) it makes no explicit use of memory addresses or variable names, absolute or relative. Programs are finite sequences of tokens drawn from a fixed instruction set, and are executed on a virtual machine whose sole data structure is a circular doubly linked list (CDLL) navigated by three data pointers, with control flow governed by two code pointers. We give a complete formal definition of the language and its virtual machine, prove its regularity, and demonstrate its expressive power. We further discuss IsalProgram’s potential advantages as a target language for neural program synthesis, the amenability of its program space to metric-based exploration via the Levenshtein edit distance, and directions for analyzing computability and complexity within this framework.
[NLP-128] HalluScore: Large Language Model Hallucination Question Answering Benchmark
【速读】: 该论文试图解决的问题是:当前大型语言模型(LLMs)在阿拉伯语场景下存在显著的幻觉(hallucination)现象,而现有评估基准主要集中在英语和中文,缺乏针对阿拉伯语的语言特性、文化背景和推理复杂性的系统性评测工具。解决方案的关键在于构建一个结构化的阿拉伯语问答基准——HalluScore,其核心创新包括:(1)涵盖不同推理难度、知识领域、历史时间线及文化语境的827个精心筛选的问题;(2)通过质量保障、事实有效性过滤和模型驱动的选择机制确保问题能稳定诱发幻觉;(3)每个问题均附带验证过的证据、答案解释和多标签标注,支持对幻觉类型(完全、部分或非幻觉)的细粒度分析;(4)基于该基准对17个阿拉伯语、多语言及推理型LLMs进行实证分析,揭示阿拉伯语LLM幻觉不仅限于事实错误,还涉及文化理解偏差、语言推理挑战与逻辑一致性问题。这一工作填补了阿拉伯语幻觉评测的空白,并为提升阿拉伯语LLMs的可靠性与文化适应性提供了关键资源。
链接: https://arxiv.org/abs/2605.17007
作者: Aisha Alansari,Hamzah Luqman
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with limited benchmarks for LLMs hallucination due to scarce annotated resources and the language’s morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.
[NLP-129] Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否能够可靠地表达类人的人格特质,还是仅能通过表面线索进行模仿而缺乏稳定的心理学人格结构。解决方案的关键在于:通过在与目标大五人格(Big Five personality)特征匹配的长篇作文上对LLMs进行微调(SFT、DPO或ORPO),从而诱导其人格特征,并使用IPIP-NEO问卷评估诱导人格的稳定性与准确性。研究发现,微调显著降低了问卷响应的方差,缓解了预训练模型中评估脆弱性问题;但与此同时,即使单个特质得分有所提升,整体五维人格轮廓的准确率仍接近随机水平,揭示出无引导文本缺乏足够的人格线索来实现忠实表达。因此,作者主张采用场景驱动的数据集或交互式 elicitation 方法,以积累与测试目标一致的长期证据。
链接: https://arxiv.org/abs/2605.16996
作者: Prateek Rajput,Yewei Song,Iyiola E. Olatunji,Jacques Klein,Tegawendé F. Bissyandé
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 8 main pages, 5 figures, 4 main page figures
Abstract:Can large language models reliably express a human-like personality, or are they merely mimicking surface cues without a stable underlying profile? To investigate this, we induce personality in LLMs by fine-tuning them on the long-form essays, where each essay is associated with a target Big Five personality profile. We then evaluate the stability and fidelity of the induced personality using the IPIP-NEO questionnaire. Specifically, we ask: (i) does post-training (SFT, DPO, ORPO) stabilize questionnaire scores under prompt rephrasings, and (ii) can it induce target Big Five profiles from unguided essays? Our results demonstrate that fine-tuning consistently reduces variance in questionnaire responses across five models, directly mitigating the evaluation fragility reported in pre-trained models. However, this newfound stability reveals a more fundamental limitation: accuracy on the full five-dimensional profile remains near chance, even when single-trait scores improve. This indicates that unguided essays lack the cues needed for faithful personality expression. We therefore argue for scenario-grounded datasets or interactive elicitation that accumulates test-aligned evidence over time.
[NLP-130] Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning
【速读】: 该论文试图解决的问题是:在阅读理解多项选择题中,如何在不依赖答题数据的情况下建模题目难度(item difficulty),因为此类题目的难度通常取决于文本各组成部分之间的推理需求。传统方法通过人工提取文本特征并输入到独立的统计或机器学习模型中,但这一过程会丢失信息且效率较低。解决方案的关键在于采用端到端微调Transformer编码器的方法,直接从题目文字内容中学习难度特征,从而避免手动特征工程和预处理带来的信息损失。此外,论文提出两种改进方案:一种是对文本组件分别编码的分组件变体,另一种是在共享编码器上引入辅助多项选择题问答任务的多任务变体。实验表明,联合编码策略在小样本场景下表现优异,尤其是多任务变体能显著提升性能,证明了Transformer微调结合正则化辅助任务可有效恢复典型测量场景下的文本可推导信号,为心理测量学驱动的扩展提供了灵活接口。
链接: https://arxiv.org/abs/2605.16991
作者: Jan Netík,Patrícia Martinková
机构: Faculty of Education, Charles University, Prague, Czech Republic; Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.
[NLP-131] Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Model, LLM)代理在测试阶段执行新任务时,受限于静态技能库(static skill library)的通用性,难以获得足够具体的指导。为此,作者提出了一种名为SkillTTA(Test-Time Adaptive Skill Synthesis)的方法,其核心在于:通过检索与当前任务相关的少量训练轨迹,并将这些轨迹合成生成一个临时的、任务特定的文本技能(textual skill),从而为固定参数的求解器模型提供更具针对性的上下文引导,而非依赖参数更新进行适应。实验表明,该方法在SpreadsheetBench、ALFWorld和BigCodeBench上显著优于静态技能合成方法,尤其在任务特定技能指导下性能提升明显,且利用失败轨迹能有效识别评估者层面的常见错误,体现出更强的泛化能力和适应性。
链接: https://arxiv.org/abs/2605.16986
作者: Jingxing Wang,Chenyu Zhou,Zhihui Fu,Jun Wang,Weiwen Liu,Weinan Zhang,Jianghao Lin
机构: Shanghai Jiao Tong University (上海交通大学); OPPO Research Institute (OPPO研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures
Abstract:LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose \emphSkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top- k retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.
[NLP-132] Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM -Based Multilingual Coreference Resolution
【速读】: 该论文试图解决的是跨语言、跨文档长度和跨标注规范下的指代消解(coreference resolution)问题,旨在提升大语言模型(LLM)在复杂多变的真实场景中的泛化能力。解决方案的关键在于:首先采用两阶段微调策略,先通过多语言基础适配器(multilingual base adapter)进行通用知识迁移,再引入数据集特定适配器(dataset-specific adapters)以适应不同任务的细粒度特征;其次,使用XML-inspired格式表示提及跨度(mention spans),并结合局部重索引机制增强结构表达能力,同时采用迭代式标注方式优化文档级一致性。这些设计共同提升了模型在多种语言和不同文档长度下的性能表现,最终在CRAC 2026共享任务中取得领先成绩。
链接: https://arxiv.org/abs/2605.16984
作者: Antoine Bourgois,Olga Seminck,Thierry Poibeau
机构: Lattice (CNRS UMR 8094 ENS-PSL Université Sorbonne Nouvelle)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track, and third overall. Our system is based on the Gemma-3-27b model, fine-tuned using a two-stage strategy with a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword using an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.
[NLP-133] How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
【速读】: 该论文试图解决的问题是:人类如何识别或被生成式AI(Generative AI)产生的幻觉误导,其背后的认知机制尚不明确。解决方案的关键在于通过记录27名参与者在执行图像描述验证任务时的脑电图(EEG)信号,利用平均事件相关电位(ERP)分析揭示大脑在处理幻觉内容与非幻觉内容时的不同神经动力学特征。研究发现,语义整合、推理加工、记忆检索和认知负荷等多重认知过程在面对幻觉内容时表现出显著差异;更重要的是,被错误判断的幻觉与正确判断的幻觉在神经响应上存在显著区别,表明错误判断的幻觉未能激活标准的神经认知事实验证通路。
链接: https://arxiv.org/abs/2605.16953
作者: Shuqi Zhu,Yi Zhong,Ziyi Ye,Bangde Du,Yujia Zhou,Qingyao Ai,Yiqun Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans’ neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.
[NLP-134] Roll Out and Roll Back: Diffusion LLM s are Their Own Efficiency Teachers
【速读】: 该论文试图解决扩散大语言模型(Diffusion Large Language Models, DLLMs)在开源场景下面临的质量-速度权衡问题:现有方法通过并行生成多个token以加速解码,但常导致显著的质量下降。其核心原因是训练与推理之间的不匹配,这种不匹配因不可逆的解码过程而被放大——训练时从随机损坏状态重建token,而高效推理需要自适应的去噪顺序,即先生成较易预测的token,延迟依赖上下文的token。解决方案的关键在于两个互补策略:一是提出无需训练的推理端方法Wide-In, Narrow-Out (WINO),使并行解码具备可撤销性,通过激进地草拟多个token、利用全局上下文验证并重新掩码不可靠token进行后续优化;二是引入训练端扩展WINO+,将WINO发现的可靠去噪轨迹注入模型参数,使训练过程对齐高效推理需求。实验表明,WINO在GSM8K上将准确率从73.24%提升至75.82%,推理步数减少6.10倍,而WINO+进一步提升至76.58%,步数减少6.83倍,同时在Flickr30K上实现16.22倍步数缩减且CIDEr指标提升,证明DLLMs可通过可撤销解码自我教学,从而实现高质量与高效率的统一。
链接: https://arxiv.org/abs/2605.16941
作者: Fanqin Zeng,Feng Hong,Geng Yu,Huangjie Zheng,Xiaofeng Cao,Ya Zhang,Bo Han,Yanfeng Wang,Jiangchao Yao
机构: Shanghai Jiao Tong University (上海交通大学); Apple MLR (苹果机器学习研究); Tongji University (同济大学); Hong Kong Baptist University (香港浸会大学); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at this https URL.
[NLP-135] Effort as Ceiling Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
【速读】: 该论文试图解决的问题是:大型推理模型(Large Reasoning Models, LRMs)生成的思维链(chain-of-thought)轨迹长度是否真正反映了人类认知任务中的计算结构,还是仅仅表现为表面冗余。解决方案的关键在于通过控制推理时的努力水平(effort levels),在不同模型规模(GPT-OSS-20B 和 GPT-OSS-120B)和六种推理任务中系统测试这种对齐关系。结果发现,无论任务内还是跨任务,对齐程度保持不变,且贝叶斯因子支持零假设,表明这种对齐并非由推理时的动态资源分配驱动,而是训练阶段形成的固化策略;此外,算术复杂度分析显示 token 分配能精细匹配人类难度模式,且模型规模提升进一步增强对齐效果,从而支持“编译式”(compiled)而非“在线式”(online)的 LRM 问题求解机制。
链接: https://arxiv.org/abs/2605.16938
作者: Yueqing Hu,Tianhong Wang
机构: Institute of Neuroscience, Chinese Academy of Sciences (中国科学院神经科学研究所); School of Philosophy, Anhui University (安徽大学哲学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 8 pages, 6 figures
Abstract:Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.
[NLP-136] Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文推理中因全注意力机制(full attention)带来的二次计算复杂度问题。现有高效替代方案通常依赖于原生稀疏训练或启发式Token剔除策略,导致效率、训练成本与精度之间存在权衡。其解决方案的关键在于发现并利用模型内在的稀疏性:首先,仅少数注意力头需要处理完整长上下文;其次,长距离信息检索可由低维子空间(16维索引器)高效完成;最后,有效Token预算具有强查询依赖性,动态Top-p选择优于固定Top-k稀疏化。基于此,作者提出RTPurbo方法,仅对检索头保留完整的键值缓存(KV cache),并引入轻量级Token索引器实现稀疏注意力。该方法通过极少的训练步数(数百步)即可实现高稀疏性,在长上下文基准和推理任务中保持近乎无损的准确性,并带来显著效率提升——在1M上下文长度下预填充速度最高提升9.36倍,解码速度提升约2.01倍,且无需昂贵的原生稀疏预训练。
链接: https://arxiv.org/abs/2605.16928
作者: Yanke Zhou,Yiduo Li,Hanlin Tang,Maohua Li,Kan Liu,Lan Tao,Lin Qu,Yuan Yao,Xiaoxing Ma
机构: Nanjing University (南京大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures
Abstract:Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top- p selection more suitable than fixed top- k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model’s intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36 \times prefill speedup at 1M context and about a 2.01 \times decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
[NLP-137] JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR
【速读】: 该论文旨在解决大规模关键词词典在上下文感知自动语音识别(Contextual ASR)中引入噪声、导致识别准确率下降的问题。现有动态过滤方法依赖基础ASR模型生成候选词,再通过语义文本检索器筛选相关关键词,但在中文场景下效果不佳,因基础模型常产生同音或近音错误,虽保留目标词的发音特征却严重扭曲语义,使标准语义检索失效。论文提出一种融合语义(Semantic)、拼音(Pinyin)和字形(Glyph)特征的联合过滤框架(JSPG),其中拼音基于发音相似性精准召回目标词,字形提供结构信息以过滤大量同音异义词;为实现字符级特征与序列级过滤之间的映射,进一步设计了一种扩展的Smith-Waterman算法,用于计算N-best假设序列与关键词间的相似度得分。实验表明,JSPG显著优于单一特征基线,在Aishell-1和RWCS-NER数据集上提升关键词识别准确率,并有效增强下游上下文ASR模型性能。
链接: https://arxiv.org/abs/2605.16896
作者: Shilin Zhou,Zhenghua Li
机构: Soochow University (苏州大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.
[NLP-138] he Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence
【速读】: 该论文试图解决的问题是:当前基于大语言模型(LLM)的端到端交易代理(end-to-end LLM trading agents)在公开基准测试中报告的高额夏普比率(Sharpe ratio)缺乏结构有效性验证,导致学术界与工业界对这些模型实际部署能力的判断存在严重偏差。解决方案的关键在于提出一套最小化报告协议(P1–P6),要求模型必须通过包括时间完整性、现实摩擦、反事实鲁棒性、预测校准、数值执行和多智能体解耦在内的多项结构性检验,才能被视为具备可部署的交易能力;同时,论文还建议采用一种保守的模块化替代架构,将LLM作为上游可审计的信息接口,而非直接决策引擎,从而隔离模型偏见并提升系统透明度与可信度。
链接: https://arxiv.org/abs/2605.16895
作者: Yuxuan Ye,Jun Han,Ao Hu,Juncheng Bu,Yiyi Chen,Liangjian Wen,Danilo Mandic,Danny Dongning Sun,Xu Yinghui,Zenglin Xu
机构: Fudan University (复旦大学); Shanghai University of Finance and Economics (上海财经大学); Southwest University of Finance and Economics (西南财经大学); Northeastern University (东北大学); Imperial College London (帝国理工学院); Peng Cheng Laboratory (鹏城实验室)
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:End-to-end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG-Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading-task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia–industry divide. We take a position on that gap: reported alpha from end-to-end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real-world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi-agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short-window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1–P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: \urlthis https URL.
[NLP-139] DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios
【速读】: 该论文试图解决自动驾驶车辆在安全关键环境中进行细粒度、空间定位的风险评估问题,即当前零样本多模态大语言模型(Multimodal Large Language Models, MLLMs)在此类任务上表现不足的问题。解决方案的关键在于提出DriveSafe框架,其核心创新是利用结构化的自然语言描述来增强场景理解:首先生成融合运动、空间和深度等多模态信息的时空锚定字幕(spatially grounded captions),再基于这些字幕进行下游风险识别与安全建议生成;同时引入字幕-风险配对数据微调一个轻量级适配模块(adapter module),高效注入领域特定知识至基础大语言模型(LLM)。该方法通过显式语言表征驱动风险判断,在DRAMA基准测试中实现最先进的性能,验证了其有效性。
链接: https://arxiv.org/abs/2605.16892
作者: Sainithin Artham,Shankar Gangisetty,Avijit Dasgupta,C. V. Jawahar
机构: IIIT-Hyderabad (印度国际信息技术学院-海得拉巴)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages
Abstract:Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: this https URL research/projects/cvit-projects/drivesafe
[NLP-140] E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring
【速读】: 该论文试图解决在低资源环境下,如何高效部署多个专家模型合并后的量化问题。具体而言,现有方法直接对合并后的模型应用后训练量化(PTQ)时效果不可靠,因为存在两类耦合偏差:由低比特重建引入的量化偏差和由模型合并继承的专家相对偏差。解决方案的关键在于提出E-PMQ(Expert-guided Post-Merge Quantization)框架,其核心机制包括两个方面:一是利用源专家权重在层级校准过程中提供专家引导的输出目标,以缓解合并偏差;二是通过合并权重锚定(merged-weight anchoring)稳定校准过程并保留合并模型的整体行为。实验表明,E-PMQ显著提升了4-bit及更低比特下的性能,在CLIP-ViT和FLAN-T5等多任务场景中均实现大幅改进,验证了其在低比特部署中的有效性。
链接: https://arxiv.org/abs/2605.16882
作者: Wenjun Wang,Yanggan Gu,Shuo Cai,Yuanyi Wang,Pengkai Wang,Jianmin Wu,Hongxia Yang
机构: The Hong Kong Polytechnic University (香港理工大学); PolyU-Daya Bay Technology and Innovation Research Institute (PolyU-大亚湾科技与创新研究院); InfiX.ai (InfiX.ai)
类目: Computation and Language (cs.CL)
备注:
Abstract:Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses source expert weights to provide expert- guided output targets during layer-wise calibration, together with merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. On harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5- base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.
[NLP-141] PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks
【速读】: 该论文试图解决的问题是:当前机器翻译评估基准(如BLEU等)难以适配古典文本的多义性和多样性,因为这些文本常存在多种忠实且合理的译法,而标准基准仅以单一参考译文为标准,无法准确反映模型在处理此类文本时的真实性能。解决方案的关键在于提出PaliBench——一个面向巴利语到英文翻译的多参考基准,并提供一套可复用的方法论用于构建其他古典语言的多参考翻译评估体系。该方法整合了LLM辅助对齐独立译本、自动化校验、段落级质量过滤、公式化重复去重及多指标综合评价,最终形成包含1700个段落、约34.5万词元的高质量基准数据集,成功实现了对十种大语言模型的系统性评估,同时验证了其在处理具有解释空间的经典文本时的稳健性和可靠性差异。
链接: https://arxiv.org/abs/2605.16881
作者: Máté Metzger,Nadnapang Phophichit
机构: Mahachulalongkornrajavidyalaya University (玛哈柴伦隆皇家大学); International Buddhist Studies College (国际佛教研究学院)
类目: Computation and Language (cs.CL)
备注: Preprint. This manuscript has not yet been peer reviewed
Abstract:Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.
[NLP-142] MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
【速读】: 该论文试图解决的是监督微调(Supervised Fine-Tuning, SFT)在向语言模型注入新知识时导致的预训练能力退化问题,尤其是推理能力和通用领域性能的下降。其关键解决方案是提出一种无需外部教师的分布对齐方法 MixSD(Mix-based Self-Distillation),该方法通过动态构造监督信号来保持与基础模型自回归分布的一致性:具体而言,MixSD 利用模型自身生成的两个条件分布——一个包含注入事实的“专家条件”和一个反映原始先验的“朴素条件”——混合生成监督序列,从而在保留事实学习信号的同时显著提升与基线模型分布的契合度。实验表明,MixSD 在多个模型规模和任务场景下均优于标准 SFT 和基于策略的自蒸馏方法,在几乎不损失原生能力的前提下实现接近完美的训练准确率,且能降低负对数似然(NLL)并减少沿 Fisher 敏感方向的有害参数移动,验证了“将监督信号对齐至模型本征生成分布”这一原则在缓解灾难性遗忘中的有效性。
链接: https://arxiv.org/abs/2605.16865
作者: Jiarui Liu,Lechen Zhang,Yongjin Yang,Yinghui He,Yingheng Wang,Weihao Xuan,Zhijing Jin,Mona Diab
机构: Carnegie Mellon University; University of Toronto; Vector Institute; University of Illinois Urbana-Champaign; Princeton University; Cornell University; The University of Tokyo; RIKEN AIP; Max Planck Institute for Intelligent Systems, Tübingen, Germany; EuroSafeAI
类目: Computation and Language (cs.CL)
备注:
Abstract:Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model’s autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model’s original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model’s distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model’s held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model’s native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
[NLP-143] hinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction
【速读】: 该论文试图解决的问题是:当前视觉语言模型(Vision-Language Models, VLMs)在面对复杂视觉输入时,难以从原始视觉信息中进行有效规划,其根本瓶颈在于感知能力受限于单步感知(one-step perception)的局限性。解决方案的关键在于将“图像思维”(Thinking with Images, TWI)作为构建和迭代更新内部世界模型(internal world model)的工具,通过逐步获取局部视觉证据来增强VLM的规划能力;进一步地,为提升效率,提出“模式推理”(Pattern Inference)策略,使模型能够主动识别新任务中的已知视觉模式,并直接推断局部世界模型结构;同时引入“模式诱导”(Pattern Induction)机制,以在线归纳学习方式自动发现并优化可复用的视觉模式专家(visual pattern experts),从而实现高准确率与低计算开销之间的平衡。
链接: https://arxiv.org/abs/2605.16848
作者: Yichang Jian,Boyuan Xiao,Zhenyuan Huang,Yifei Peng,Yao-Xiang Ding
机构: Zhejiang University (浙江大学); State Key Lab of CAD CG (CAD/CAM国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.
[NLP-144] RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis
【速读】: 该论文试图解决的问题是:印度《信息自由法》(Right to Information Act, 2005)赋予公民向公共机构申请获取信息的权利,但实践中普通民众难以理解中央信息委员会(Central Information Commission, CIC)裁决中使用的复杂行政语言,更无法判断其上诉是否值得提起。为应对这一挑战,论文提出并构建了RTI-Bench——一个结构化的CIC裁决数据集,包含结果标签、豁免条款引用、IRAC(Issue-Rule-Application-Conclusion)式推理组件及程序时间线。该数据集的关键创新在于首次公开发布面向印度信息自由行政裁决的结构化标注资源,涵盖1,218条来自指令-响应语料库的数据和298份从CIC官网直接收集的PDF裁决(覆盖五位委员及三种文档格式版本,时间跨度为2023至2026年),并实现了高达95.3%的手动验证标签精度。实验表明,基于Mistral 7B模型的零样本基线在结果预测任务上达到57.3%准确率与37.0%宏F1分数,显著优于多数类基线(14.3%宏F1),证明该数据集具备支撑生成式AI辅助决策分析的潜力。
链接: https://arxiv.org/abs/2605.16843
作者: Joy Bose
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 tables
Abstract:India’s Right to Information Act, 2005 gives every citizen the right to demand information from public authorities, yet in practice most people cannot make sense of the dense administrative language used in Central Information Commission (CIC) decisions, let alone predict whether an appeal is worth filing. This paper introduces RTI-Bench, a structured dataset of CIC decisions with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines. To the best of our knowledge it is the first publicly released structured dataset for Indian RTI administrative decisions. The dataset draws from two sources: 1,218 cases from a publicly available instruction-response corpus (with structured fields added through rule-based extraction), and 298 CIC decision PDFs collected directly from the Commission portal, spanning five commissioners and three document format generations from 2023 to 2026. Label coverage reaches 89% on the instruction-response corpus. For the PDF subset of 239 primary decisions, coverage is 51% in this first release. A random sample of 50 labelled cases was manually reviewed, yielding a label precision of 95.3%. A zero-shot Mistral 7B baseline on 100 cases gives 57.3% accuracy and 37.0% macro-F1 on outcome prediction, well above the majority-class baseline of 14.3% macro-F1. RTI-Bench is available at this https URL
[NLP-145] CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection
【速读】: 该论文旨在解决分块预填充(chunked prefill)场景下高效注意力计算的难题,特别是现有稀疏注意力方法在该场景中效率低下或引入额外开销的问题。其核心挑战在于:传统块稀疏内核在查询长度受限于分块大小时效率下降,而细粒度模式搜索在重复访问累积的键值缓存(KV cache)时成本高昂;同时,近期方法QUOKA虽针对此场景优化,但依赖查询子采样导致关键KV信息丢失,并引入显式KV复制开销。解决方案的关键是提出CompactAttention机制,它基于块联合KV选择(Block-Union KV Selection),将二维块稀疏掩码视为KV选择信号而非直接的稀疏内核执行计划,并通过Q块并集与组内并集操作生成最小化的、符合分页执行约束的每组KV块表,从而实现无显式KV压缩即可原位访问选中的KV块,显著提升效率并保持精度。在LLaMA-3.1-8B-Instruct模型上,CompactAttention在128K上下文长度下实现了高达2.72倍的注意力加速,且在RULER基准测试中保持接近稠密注意力的准确性。
链接: https://arxiv.org/abs/2605.16839
作者: Jiwon Song,Dongwon Jo,Beomseok Kang,Jae-Joon Kim
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72 \times attention speedup at 128K context length under chunked prefill.
[NLP-146] Constrained Code Generation with Discrete Diffusion
【速读】: 该论文试图解决的问题是:如何在代码生成过程中有效约束程序的功能正确性、安全性与语法合法性,同时避免传统方法中依赖大量修正计算或训练过程的局限性。解决方案的关键在于提出一种无需训练的神经符号推理框架 Constrained Diffusion for Code (CDC),其核心创新是将约束满足机制直接嵌入到离散扩散模型的反向去噪过程中——通过引入基于数学优化与程序分析的约束感知去噪算子,识别中间程序状态中与约束相关的区域,并局部调整去噪轨迹,从而在不偏离原始模型分布的前提下引导生成更可行的程序。该方法显著提升了各类约束下的满足率,且所需修正计算更少、修改更集中。
链接: https://arxiv.org/abs/2605.16829
作者: Lize Shao,Michael Cardei,Zichen Xie,Ferdinando Fioretto,Wenxi Wang
机构: University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:
Abstract:Discrete diffusion models are a powerful, emerging paradigm for code generation. They construct programs through iterative refinement of partially corrupted token sequences and enable parallel token refinement. Importantly, this paradigm exposes a global program state at each denoising step, which provides a natural intervention point for enforcing program-level functionality and security constraints, guiding the generation before the final code is committed. Building on this observation, the paper introduces Constrained Diffusion for Code (CDC), a training-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process. CDC augments the base discrete diffusion sampler with constraint-aware denoising operators that combine mathematical optimization with program analysis to identify constraint-relevant regions of the intermediate program state and locally adjust the denoising trajectory, steering generation toward feasible programs while remaining close to the base model. Across code generation benchmarks, CDC consistently improves constraint satisfaction in functional correctness, security, and even syntax, outperforming discrete diffusion and autoregressive baselines with less corrective computation and more localized edits.
[NLP-147] Decoupling KL and Trajectories: A Unified Perspective for SFT DAgger Offline RL and OPD in LLM Distillation
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)后训练阶段知识蒸馏(Knowledge Distillation)设计空间不清晰的问题,尤其是在与强化学习(Reinforcement Learning, RL)结合时缺乏理论指导和实践优化方法。其解决方案的关键在于揭示并解耦传统蒸馏方法中隐含的两个正交维度:前缀来源(prefix source)和token级KL散度方向(token-level KL direction),从而构建出四个独立有效的蒸馏目标。通过梯度层面的数学分析,作者证明了前向KL对应监督微调(SFT)风格的目标,而反向KL则等价于强化学习风格的策略梯度目标,将不同蒸馏范式统一到同一理论框架下。基于此,论文进一步提出KL混合(KL mixing)与熵门控长度课程(entropy-gated length curriculum)两种实用策略,有效平衡了准确性、多样性、计算效率与RL行为的一致性,显著提升了数学推理任务中的性能表现。
链接: https://arxiv.org/abs/2605.16826
作者: Anhao Zhao,Haoran Xin,Yingqi Fan,Junlong Tong,Wenjie Li,Xiaoyu Shen
机构: Eastern Institute of Technology, Ningbo; The Hong Kong Polytechnic University; Shanghai Jiao Tong University; The Hong Kong University of Science and Technology (Guangzhou)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code available at this https URL
Abstract:Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
[NLP-148] Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在推理过程中产生的token级置信度轨迹(confidence trajectories)是否与最终答案的正确性相关,以及能否利用这些轨迹来提升推理结果的准确性。其解决方案的关键在于发现并利用置信度轨迹中蕴含的与内容无关的置信度几何结构(content-agnostic confidence geometry),该结构能够以低维表示区分正确与错误的推理路径。研究进一步表明,正确性信息主要集中在推理尾部的置信度动态变化中,并据此提出轻量级评估器NeuralConf,通过学习置信度轨迹来预测推理正确性,在固定生成轨迹预算下优于多数投票、尾部置信度等静态基线方法,从而实现基于模型自身置信动态的推理优化。
链接: https://arxiv.org/abs/2605.16824
作者: Shuo Liu,Ding Liu,Shi-Ju Ran
机构: Tiangong University (天工大学); Capital Normal University (首都师范大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 9 figures, 1 table. Code is available at this https URL
Abstract:Large language models (LLMs) generate not only reasoning text, but also token-level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies–Bouldin index, consistently corresponds to higher correctness-discrimination AUC. We further show that correctness-related information is enriched in the tail of reasoning, suggesting that late-stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.
[NLP-149] Agent KernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
【速读】: 该论文试图解决的问题是:当前AI编码代理在GPU内核优化任务中缺乏系统性评估,尤其是现有基准测试仅关注单次大语言模型(LLM)调用,而非完整的代理工作流,并且未涵盖内核到内核的优化以及对未见配置的泛化能力测试。解决方案的关键在于提出AgentKernelArena——一个开源基准测试框架,用于衡量AI编码代理在GPU内核优化中的完整工作流表现。该框架包含196个任务,覆盖HIP-to-HIP、Triton-to-Triton及PyTorch-to-HIP三种优化场景,通过隔离工作空间、受控编译与性能检查、正确性验证和集中评分机制,尤其引入了“未见配置泛化协议”,以评估优化是否能迁移到代理从未见过的输入配置上。实验表明,尽管多数代理在编译成功率和正确性方面表现良好,但在PyTorch-to-HIP任务中存在显著的正确性下降,说明代理从零生成内核时常隐含特定形状假设,而HIP-to-HIP和Triton-to-Triton优化则具有较好的泛化能力。
链接: https://arxiv.org/abs/2605.16819
作者: Sharareh Younesian,Wenwen Ouyang,Sina Rafati,Mehdi Rezagholizadeh,Sharon Zhou,Ji Liu,Yue Liu,Yuchen Yang,Hao Li,Ziqiong Liu,Dong Li,Vikram Appia,Zhenyu Gu,Emad Barsoum
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
[NLP-150] FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation
【速读】: 该论文试图解决的问题是:低秩适配(LoRA)在训练过程中为每个权重矩阵分配相同秩值,忽略了不同层对任务适应的贡献差异,从而可能导致资源分配效率低下。其解决方案的关键在于提出一种轻量级工程方法——在微调开始前进行八次校准反向传播,通过计算每个LoRA-B矩阵的梯度方差作为层重要性的代理指标,并将总秩预算按比例重新分配至各层。该方法无需引入新参数、不增加训练开销、也不改变部署基础设施,且通过高效近似经验Fisher信息矩阵(eFIM)对角线(仅限LoRA适配器矩阵)实现内存成本降低约256倍。实验表明,FIM-LoRA在GLUE基准上与标准LoRA性能相当(DeBERTa-v3-base:88.6 vs. 88.7),在LLaMA-3-8B上的常识推理任务中也接近原生LoRA(68.5 vs. 68.7),同时生成的逐层秩映射具有可解释性,符合Transformer结构中价值投影和早期到中期层更关键的已有认知。
链接: https://arxiv.org/abs/2605.16800
作者: Ramakrishnan Sathyavageeswaran
机构: Intuit(易智瑞)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages, 1 figure
Abstract:Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix - a practical convenience that ignores a fundamental reality: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern - no new parameters, no training overhead, no changes to serving infrastructure. We implement this via an efficient approximation of the empirical Fisher Information Matrix (eFIM) diagonal, restricted to LoRA adapter matrices only, which reduces memory cost by approximately 256x compared to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA (88.6 vs. 88.7) at the same parameter budget, and on commonsense reasoning with LLaMA-3-8B reaches 68.5 vs. 68.7 for LoRA. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, consistent with established findings on transformer layer roles.
[NLP-151] IER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition NEURIPS2026
【速读】: 该论文试图解决大语言模型在多步工具使用(tool use)任务中因奖励信号稀疏或依赖标注参考路径而导致的可扩展性问题。现有强化学习方法中,基于结果的奖励提供稀疏反馈,而基于轨迹监督的奖励则受限于人工标注的参考解,无法奖励有效的替代路径,从而限制了模型在复杂组合任务中的泛化能力。解决方案的关键在于提出TIER(Trajectory-Invariant Execution Rewards)框架,该框架通过函数schema和运行时执行信息直接生成奖励,而非依赖参考轨迹。TIER将奖励分解为格式有效性、schema合规性、执行成功性和答案正确性四个维度,提供细粒度、可解释且密集的序列级反馈,使任何有效执行路径都能获得奖励,从而支持多种解法策略并适应工具接口的演化。实验表明,在DepthBench基准上,TIER在1至6步任务中均达到90%准确率,显著优于依赖轨迹监督的方法(后者在第4步后性能崩溃),同时在BFCL v3和NestFUL等基准上也展现出一致提升,验证了多层级监督对组合推理的重要性。
链接: https://arxiv.org/abs/2605.16790
作者: Anay Kulkarni,ChiaEn Lu,Dheeraj Mekala,Jayanth Srinivasa,Gaowen Liu,Jingbo Shang
机构: UC San Diego; Cisco Research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. Submitted to NeurIPS 2026. 28 pages, 7 figures, 8 tables. Code and datasets available at this https URL
Abstract:Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves 90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.
[NLP-152] he Unlearnability Phenomenon in RLVR for Language Models ICML2026
【速读】: 该论文试图解决的问题是:在基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Reward, RLVR)中,尽管存在正确的轨迹(correct rollouts),仍有一部分困难样本无法被大语言模型(LLM)有效学习,而这一现象此前未被深入研究。解决方案的关键在于通过系统性分析揭示了这些不可学习样本的根本原因——它们存在本质性的表示缺陷(representation issue),表现为与其他样本梯度相似度低、推理模式不具备泛化能力;同时发现当前优化和采样技术以及数据增强手段均难以缓解此类问题,从而指出了现有RL方法在推理任务中的根本局限性。
链接: https://arxiv.org/abs/2605.16787
作者: Yulin Chen,He He,Chen Zhao
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to ICML 2026
Abstract:Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving Large Language Model’s (LLM) reasoning ability. However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at \urlthis https URL.
[NLP-153] Exploring Lightweight Large Language Models for Court View Generation
【速读】: 该论文旨在解决法律人工智能(Legal AI)中刑事法院视角生成(Criminal Court View Generation, CVG)任务的性能与效率问题,尤其是探索轻量级大语言模型(Lightweight Large Language Models, LLMs,参数量小于2B)在CVG及后续量刑预测中的能力。其解决方案的关键在于:首先构建了一个名为CVGEvalKit的评估框架,包含三个公开数据集及对应的量刑预测标签;其次通过系统性实验对比不同架构、模型规模的轻量级LLMs与深度神经网络(Deep Neural Networks, DNNs)在CVG质量与量刑预测准确性上的表现,并进一步验证“先生成法院视角再预测量刑”这一两阶段策略相较于直接量刑预测的优势。实验结果揭示了模型架构、规模与多任务协同之间的权衡关系,凸显了轻量级LLMs在司法AI场景下的实用潜力。
链接: https://arxiv.org/abs/2605.16770
作者: Zhitian Hou,Tianyong Hao,Nanli Zeng,Zhixiong Chao,Kun Zeng
机构: Sun Yat-sen University (中山大学); South China Normal University (华南师范大学); China Mobile Internet Co., Ltd. (中国移动互联网有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of court view based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how does different architecture of LLMs affect the CVG quality and charge prediction. (2) how does LLMs size contribute to the performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting charge by court view generation first compare with predicting it directly. Additionally, we also develop CVGEvalKit, an evaluation framework including three public available datasets for CVG tasks, as well as predicting their charges. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset’s test set. Experimental results provide new insights into the trade-offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at \urlthis https URL
[NLP-154] Retrieval-Based Multi-Label Legal Annotation: Extensible Data-Efficient and Hallucination-Free
【速读】: 该论文旨在解决多标签法律标注任务中面临的两大挑战:一是传统参数化编码器在标签集变化时需重新训练,难以适应动态演化的法律分类体系;二是生成式大语言模型(Generative Large Language Models, LLMs)在标签空间扩大时计算成本剧增且性能下降。其解决方案的关键在于将法律标注问题转化为检索任务——使用冻结的检索模型对文档和标签描述进行嵌入,并通过嵌入空间中的k近邻(k-nearest neighbors)预测标签,从而避免梯度反向传播,仅需重新嵌入和重索引即可更新模型。实验表明,该方法在三个法律数据集上达到与现有方法相当甚至更优的准确率,同时显著提升数据效率(如Eurlex上仅用100样本即使Micro-F1提升至48.29,远超Hierarchical Legal-BERT的27.87),并大幅降低计算开销(相比微调减少20–30倍)。此外,论文还揭示了生成式推理存在标签越界(hallucination)的问题,而基于检索的方法天然保证标签合法性,具备更高的可靠性。
链接: https://arxiv.org/abs/2605.16767
作者: Li Zhang,Jaromir Savelka,Kevin Ashley
机构: University of Pittsburgh (匹兹堡大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 figures
Abstract:Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only (N=100) training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.
[NLP-155] Language Acquisition Device in Large Language Models ACL2026
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在数据效率上远低于人类的问题。现有研究通过预预训练(Pre-pretraining, PPT)在合成语言上提升模型效率,但多依赖于高表达力的形式语言(如k-Shuffle Dyck),未能充分模拟人类语言习得机制。论文提出受语言习得装置(Language Acquisition Device, LAD)假说启发的PPT方法:在MP-STRUCT这一形式语言上进行预预训练,该语言通过MERGE、AGREE和MOVE操作编码层次结构、特征依赖与长距离移位,更贴近人类语言的内在语法约束。其关键创新在于,仅需500步预预训练即可在词元效率上媲美强基线,并赋予模型对结构不合理语言(如REVERSE)的人类级鲁棒性;进一步分析表明,功能地标(functional landmarks)显著降低了依赖解析歧义,揭示了有效PPT设计不仅依赖表达能力,更取决于依赖解析的可访问性——这挑战了此前认为PPT语言必须同时具备层次表达能力和电路理论可学习性的假设。
链接: https://arxiv.org/abs/2605.16758
作者: Masato Mita,Taiga Someya,Ryo Yoshida,Yohei Oseki
机构: The University of Tokyo
类目: Computation and Language (cs.CL)
备注: Accepted to ACL2026 Main Conference
Abstract:Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages such as k -Shuffle Dyck. Inspired by the Language Acquisition Device (LAD) hypothesis, which posits that innate constraints preemptively restrict the learner’s hypothesis space to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE. A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages (e.g., REVERSE). Analyzing simplified variants, we find that MP-STRUCT CORE outperforms k -Shuffle Dyck despite not being definable in C-RASP (a formal bound on transformer expressivity), challenging the prior hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. We show that functional landmarks, which reduce dependency resolution ambiguity, are a key driver, suggesting that effective PPT design depends not only on expressivity but also on the accessibility of dependency resolution.
[NLP-156] EmoMind: Decoding Affective Captions from Human Brain fMRI
【速读】: 该论文试图解决当前脑到文本(brain-to-text)系统在生成描述时忽略情感信息的问题,以及现有基于离散情绪标签的提示方法无法捕捉个体间丰富情感差异的局限性。其解决方案的关键在于提出EmoMind——首个端到端从fMRI信号直接解码情感化描述的框架:首先从脑激活信号中提取语义上合理的中性场景描述,再利用同一fMRI记录中解码出的34维连续情感向量对描述进行重写;通过无分类器引导(classifier-free guidance)训练重写模块,在保持语义一致性与增强情感表达之间实现可控平衡。实验验证表明,EmoMind在个体特异性、结构几何性和因果控制三个维度上均显著优于以GPT-4配合离散情绪标签为基准的方法,尤其在需要个体化情感结构建模的任务中表现突出,证明了连续脑解码情感信号作为个性化情感描述控制信号的可行性。
链接: https://arxiv.org/abs/2605.16739
作者: Bilal A. Mohammed,Lin Gu,Ruogo Fang
机构: Vanderbilt University (范德比尔特大学); Tohoku University (东北大学); University of Florida (佛罗里达大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Decoding visual experience from brain activity has advanced substantially, but cur- rent brain-to-text systems largely recover semantic content while discarding affect. Additionally, language models can generate emotional text when prompted with categorical labels, but such labels collapse rich inter-subject variability into coarse discrete bins. We present EmoMind, the first end-to-end pipeline for decoding affective captions directly from fMRI signals. EmoMind first retrieves a semanti- cally grounded neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector decoded from the same fMRI recording. To control the balance between content preservation and affective expression, we train the rewriter with classifier-free guidance against an identity-preserving null branch, enabling smooth interpolation between semantic fidelity and affective expressivity. We evaluate affective caption generation with a three-axis validation framework spanning subject-specificity, structural geometry, and causal control. We further augment this framework with a synthetic-brain substitution test that probes robustness to the measurement apparatus, and we benchmark each axis against GPT-4 prompted with brain-decoded top-5 emotion labels as a strong discrete baseline. Across two independent emotion fMRI datasets, EmoMind significantly outperforms label-prompted GPT-4 on all three axes, with the largest gains on metrics that require person-specific affective structure rather than population-level emotion aggregation. These results establish continuous brain-decoded affect as a viable control signal for individualized affective cap- tion generation and open new directions for studying individual affective brain organisation.
[NLP-157] CHI-Bench: Can AI Agents Automate End-to-End Long-Horizon Policy-Rich Healthcare Workflows?
【速读】: 该论文试图解决当前医疗自动化基准在三个关键能力上的缺失问题:政策密度(policy density),即决策必须基于庞大的医学、保险和运营规则库;多角色组合(multi-role composition),即单个任务要求代理扮演多个角色并完成交接;以及多方交互(multilateral interaction),即中间工作流程涉及多轮对话,如同行评审和患者联络。解决方案的关键在于提出一个名为χ-Bench的长期医疗工作流基准,涵盖三个领域:提供方事前授权、支付方利用管理与护理管理。每个任务通过高保真模拟器向代理呈现临床案例,并暴露87个MCP工具以驱动20个医疗应用程序,代理需通过工具调用和撰写各角色文档,最终达成终端状态,整个过程受控于一本包含1,290余份文档的管理式医疗操作手册。实验表明,最佳代理仅能解决28.0%的任务,无任何代理在严格标准下达到20%,且单次会话执行全部任务时性能骤降至3.8%,这揭示了政策密集型、角色复合型、不可逆的企业场景中存在显著性能差距。
链接: https://arxiv.org/abs/2605.16679
作者: Haolin Chen,Deon Metelski,Leon Qi,Tao Xia,Joonyul Lee,Steve Brown,Kevin Riley,Frank Wang,T. Y. Alvin Liu,Hank Capps MD,Zeyu Tang,Xiangchen Song,Lingjing Kong,Fan Feng,Tianyi Zeng,Zhiwei Liu,Zixian Ma,Hang Jiang,Fangli Geng,Yuan Yuan,Chenyu You,Qingsong Wen,Hua Wei,Yanjie Fu,Yue Zhao,Carl Yang,Biwei Huang,Kun Zhang,Caiming Xiong,Sanmi Koyejo,Eric P. Xing,Philip S. Yu,Weiran Yao
机构: actAVA.ai; Johns Hopkins Medicine; Wellstar Health System; Stanford University; CMU; UCSD; Yale School of Medicine; Salesforce AI Research; University of Washington; Northeastern University; Brown University; Boston College; Stony Brook University; University of Oxford; Arizona State University; University of Southern California; Emory University; MBZUAI; Recursive Superintelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Website: this https URL Code: this https URL Dataset: this https URL
Abstract:End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, decisions must be grounded in a large library of medical, insurance, and operational rules; Multi-role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce \chi -Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role’s artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/models configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session slumps the performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
[NLP-158] A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research
【速读】: 该论文试图解决的问题是:如何在大规模语料中有效识别句子语境下的“方式动词”(manner verbs)和“结果动词”(result verbs),以支持早期动词习得研究中的语义分析。由于目前缺乏大规模标注资源,这一区分难以实现自动化测量。其解决方案的关键在于:首先利用语言学启发的提示(linguistically informed prompts)通过大语言模型对来自MASC和InterCorp的数据进行句级标注,扩展VerbNet中已标注的436个动词类别;随后基于这些自动标注数据训练一个RoBERTa分类器,并在三个保留的黄金标准数据集上进行评估,结果显示平均准确率达89.6%。该方法为未来发展语言学和语义研究提供了可扩展的测量工具,但对边界案例、混合类型动词及下游应用仍需进一步验证。
链接: https://arxiv.org/abs/2605.16654
作者: Divyesh Pratap Singh,Dakshesh Gusain,Federica Bulgarelli,Alison Eisel Hendricks,John Beavers,Nathan M. Beers,Ifeoma Nwogu
机构: University at Buffalo; Nanyang Technological University; The University of Texas at Austin
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Manner and result verbs encode different aspects of event structure and have been discussed in developmental work as a potentially informative distinction for studying early verb learning. However, this distinction remains difficult to measure at scale because large annotated resources for manner and result classification are not currently available. We present a computational approach for identifying manner and result verbs in sentence context. Using linguistically informed prompts, we generate sentence-level annotations with large language models over data drawn from MASC and InterCorp, extending coverage from previously annotated portions of VerbNet to 436 classes. We then train a RoBERTa-based classifier on these annotations and evaluate it on three held-out gold-standard datasets, including previously annotated items and a new expert-annotated set. Across these evaluations, the model shows promising performance, with average accuracy up to 89.6%. We present this work as a scalable measurement tool that can support future research on verb semantics in developmental and other language datasets, while noting that further validation is needed for borderline cases, mixed manner/result verbs, and downstream developmental applications.
[NLP-159] SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs
【速读】: 该论文试图解决多轮对话系统评估中的长期一致性问题,即现有自动评估方法(如基于大语言模型的评分框架和嵌入式指标)因依赖扁平化或孤立回合表示,难以有效检测跨轮次的矛盾、话题漂移和实体不一致等长程缺陷。其解决方案的关键在于提出SKG-Eval框架,该框架将对话建模为随轮次演化的语义知识图谱(Semantic Knowledge Graph, SKG),通过结构化三元组提取增量更新图结构,并计算三个互补信号:(i) 局部相关性(当前轮次与提示及参考的一致性)、(ii) 历史一致性(新信息与历史上下文的连接性,结合图结构与嵌入驱动信号)、(iii) 逻辑连贯性(基于几何矛盾引擎无须自然语言推理模型即可检测跨轮冲突)。这些信号经递归加权趋势分析融合为长度不变的会话评分,从而在多个基准上显著提升与人工判断的相关性,并实现可复现、可审计的矛盾证据生成。
链接: https://arxiv.org/abs/2605.16650
作者: Avijit Shil,Suman Samui
机构: Maulana Abul Kalam Azad University of Technology, West Bengal, India; National Institute of Technology Durgapur, West Bengal, India
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 36 Pages, 6 Figures
Abstract:Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.
[NLP-160] Beyond Sentiment Classification: A Generative Framework for Emotion Intensity Evaluation in Text
【速读】: 该论文试图解决的问题是:传统情感分析中依赖离散分类(如“积极”、“中性”、“消极”)的方法在实际应用领域(如金融)中存在局限性,无法准确捕捉情绪强度的细微差异,从而影响决策质量。其解决方案的关键在于提出一种基于生成式语言模型的情感强度评估框架,通过构建情感强度评分数据集并微调开源大模型,使其输出0-100之间的连续数值,从而实现对文本中情绪强度的更精细、更具泛化能力的量化表达。该方法不仅优于传统分类基线,还在情绪、情绪唤醒度等相近概念上展现出良好的迁移能力,为自然语言处理与金融等领域提供了更契合实际需求的情绪建模新范式。
链接: https://arxiv.org/abs/2605.16613
作者: Francesco A. Fabozzi,Dasol Kim,William N. Goetzmann
机构: Yale School of Management (耶鲁管理学院); Office of Financial Research (金融研究办公室)
类目: Computation and Language (cs.CL); General Economics (econ.GN); General Finance (q-fin.GN)
备注: 10 pages, no figures, 5 tables
Abstract:We introduce a novel approach to emotion modeling that shifts the focus from identification to evaluation, addressing the limitations of discrete classification in applied domains such as finance. By constructing a dataset of emotional intensity scores and fine-tuning open-weight generative language models to output continuous values from 0-100, we demonstrate a more expressive, generalizable framework for sentiment and emotion analysis. Our findings not only outperform classification baselines but also reveal surprising generalization capabilities and transfer effects to related constructs such as sentiment and arousal. This work contributes to the interdisciplinary recontextualization of NLP by introducing emotion intensity evaluation as an alternative to classification, arguing that this shift better aligns with the needs of domains–such as finance–where the degree of emotional content is central to interpretation and decision-making. Comments: 10 pages, no figures, 5 tables Subjects: Computation and Language (cs.CL); General Economics (econ.GN); General Finance (q-fin.GN) Cite as: arXiv:2605.16613 [cs.CL] (or arXiv:2605.16613v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.16613 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: William Goetzmann [view email] [v1] Fri, 15 May 2026 20:32:29 UTC (34 KB) Full-text links: Access Paper: View a PDF of the paper titled Beyond Sentiment Classification: A Generative Framework for Emotion Intensity Evaluation in Text, by Francesco A. Fabozzi and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2026-05 Change to browse by: cs econ econ.GN q-fin q-fin.EC q-fin.GN References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[NLP-161] o MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Embeddings Except In Heavy Truncation Scenarios
【速读】: 该论文试图解决的问题是:在文本编码器训练中,Matryoshka Representation Learning (MRL) 是否真的能提升嵌入向量在不同尺寸下(通过截断获得)的下游任务性能,相较于随机截断方法是否具有显著优势。解决方案的关键在于:通过在相同截断策略下对比 MRL 与非 MRL 训练模型的表现,发现除非进行极端截断(即向量维度减少至少 80%),否则非 MRL 模型的截断嵌入在多个模型和下游任务中不仅不逊色,反而常常优于 MRL 模型。这表明截断鲁棒性可能并非来自 MRL 本身,而是源于其他因素,因此是否采用 MRL 应取决于是否需要极端压缩嵌入维度。
链接: https://arxiv.org/abs/2605.16608
作者: Sotaro Takeshita,Yurina Takeshita,Simone Paolo Ponzetto,Daniel Ruffinelli
机构: University of Mannheim (曼海姆大学); NEC Laboratories Europe (NEC欧洲实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Matryoshka Representation Learning (MRL) is a widely adopted approach for training text encoders so they provide useful text representations at various sizes, available by simply truncating the resulting vectors at sizes pre-determined at training time. Recent works have shown that randomly truncating text embeddings has minimal impact in downstream performance unless vectors are reduced in size by at least 70%, suggesting that embeddings are already robust to truncation without the use of MRL. However, no prior work has compared random truncation to MRL, so it is unclear how the two methods compare as effective embedding reduction methods. In this paper, we study this by applying the same truncation used by MRL to models trained with and without MRL. Our results across several models and downstream tasks show that, unless heavily truncating embeddings (i.e. reducing their size by at least 80%), truncated embeddings of non-MRL models are competitive with, and often outperform models trained with MRL. This suggests that truncation robustness may not necessarily come from MRL, and that the choice of spending the additional training cost of MRL depends on whether heavy truncation is desired.
[NLP-162] Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
【速读】: 该论文试图解决的问题是:在基于交叉熵的预训练与偏好对齐(preference alignment)过程中,尽管两者更新相同的Transformer权重,却会在权重空间中留下几何上不同的痕迹,这种不对称性如何产生并可被量化。解决方案的关键在于提出了一种相对子空间分数探测器(relative-subspace-fraction probe),用于追踪权重变化(weight deltas)相对于残差流激活子空间和由解嵌入矩阵定义的预测子空间的对齐情况。研究发现,读路径权重(W_Q, W_K)的更新集中在注意力输入激活主方向上,而写路径权重(W_O, W_2)则相对于预测子空间近似各向同性。这一现象可通过各向异性梯度累积机制解释:权重更新为外积之和 δ_t a_t^T,其方向结构继承自协方差集中的那一侧;对于读路径,输入激活 a_t 的协方差在训练后的模型中具有尖峰特征,导致目标无关的方向集中;而对于写路径,相关侧是上游梯度 δ_t,其各向异性取决于损失函数——交叉熵提供样本级锐利信号,使预训练期间写路径获得预测几何结构,而偏好对齐目标通常不会进一步增强写路径的方向集中。作者通过检查点内轨迹、分级对比目标控制实验以及匹配方向的闭式秩1干预,提供了因果证据支持该权重空间几何假设。
链接: https://arxiv.org/abs/2605.16600
作者: Valeria Ruscio,Eli-Shaoul Khedouri,Keiran Thompson
机构: Intuition Machines
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ( W_Q , W_K ), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ( W_O , W_2 ) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix W are sums of outer products \delta_t a_t^\top , and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation a_t , whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient \delta_t , whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.
[NLP-163] Scaling Accessible Mathematics on arXiv: HTML Conversion and MathML 4
【速读】: 该论文试图解决的问题是:如何提升arXiv平台上科学论文的可访问性与用户体验,特别是在数学内容的HTML格式转换和无障碍阅读方面。解决方案的关键在于:(1)通过社区驱动的改进提升HTML渲染精度和服务稳定性(已解决约一半的用户报告问题);(2)推进大规模文献的HTML转换工作,目标实现90%无错误的转换率(当前为75%);(3)引入MathML 4 Intent标注以支持无障碍语音输出;(4)将LaTeXML迁移至Rust语言,显著降低计算成本并加快提交时的预览速度。这些进展表明,arXiv HTML Papers项目正逐步从实验阶段走向成熟,更好地满足读者需求并利用新兴技术标准与编程语言进步。
链接: https://arxiv.org/abs/2605.16562
作者: Deyan Ginev,Brian Caruso,Bruce Miller,Jeff Sank,Jacob Weiskoff
机构: 未知
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: 6 pages, ICMS 2026
Abstract:We report on the ongoing development of arXiv’s HTML Papers offering, available on every new TeX/LaTeX submission since its initial release in 2023. The main highlights from 2025 and early 2026 are: (i) community-driven improvements to HTML fidelity and service health, with roughly half of 6,000 user reports resolved; (ii) corpus-scale conversion work aimed at 90% error-free HTML (currently 75%); (iii) initial MathML 4 Intent annotations for accessible speech output; (iv) an in-progress Rust port of LaTeXML, reducing compute costs and enabling faster previews on submission. The arXiv HTML Papers project remains experimental, but is gradually maturing as we better understand the needs of arXiv’s readers and the technical opportunities presented by new standards and by advances in programming languages and AI. Comments: 6 pages, ICMS 2026 Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL) MSC classes: 68U15 (Primary) 68V25, 68U35 (Secondary) ACMclasses: I.7.2; H.3.7 Cite as: arXiv:2605.16562 [cs.CL] (or arXiv:2605.16562v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.16562 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Deyan Ginev [view email] [v1] Fri, 15 May 2026 19:04:45 UTC (25 KB) Full-text links: Access Paper: View a PDF of the paper titled Scaling Accessible Mathematics on arXiv: HTML Conversion and MathML 4, by Deyan Ginev and 4 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2026-05 Change to browse by: cs cs.DL References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[NLP-164] PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
【速读】: 该论文试图解决大语言模型(Large Language Models, LLM)代理在实际应用中难以有效评估的问题,特别是如何高效识别那些因真实用户意图而触发的失败案例,而非仅依赖对抗性输入。现有方法主要关注由恶意用户构造的攻击场景,忽视了符合真实用户意图但依然导致代理失效的查询。解决方案的关键在于提出PQR框架,其核心机制是通过两个互补模块的迭代交互:查询优化模块负责生成多样化的查询变体以探索潜在失败场景,提示优化模块则基于历史反馈提炼出违反特定目标(如有用性、安全性等)的新策略和真实性规则,从而生成既具有破坏性又贴近真实用户意图的测试查询。实验表明,PQR在电商问答代理上能发现23%–78%更多的无帮助响应,且生成的查询更具多样性与现实性。
链接: https://arxiv.org/abs/2605.16551
作者: Yunan Lu,Luigi Liu,Omar Yahia,Arpit Sharma,Zhou Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior works primarily focus on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that not only surfaces agent failures with respect to specific objectives (e.g., helpfulness, safety, etc.) but also resembles real users’ intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent’s unhelpful responses. Our method uncovers 23% - 78% more unhelpful responses, and our generated queries are more diverse and realistic compared to previous methods.
[NLP-165] Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces
【速读】: 该论文试图解决医疗领域中语音识别系统在准确性、专业术语处理和临床场景适应性方面的局限性问题。现有方案通常仅针对通用语音转录或特定字典录入流程优化,难以满足安全关键环境下的可靠性要求及更广泛的临床工作流需求。解决方案的关键在于提出Symphony系统,其将语音转文字过程分解为专门的识别、格式化与上下文校正模块,从而在保持高医学术语召回率的同时,实现实时生成结构化临床文本,并具备跨应用场景的自适应能力。评估结果表明,Symphony在医疗语料上显著优于当前最优系统,同时在通用领域表现持平甚至超越,证明其具有良好的泛化能力而非过拟合。此外,作者还发布了首个面向临床验证的基准数据集,以推动该领域的持续进步。
链接: https://arxiv.org/abs/2605.16545
作者: Arne Nix,Robert James,Lasse Borgholt,Anna B. Ekner,Lana Krumm,Julius Severin,Dan Engel,Lars Maaløe,Jakob Havtorn
机构: Corti
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.
[NLP-166] he Scaling Laws of Skills in LLM Agent Systems
【速读】: 该论文试图解决大规模代理系统中技能库(skill library)扩展时性能下降的问题,特别是技能路由准确率随库规模增长而衰减、以及执行阶段错误累积导致下游任务失败的机制不明确问题。解决方案的关键在于发现并验证两个耦合的尺度定律:一是“路由定律”,即单步路由准确率随技能库规模对数衰减(R²=0.97),误差由局部技能竞争演变为跨家族漂移和被通用“黑洞技能”劫持;二是“执行定律”,即在状态实现前联合路由近似乘积关系,且正确执行可使困难下游决策成功率提升约4倍。这两个定律通过一个单一参数——路由对数衰减斜率 $ b $ ——紧密耦合,表明技能库结构特性同时控制预执行阶段的性能崩溃与下游任务的可恢复性。基于此规律的优化策略显著提升了路由准确率(71.3%→91.7%)、降低了劫持率(22.4%→4.1%),并在ClawBench和ClawMark基准上分别将平均通过率提升至61.6%和34.5%,证明了技能库结构、粒度与暴露策略对代理性能的核心影响。
链接: https://arxiv.org/abs/2605.16508
作者: Charles Chen,Qiming Yu,Yuhang Gu,Zhuoye Huang,Hanjing Li,Hongyu Liu,Simin Liu,Jinhao Liu,Dengyun Peng,Jiangyi Wang,Zheng Yan,Fanqing Meng,Ethan Qin,Carl Che,Mengkang Hu
机构: Evolvent AI Team (Evolvent AI 团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Technical Report
Abstract:As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ( R^20.97 for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general “black-hole skills”. Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about 4\times . A single parameter, the routing logarithmic decay slope b , couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.
[NLP-167] Generative Artificial Intelligence for Literature Reviews
【速读】: 该论文试图解决的问题是:如何在文献综述(literature review)中有效、可靠地应用生成式人工智能(Generative AI),以提升其效率与质量,同时识别并应对由此带来的潜在风险。解决方案的关键在于构建一套基于大语言模型(Large Language Models, LLMs)的系统性方法论框架,涵盖通用型工具(如ChatGPT、Gemini、Claude)和专业型工具(如Consensus、Elicit)的使用策略,并通过结构化提示词(prompts)示例和方法学严谨的实践路径,指导用户在保持科学严谨性的前提下利用GenAI完成文献筛选、信息提取、摘要生成和知识整合等任务。论文强调采用平衡视角,在充分发挥GenAI能力的同时,审慎评估其可靠性、偏见及对科学研究长期发展的影响,从而为未来改进GenAI架构与训练数据、完善文献综述方法论提供研究方向。
链接: https://arxiv.org/abs/2605.16475
作者: Gerit Wagner,Julian Prester,Reza Mousavi,Roman Lukyanenko,Guy Pare
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:
Abstract:Generative artificial intelligence (GenAI), based on large-language models (LLMs), such as ChatGPT, has taken organizations, academia, and the public by storm. In particular, impressive GenAI capabilities such as summarization of large text corpora, question-answering, data extraction, and translation, carry profound implications for the conduct of literature reviews. This impacts science, organizations and the general public, as all can benefit from GenAI-supported literature reviews. Building on the technical foundations of GenAI and grounded in established methodological discourse, this work outlines approaches for conducting literature reviews using both general-purpose (e.g., ChatGPT, Gemini, Claude) and specialized GenAI tools (e.g., Consensus, Elicit). We provide illustrative examples of prompts and suggest methodologically-sound literature review strategies. Throughout this perspective paper, we adopt a balanced approach considering both the opportunities and the risks of relying on GenAI in the conduct of literature reviews. We conclude by discussing philosophical questions related to the effects of GenAI on long-term scientific progress, and also present fruitful opportunities for research on improving the core of GenAI’s technology-its architecture and training data-and suggest open issues in GenAI-based literature reviews methodology.
[NLP-168] Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex
【速读】: 该论文试图解决的问题是:如何从自然图像中识别出驱动大脑皮层特定体素(voxel)活动的视觉特征,从而突破现有基于人工神经网络的编码模型仅能提供相关性解释、无法揭示机制细节的局限。其解决方案的关键在于提出了一种名为MINE(Mechanistically Interpretable Neural Encoding)的新框架,该框架结合语言对齐的图像表征与可解释性工具,能够定位自然图像中驱动每个体素响应的具体视觉特征,并生成语义可解释的描述;进一步地,将这些特征归纳为每个体素的功能特征谱(per-voxel functional profiles),并通过反事实编辑实验验证其因果作用——即插入或移除预测特征会按预期方向改变体素激活水平,且基于功能谱的编辑效果更强,证明该方法能准确刻画体素的选择性。这为发现和因果验证关于神经功能的精细假设提供了新路径。
链接: https://arxiv.org/abs/2605.16468
作者: Idan Daniel Grosbard,Mor Geva,Galit Yovel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 40 pages, 28 figures
Abstract:A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel’s response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel’s response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel’s activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel’s selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.
[NLP-169] Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在的幻觉问题,即模型在生成文本时可能产生语法上合理但与视觉内容不一致或缺乏视觉依据的响应。其核心解决方案是提出一种分阶段的偏好优化(preference optimization)框架,通过有针对性的多模态数据构建来减少幻觉。关键在于:不是直接在通用指令遵循数据上进行优化,而是逐步构造靠近已知失败边界、聚焦于幻觉场景的偏好对,特别强调模糊的空间关系、物体间关联、OCR不确定性以及对抗性虚假前提训练;同时利用最小扰动但视觉不一致的“幻觉负样本”,使直接偏好优化(Direct Preference Optimization, DPO)能更有效地将基于视觉 grounded 的推理与看似合理的幻觉区分开来。实验表明,该方法在开源基准和真实多模态评估中显著提升了视觉一致性、减少了幻觉,并生成更具信息量的 grounded 回应,甚至优于多个前沿闭源 VLM。这揭示了幻觉不仅源于模型容量限制,也源于自回归概率生成机制在弱视觉引导下倾向于选择语言上更连贯的输出。
链接: https://arxiv.org/abs/2605.16411
作者: Qinwu Xu
机构: Meta AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
备注:
Abstract:Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, such as in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG) Cite as: arXiv:2605.16411 [cs.CV] (or arXiv:2605.16411v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.16411 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-170] Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
【速读】: 该论文旨在解决多模态大语言模型(MLLMs)在真实场景图像中普遍存在的光学字符识别(OCR)和多语言文本理解失败问题,尤其是在布局杂乱、字体微小、模糊、遮挡及复杂排版等退化视觉条件下。其解决方案的关键在于提出了一种面向OCR的多模态训练框架,包含三个核心组件:(i) 大规模合成OCR到翻译的数据生成,(ii) 基于LoRA适配的OCR感知监督微调(SFT),以及(iii) 结构化视觉思维链(CoT)提示机制,用于在不确定视觉条件下增强推理能力。该方法显著提升了OCR完整性、多语言翻译准确率及在劣质图像下的鲁棒性,尤其在提取小尺寸、模糊、分散或部分遮挡文本时表现优异,并减少了对语言先验的依赖,从而有效降低视觉模糊场景下的幻觉现象。
链接: https://arxiv.org/abs/2605.16409
作者: Qinwu Xu,Xin Liu,Yifan Jiang,Haoyu Ren
机构: Meta AI; The University of Texas at Austin
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems. Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2605.16409 [cs.CV] (or arXiv:2605.16409v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.16409 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Qinwu Xu [view email] [v1] Wed, 13 May 2026 14:16:39 UTC (5,153 KB)
[NLP-171] Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture
【速读】: 该论文试图解决的问题是:如何在高风险场景下(如专利检索、金融监管、临床决策支持等)对大型语言模型(LLM)的结构化计算流程进行形式化验证,而非直接验证模型本身。其核心挑战在于确保LLM流水线中各组件的确定性行为和可靠性,同时避免依赖不可信或不可审计的假设。解决方案的关键在于构建一个基于Lean 4的可信任边界框架,通过三类局部证书家族(冲突感知双格接地、嵌入敏感性与改写稳定性、霍尔风格代理动作)和两个核心操作符(最大可认证残差算子、组合稳定性定理),实现对整个流水线扰动预算的闭式建模与审计日志记录,并以“通用保证卡”整合为每次调用的交付成果。所有证书均通过Lean 4内核类型检查且无sorry声明,仅依赖于明确划分层级的假设(数学占位符、密码学假设、ML/人类预言机),从而保障了形式化验证的严谨性和可审计性。
链接: https://arxiv.org/abs/2605.16407
作者: George Koomullil
机构: 未知
类目: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Programming Languages (cs.PL)
备注: 83 pages, 1 figure, 12 tables
Abstract:We present a framework for verifying the deterministic structured computations surrounding a large language model rather than the model itself, extending a Lean 4 trust-boundary architecture to the generic interfaces of modern LLM pipelines. Certificate validity is a Lean 4 kernel type-check plus a sorry-free transitive axiom audit against the trusted set propext, this http URL, this http URL; other assumptions are declared and partitioned by tier (mathematical placeholders, cryptographic assumptions, ML/human oracles). The technical contribution comprises three local certificate families and two operators. The families are conflict-aware bilattice grounding (with an emission-gate soundness lemma), embedding sensitivity and paraphrase stability, and Hoare-style agent action. The operators are a Maximal Certifiable Residue, which turns abstention into the maximum-weight certifiable residue with audit-logged dropped claims, and a Compositional Stability theorem, which yields a closed-form pipeline-wide perturbation budget from per-layer gains and margins. The three families plus a Universal Assurance Card consolidator form the per-call deliverable for high-stakes deployments: patent and legal retrieval, regulated finance, clinical decision support, and agentic systems with irreversible side effects. A compiled Lean 4 reference artifact (Lean v4.30.0-rc2, Mathlib) covers all 22 certificate types, with 17 of 46 kernel-audited declarations axiom-free, the rest depending only on the trusted set and declared assumptions, and zero uses of sorryAx or this http URL. The three families are empirically tested through four registered pilots: bilattice grounding on adversarially perturbed HotpotQA, embedding sensitivity in short- and long-form settings, and Hoare-style agent action on a filesystem sandbox with adversarial prompt injection. Comments: 83 pages, 1 figure, 12 tables Subjects: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Programming Languages (cs.PL) MSC classes: 68Q60, 68V15, 68T50 ACMclasses: F.3.1; D.2.4; F.4.1; I.2.7 Cite as: arXiv:2605.16407 [cs.LO] (or arXiv:2605.16407v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2605.16407 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-172] Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning CVPR2026
【速读】: 该论文旨在解决立体几何(solid geometry)中多模态推理能力不足的问题,即现有方法大多局限于平面几何(plane geometry),难以处理三维空间图形的复杂结构与逻辑推理。其解决方案的关键在于提出Hilbert-Geo框架,这是一个统一的形式化语言体系,包含丰富的谓词库(predicate library)和专用定理库(theorem bank),并设计了Parse2Reason方法:第一步通过条件描述语言(Conditional Description Language, CDL)将自然语言问题和视觉图像统一转化为形式化表示;第二步基于CDL表达式与定理库进行关系推理与代数计算,从而生成严格正确、可验证且人类可读的推理链。此方法在两个新构建的专家标注数据集SolidFGeo2k和PlaneFGeo3k上实现了显著优于主流多模态大语言模型(MLLMs)的性能,证明了其在立体几何和平面几何中的通用性与有效性。
链接: https://arxiv.org/abs/2605.16385
作者: Ruoran Xu,Haoyu Cheng,Bin Dong,Qiufeng Wang
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Ricoh Software Research Center Beijing Co., Ltd (理光软件研发中心北京有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: CVPR2026
Abstract:Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, however most of works focus on plane geometry while usually fail in solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps of first parsing then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage those formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated dataset SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves the state-of-the-art (SOTA) performance 77.3% in SolidFGeo2k and 84.1% in MathVerse-Solid (one small subset in MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves the SOTA accuracy 80.2% in PlaneFGeo3k, demonstrating the generality of the Hilbert-Geo in geometric reasoning. Our code and datasets will be publicly available.
[NLP-173] WASIL: In-the-Wild Arabic Spoken Interactions with LLM s
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)语音助手在真实场景中因自动语音识别(Automatic Speech Recognition, ASR)错误导致用户意图扭曲的问题,同时区分由ASR错误引起的响应质量下降与固有不可回答性(如模糊、域外或非请求类对话)之间的干扰。解决方案的关键在于构建了WASIL数据集——一个包含阿拉伯语口语交互的多模态数据集,涵盖音频、ASR假设、助理回复及显式的点赞/点踩反馈(共8,529个回合,14.2%为不喜欢),并引入低成本高质量黄金转录文本(通过多ASR一致性的后编辑获得),以及对回答可答性进行细粒度标注(可回答、模糊/需澄清、不支持、非请求/噪声),从而实现对ASR影响的有效隔离;此外,提出了一种基于多判别器LLM评分的无参考评估方法,用于规模化比较ASR输入与黄金转录输入下的响应质量差异。
链接: https://arxiv.org/abs/2605.16364
作者: Zien Sheikh Ali,Hamdy Mubarak,Soon-Gyo Jung,Hunzalah Hassan Bhatti,Firoj Alam,Shammur Absar Chowdhury
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Spoken Prompts, Multilingual LLMs, Speech-based Evaluation, Dialectal Speech, Low-resource Languages, Conversational AI, Speech-to-Text QA, Real-world Interaction, Spoken Language Understanding
Abstract:Large Language Models (LLMs) voice assistants are commonly built as cascaded Automatic Speech recognition (ASR) to LLM systems, where recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, making it hard to isolate ASR effects. We release WASIL (it denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses from ASR vs. gold transcripts using multi-judge LLM scoring.
[NLP-174] DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models
【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models)在强化学习(RL)训练中面临的两个关键问题:一是缺乏对去噪轨迹中不同步骤的时间信用分配(temporal credit assignment),二是依赖于高方差、有偏的均场似然估计(mean-field likelihood estimates)进行策略优化。解决方案的核心是提出一种轻量级、可插拔的增强方法——DACA-GRPO,其包含两个互补机制:一是去噪进度评分(Denoising Progress Scores),通过中间预测结果无额外前向计算提取每个token的重要性权重;二是分层掩码似然(Stratified Masking Likelihood),将token位置划分为若干层,使每个token在尽可能多的上下文条件下被预测,从而降低均场估计偏差。实验表明,DACA-GRPO在三种GRPO基线方法上均实现稳定提升,在数学推理、代码生成、约束满足和JSON模式遵循等七个基准任务中取得显著改进,最高提升达5.6个百分点(pp)以上。
链接: https://arxiv.org/abs/2605.16342
作者: Amin Karimi Monsefi,Dominic Culver,Nikhil Bhendawade,Lokesh Boominathan,Manuel R. Ciosici,Yizhe Zhang,Irina Belousova
机构: The Ohio State University (俄亥俄州立大学); Apple(苹果)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.
[NLP-175] Vidya: An AI-Driven Modular Pipeline for Archival Automation and Semantic Metadata Enrichment
【速读】: 该论文试图解决历史档案数字化过程中产生的“暗数据”(dark data)问题——即大量缺乏元数据的数字对象难以被检索和利用。传统手工编目方式效率低、成本高,限制了档案的发现与再利用。解决方案的关键在于提出一个名为Vidya的模块化流水线系统,该系统通过协调大型语言模型(LLMs)与开源软件(FOSS)工具,实现语义增强与档案入库的规模化自动化。Vidya通过YAML定义的本体(ontology)和Pydantic验证机制约束生成内容,将概率性模型输出转化为确定性的结构化JSON格式,从而确保输出的一致性和可操作性。该系统基于Pontagrossa州立大学数字人文与创新实验室(LAMUHDI)开发,采用Maker原则和开源实践,在资源有限的机构中也能以低成本部署,显著缩短处理时间(从数十年降至数天),同时符合NOBRADE和ISAD(G)国际档案标准。
链接: https://arxiv.org/abs/2605.16338
作者: Cloter Migliorini Filho,Julia Graciela Machado,Edson Armando Silva,Marcella Scoczynski
机构: 1. University of São Paulo (圣保罗大学); 2. Federal University of Rio Grande do Sul (南里奥格兰德联邦大学)
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:
Abstract:The large-scale digitization of historical archives has created a paradox: “dark data”-digital objects lacking metadata for retrieval. Manual archival description is slow and expensive, limiting discovery and reuse. We propose Vidya, a modular pipeline that orchestrates Large Language Models (LLMs) and FOSS tools to automate semantic enrichment and archival ingestion at scale. Vidya constrains generations using YAML-defined ontologies and Pydantic validation, producing deterministic, structured JSON outputs from probabilistic models. Developed at Laboratory for Digital Humanities and Innovation (LAMUHDI) of the State University of Ponta Grossa (UEPG), Vidya applies Maker principles and open-source practices to enable low-cost deployment in memory institutions using modest hardware. We compare LLM performance and present a cost-benefit analysis showing major gains, reducing processing time from decades to days while complying with NOBRADE and ISAD(G).
[NLP-176] From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes
【速读】: 该论文试图解决的问题是:仅使用人口统计学特征(如国家、年龄、性别、就业状态、收入、教育程度和婚姻状况)定义的大语言模型(LLM)代理在预测调查回答时是否存在偏差,以及其表现是否优于基于更广泛领域内调查响应定义的代理。解决方案的关键在于对比两类代理的预测准确性——一类仅依赖人口统计学变量(demographics-only agents),另一类则基于更丰富的调查响应数据构建(survey agents)。研究发现,仅用人口统计学特征定义的代理存在“中心趋势偏差”(central tendency bias),导致预测结果过度集中在人群均值,并且错误地高估了准确性,未能再现人类受访者常见的错误回答和“不知道”选项;而基于调查响应训练的代理不仅能准确捕捉个体差异,还能重现多个政策相关变量之间的交互效应,从而更真实地模拟人类行为。这表明,在构建用于预测调查数据的LLM代理时,应避免仅依赖人口统计学信息,而需引入更多与任务相关的具体调查内容以提升模型效度。
链接: https://arxiv.org/abs/2605.16303
作者: Rubén Garzón,Pauline Baron,Vincent Grari,Jonne Kamphorst,Michael Bernstein,Marcin Detyniecki
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 50 pages, 22 figures
Abstract:Large language models (LLM) agents may offer tools to predict human responses to surveys. A common technique for defining these agents uses only demographics, for example country, age, gender, employment status, income, education and marital status. We compare the predictive accuracy of demographic agents to that of survey agents defined with a larger set of in-domain survey responses. We test both approaches in predicting responses to the multidisciplinary, cross-national Survey of Health, Ageing and Retirement in Europe (SHARE), focusing on five variables from three policy-relevant constructs around personal finance. In these three constructs, we observe that, compared to survey agents trained on broader data, demographics-only agents (1) exhibited a central tendency bias, skewing answers toward population means, and (2) were unrealistically accurate, failing to reproduce the incorrect answers and “don’t know” responses typical of human respondents. These performance differences are further substantiated through the replication of a hierarchical regression analysis from prior retirement planning research. Agents based solely on demographic information reproduce the outcome that financial risk tolerance, future time perspective, and knowledge of retirement planning each are predictive of retirement savings. However, only the survey-anchored agents succeed in reproducing the interaction among these three factors. These findings suggest caution in using only demographics to define LLM agents for predicting survey responses.
[NLP-177] Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在多步推理任务中因稀疏终端奖励导致的信用分配(credit assignment)困难问题。传统方法将最终奖励均匀传播至所有中间决策,造成梯度方差高、训练不稳定以及大量无效更新,阻碍模型持续改进。其解决方案的关键在于提出一种基于反事实比较的信用分配框架:通过在同一输入下采样多个推理轨迹,利用它们之间的差异作为替代决策的隐式近似,构建一个过程级优势估计器,从而将稀疏终端奖励转化为对每一步决策敏感的学习信号。在此基础上,作者进一步提出隐式行为策略优化(Implicit Behavior Policy Optimization, IBPO),显著提升了训练稳定性和数学与代码推理任务上的性能上限,为释放LLMs的推理潜力提供了新方向。
链接: https://arxiv.org/abs/2605.16302
作者: Fei Ding,Yongkang Zhang,Yeling Peng,Youwei Wang,Guoxiong Zhou,Zijian Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, leading to poor credit assignment conditions where the final feedback is evenly propagated across all intermediate decisions. This results in high gradient variance, unstable training, and numerous ineffective updates, ultimately causing the model to fail and preventing sustained improvement. We introduce a counterfactual comparison-based credit assignment framework, which samples multiple reasoning trajectories under the same input. By treating their differences as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training stability and performance upper bounds on mathematical and code reasoning benchmarks, pointing to a promising direction for unlocking the performance potential of LLMs.
[NLP-178] Linguistic Uncertainty and Reply Engagement on X: A Cross-Domain Replication of the Uncertainty-Reply Asymmetry
【速读】: 该论文试图解决的问题是:社交媒体中语言不确定性(linguistic uncertainty)与用户参与度(engagement)之间的关系在不同语言和话题背景下是否具有一致性,尤其是此前在阿拉伯语研究中观察到的“不确定性-回复不对称性”(Uncertainty-Reply Asymmetry)是否能在更广泛的情境下复现。解决方案的关键在于:采用基于词典的不确定性分类框架对2,258条英文帖子进行标注,发现约三分之一的帖子被识别为不确定内容;通过回归分析验证了不确定性与回复数量之间存在显著正相关关系(β = 0.126, p = 0.011),即不确定帖子平均获得82%更多的回复,而点赞和转发增幅较小,从而证实了该不对称参与模式在英语语境下的普遍性,表明语言不确定性可能是一种跨语言、跨话题的通用互动机制。
链接: https://arxiv.org/abs/2605.16289
作者: Mohamed Soufan
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 13 pages, 2 figures, 2 tables
Abstract:Linguistic uncertainty is common in social media, but its relationship with engagement remains unclear across languages and topics. Using 2,258 English-language posts on Federal Reserve policy, inflation, and electoral politics collected over three days in April 2026, we test whether the Uncertainty-Reply Asymmetry observed in prior Arabic-language research replicates in a broader context. Posts are classified using a lexicon-based uncertainty framework, with approximately one-third identified as uncertain. Uncertain posts receive 82% more replies on average than certain posts, with smaller increases in reposts and likes, replicating the asymmetric engagement pattern observed in prior work. Regression results confirm a positive and statistically significant association between uncertainty and replies (\beta = 0.126, p = 0.011), equivalent to ~13% higher expected reply engagement, while total engagement shows a positive but weaker association. These findings suggest that linguistic uncertainty systematically increases conversational engagement and may reflect a general interactional mechanism across languages and domains.
[NLP-179] When AI Tells You What You Want to Hear: Sycophantic Behavior of Large Language Models in Dementia Care Settings
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在痴呆症照护等高风险临床场景中是否表现出迎合性行为(sycophantic behavior),即响应质量是否会因提示(prompt)中社会期待信号(如权威暗示或确认性框架)的增强而下降,而非保持专业伦理标准。解决方案的关键在于通过系统化设计五组逐步强化权威性和确认性语境的提示(从P1中性到P5权威支持),并利用“LLM作为裁判”(LLM-as-a-Judge)方法对四个主流LLM(GPT-5、Claude Sonnet 4.6、Gemini 3.1 Pro、Mistral Large)生成的100条响应进行量化评估,结果发现所有模型均呈现显著负相关关系(Spearman相关系数ρ从-0.543至-0.734,p < 0.01),表明提示框架对响应质量具有强烈影响,尤其以Mistral Large最为明显(ρ = -0.734),其伦理评分从P1的6.0/7降至P5的0.2/7。这揭示了LLM在医疗应用中的情境敏感风险,并强调提示工程(prompt engineering)作为影响输出质量的核心变量,在当前医疗AI部署中尚未受到足够重视。
链接: https://arxiv.org/abs/2605.16288
作者: Christian Kolb
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 10 pages, 4 figures. Exploratory study
Abstract:Large language models (LLMs) are increasingly used in clinical and care settings. This exploratory study investigates whether LLMs exhibit sycophantic behavior - adapting their responses to social expectation signals rather than maintaining professional quality - in the context of dementia care. Five prompts with systematically increasing confirmatory and authority-related framing (P1 neutral to P5 authority-signaled implementation support) were submitted to four LLMs (GPT-5, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Large), each repeated five times (N = 100 responses). Responses were evaluated using an LLM-as-a-Judge methodology against seven nursing-ethical quality criteria (K1-K7) and a tone scale (0-3). All models showed significant negative Spearman correlations between prompt level and response quality (rho ranging from -0.543 to -0.734, all p 0.01). Mistral Large exhibited the most pronounced effect (rho = -0.734), with mean scores dropping from 6.0/7 at P1 to 0.2/7 at P5. The findings suggest that LLMs pose context-sensitive risks in high-stakes care environments and that prompt framing significantly shapes response quality - a dimension that has received insufficient attention in healthcare AI deployment.
[NLP-180] AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course
【速读】: 该论文试图解决的问题是:在英语作为学术用途(EAP)教学中,利用生成式AI检索增强生成(RAG)工具大规模创建的多媒体补充材料,究竟是作为一种有效的教学支架(pedagogical scaffold)还是低质量、高数量的“AI垃圾”(AI slop)。解决方案的关键在于:教师主动设计提示词(teacher-prompted),使用Google Notebook LM等工具生成视频、播客、信息图和个性化反馈报告,并通过混合方法研究(包括问卷调查、半结构化访谈及成绩相关性分析)验证这些材料对学生学习体验与成效的影响。研究发现,学生普遍认为这些材料具有高实用性与易用性,尤其偏好与评估挂钩的视觉和多模态内容(如视频和信息图),且视频偏好与学业成绩呈正相关;但认知负荷过高则显著负向影响成绩,说明材料复杂度需精准调控。此外,部分低分学生自发将AI生成内容作为补救性学习支架,表明当内容契合学生目标并符合认知负荷理论时,RAG技术可实现规模化个性化反馈,从而实质性提升EAP教学生态,而非产生无效内容。
链接: https://arxiv.org/abs/2605.16275
作者: David James Woo,Deliang Wang,Kai Guo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 23 pages, 7 figures
Abstract:Artificial intelligence (AI) retrieval-augmented generation (RAG) tools now enable educators to transform course materials into diverse multimedia at scale. However, it remains unclear whether such AI-generated content functions as a pedagogical scaffold or AI slop: high volume, low quality material. This innovative practice paper reports on the development, implementation, and evaluation of teacher-prompted, AI-generated supplemental materials in an English for Academic Purposes (EAP) course at a Hong Kong Community College. Using primarily Google Notebook LM, the instructor generated videos, podcasts, infographics, and individualized feedback reports from course materials and student work for 106 English as a Foreign Language learners. An explanatory sequential mixed-methods design comprising a survey, semi-structured interviews, and correlation analysis with academic scores was employed to examine students’ preferences, perceptions, and learning outcomes. Findings are framed through the Technology Acceptance Model and Cognitive Load Theory. Students rated the materials highly for perceived usefulness and ease of use, and preferred assessment-linked content presented in visual and multimodal formats, particularly videos and infographics. Video preference correlated positively with academic performance; however, higher cognitive load was negatively associated with course grades, indicating that material complexity must be carefully calibrated. Notably, some lower-performing students independently adopted the materials as remedial scaffolds. The practice demonstrates that RAG tools enable scalable personalized feedback that would be less feasible through traditional methods. When aligned with student goals and cognitive principles, teacher-prompted AI generation can meaningfully enhance the EAP learning ecosystem rather than producing AI slop.
[NLP-181] CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning ICML2026
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在复杂推理任务中依赖大量高质量人工标注数据的问题,这种依赖使得监督微调(Supervised Fine-Tuning, SFT)和基于特定推理数据的强化学习(Reinforcement Learning, RL)方法难以持续扩展。为突破这一瓶颈,作者提出了一种无需外部训练数据的协作式教练-玩家(Coach-Player)范式——CPMöbius,其核心创新在于将教练与玩家视为独立但协同的角色:教练通过生成针对性指令来提升玩家的能力,并根据玩家性能变化获得奖励;而玩家则因成功完成由教练设计的任务而获得奖励。该合作优化循环直接增强玩家的数学推理能力,实验表明该方法在不使用任何外部数据的情况下显著优于现有无监督方法,在Qwen2.5-Math-7B-Instruct模型上整体准确率提升4.9个百分点,分布外(OOD)准确率提升5.4个百分点,超越RENT和R-zero等基线方法。
链接: https://arxiv.org/abs/2602.02979
作者: Ran Li,Zeyuan Liu,Yinghao Chen,Bingxiang He,Jiarui Yuan,Zixuan Fu,Weize Chen,Jinyi Hu,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to the ICML 2026
Abstract:Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player’s capability and receives rewards based on changes in the Player’s performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player’s mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy. Our codebase has been released at this https URL.
信息检索
[IR-0] raditional statistical representations outperform generative AI in identifying expert peer reviewers
链接: https://arxiv.org/abs/2605.18752
作者: Vicente Amado Olivo,Tereza Jerabkova,Jakub Klencki,John Carpenter,Mario Malički,Ferdinando Patat,Louis-Gregory Strolger,Wolfgang Kerzendorf
类目: Information Retrieval (cs.IR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Digital Libraries (cs.DL)
备注:
Abstract:The exponential growth of scientific submissions has strained the peer review system. Despite the rapidly expanding global pool of researchers, this unprecedented scale has rendered the previous approach of manual expert identification unfeasible. Therefore, institutions have naturally turned to Large Language Models (LLMs) to automate intricate processes like expert reviewer identification. However, the reliability of these new models in accurately identifying domain experts lacks rigorous evaluation. We conduct a comprehensive empirical evaluation of statistical and AI-driven expertise identification methodologies to benchmark their reliability and limitations. Framing expert identification as an information retrieval problem, we utilize the distributed peer review system of a major international astronomical observatory, where proposal authorship serves as our proxy ground truth for domain expertise. Evaluating six retrieval methodologies utilized across observatories and computer science conferences, we demonstrate that traditional statistical representations outperform generative AI. Specifically, Term Frequency-Inverse Document Frequency successfully identified a labeled expert within the top 25 recommendations 79.5% of the time, compared to 51.5% for GPT-4o mini. Our results highlight that distinguishing subfield expertise requires fine-grained vocabulary, which is obscured by the semantic smoothing in generative methods. By establishing a rigorous evaluation framework for automated peer review, we demonstrate that transparent and reproducible statistical representations still outperform computationally expensive LLMs in specialized scientific tasks.
[IR-1] Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix
链接: https://arxiv.org/abs/2605.18561
作者: Santosh Kumar Radha,Oktay Goktas
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 19 pages, 12 figures. Code and artifacts: this https URL
Abstract:In retrieval-augmented coding, failures often begin when the relevant file is absent from the retrieved context. Under frozen generic tokenization, where a BM25 index has been built by a search system whose analyzer the practitioner does not control, this failure is routine: BM25’s logarithmic RSJ-odds IDF under-separates the identifier tail that distinguishes one function from another. We replace the outer logarithm of the Robertson-Spärck-Jones odds with a q-logarithm. At q=1 the transform recovers BM25 exactly by L’Hôpital’s rule, and for q1 it is a Box-Cox transform of the RSJ odds with lambda = 1-q. On CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 rises from 0.2575 to 0.4874 (absolute +0.2299; +89.3% relative; zero sign reversals in 10,000 paired-bootstrap resamples, reported as p = 10^-4). The effect is graded across code languages and is near-zero on BEIR text. A one-parameter closed form estimates a corpus-level q from hapax density and stays near q=1 on corpora where BM25 is already optimal. The index-time cost is a single pass over the sparse score matrix and query latency is unchanged. A tokenizer ablation shows that identifier-aware tokenization largely removes the incremental gain from q-IDF.
[IR-2] Vector RAG vs LLM -Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research
链接: https://arxiv.org/abs/2605.18490
作者: Theodore O. Cochran
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:We preregistered a comparison of two ways to help an LLM answer questions over a small research corpus: a single-round Vector RAG system and an LLM-compiled markdown wiki. Both systems answered the same 13 questions over 24 papers using the same answer-generating model, and their answers were scored by blinded LLM judges. The wiki scored much better at connecting findings across papers, but its advantage in answer organization was not strong after judge adjustment. RAG met the preregistered test for single-fact lookup questions. The clean query-side cost result went against the expected wiki advantage: under the tested setup, the wiki used far more query tokens than RAG, so it could not recover any upfront build cost through cheaper queries. Two exploratory analyses changed how we interpret the result. First, claim-level citation checking favored the wiki: its cited pages more often supported the exact claims being made, even though RAG scored better on the overall groundedness rubric. Second, a decomposition-based RAG variant recovered most of the wiki’s advantage on cross-paper synthesis at lower LLM-token cost, but it did not recover the wiki advantage in claim-by-claim citation support. The main conclusion is that grounded research synthesis is not a single capability. Systems can differ in how well they organize evidence, how well their citations support each claim, and how much they cost to run. In this study, no architecture was best on all three. Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR) Cite as: arXiv:2605.18490 [cs.CL] (or arXiv:2605.18490v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.18490 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Theodore Cochran [view email] [v1] Mon, 18 May 2026 14:41:16 UTC (19 KB)
[IR-3] IGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval
链接: https://arxiv.org/abs/2605.18434
作者: Xinyu Sun,Huangyu Dai,Lingtao Mao,Zexin Zheng,Zihan Liang,Ben Chen,Chenyi Lei,Wenwu Ou
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity – a visual query must match image–text items, and a granularity disparity – a cropped query must be compared with full images containing background context and possible distractors. Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection, but are vulnerable to backgrounds or irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection for retrieval. We further introduce dual distillation objectives that preserve target-region spatial consistency and query–item similarity structure, yielding more stable and discriminative multimodal representations. In addition, we construct ECom-RF-IMMR, a realistic benchmark suite with a 10M-pair training set and two evaluation benchmarks covering standard and cluttered item layouts. TIGER-FG improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks, respectively, with only 85.7M query-side parameters and 256-dim embeddings. Results on public e-commerce benchmarks further demonstrate its generalization to noisy and one-to-many retrieval scenarios. Code and data will be released.
[IR-4] SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
链接: https://arxiv.org/abs/2605.18299
作者: Yufei Ma,Zihan Liang,Ben Chen,Zhipeng Qian,Huangyu Dai,Lingtao Mao,Xuxin Zhang,Chenyi Lei,Wenwu Ou
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen–Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO’s coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.
[IR-5] From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG ICML2026
链接: https://arxiv.org/abs/2605.18271
作者: Changmin Lee,Jaemin Kim,Taesik Gong
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to ICML 2026. Code and data are available at this https URL
Abstract:With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404 times, improves preference-following accuracy by 20.17 percentage points, and achieves 33.33 times lower retrieval latency over the best-performing baseline. In our on-device experiment, EPIC maintains a memory footprint under 1 MB with 29.35 ms/query latency in streaming updates.
[IR-6] RCTEA: Richness-guided Co-training for Temporal Entity Alignment
链接: https://arxiv.org/abs/2605.18255
作者: Jiayun Li,Wen Hua,Shiqi Fan,Fengmei Jin,Haiyang Jiang,Xue Li
类目: Information Retrieval (cs.IR)
备注:
Abstract:Temporal Entity Alignment (TEA), which aims to identify equivalent entities across Temporal Knowledge Graphs (TKGs), is crucial for integrating knowledge facts from multiple sources. However, existing TEA models often fail to capture the orthogonal yet complementary effects between structural and temporal features, and typically overlook the importance of information richness, a key factor for effective message passing in neural feature encoders. To address these limitations, we propose the RCTEA framework, which jointly models both structural and temporal aspects of TKGs for entity alignment. Specifically, we design a richness-guided attention mechanism along with an adaptive weighting strategy to facilitate effective feature fusion. To ensure robust alignment despite noisy entity contexts, we introduce a dual-view neighborhood consensus algorithm that jointly refines the feature encoders to enforce local structural consistency of the predicted alignments. Extensive experiments demonstrate the superiority of RCTEA, achieving state-of-the-art performance on public TEA benchmarks.
[IR-7] SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark
链接: https://arxiv.org/abs/2605.18232
作者: Khalid Yusuf Dahir
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 16 pages, 6 figures, 6 tables. Code: this https URL Dataset: this https URL
Abstract:Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2’s “cleaned” Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4’s cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
[IR-8] PIPER: Content-Based Table Search via profiling and LLM -Generated Pseudoqueries
链接: https://arxiv.org/abs/2605.18199
作者: Riccardo Terrenzi,Matteo Falconi,Serkan Ayvaz,Pierluigi Plebani
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, accepted at DEXA’26
Abstract:The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering, where the goal is to select a single table to answer a question, rather than retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval. Designed for dataset search in poor-metadata settings, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.
[IR-9] An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments
链接: https://arxiv.org/abs/2605.18133
作者: Hongjang Yang,Hyunsik Na,Daeseon Choi
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 9 pages, 2 figures
Abstract:LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user’ s task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation details including how a trajectory is actually managed during its processing for a query. We first analyze how an attacker can hijack an agent’ s intended task by crafting external content that appears benign to the victim while inducing the agent to execute an attacker-defined objective. We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker’ s objective. We compare its attack success rate with a prior fake-completion technique. Finally, we demonstrate a proof-of-concept data-exfiltration chain using fictitious personal information in a controlled setting. Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.
[IR-10] Modality-Aware Identity Construction and Counterfactual Structure Learning for ID-Free Multimodal Recommendation
链接: https://arxiv.org/abs/2605.18044
作者: Hongjian Ma,Wenxin Huang,Yan Zhang,Zhifei Li,Zheng Wang
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: 11 pages, 5 figures, submitted to IEEE Transactions on Multimedia
Abstract:Multimodal recommendation has attracted extensive attention by leveraging heterogeneous modality information to alleviate data sparsity and improve recommendation accuracy. Existing methods have attempted to replace ID embeddings with multimodal features and have achieved promising preliminary results. However, these methods still exhibit the following two limitations: (1) the reconstructed ID representations remain relatively static and fail to fully exploit multimodal semantics; and (2) the graph learning process is insufficient in mining latent long-tail semantic relations and is easily affected by popularity bias. To address these issues, we propose a novel method named Modality-Aware Identity Construction and Counterfactual Structure Learning for ID-free Multimodal Recommendation (MAIL). Specifically, we design a modality-aware identity construction module that dynamically modulates positional encodings with multimodal semantics to construct content-aware ID-free identity representations. Then, we propose a counterfactual structure learning paradigm that mines low-exposure semantic neighbors via popularity penalization and alleviates popularity bias. Extensive experiments are conducted on five public Amazon datasets. Experimental results show that MAIL achieves average improvements of 7.81% in Recall@10 and 12.81% in NDCG@10 compared with the baseline models. Our code is available at this https URL.
[IR-11] owards Sustainable Growth: A Multi-Value-Aware Retrieval Framework for E-Commerce Search
链接: https://arxiv.org/abs/2605.17994
作者: Yifan Wang,Yixuan Wang,YiDan Liang,Qiang Liu,Fei Xiao
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:New item growth is critical for maintaining a healthy ecosystem in large-scale e-commerce platforms. However, existing systems tend to prioritize presenting users with already popular items, a phenomenon often referred to as the “Matthew effect”. In the context of search retrieval, current cold-start models suffer from the misalignment between training objectives and online business metrics, and they lack effective mechanisms to measure an item’s growth potential. In this paper, we propose a Multi-Value-Aware retrieval framework tailored for e-commerce search, designed to better align with the cascaded online values across different stages of the search system while balancing immediate conversion and long-term item growth. Our framework GrowthGR consists of two key components: an Item Long-term Transaction Value Prediction (ItemLTV) module and a Multi-Value-Aware Generative Retrieval (MultiGR) module. First, in the ItemLTV module, we employ counterfactual inference to quantify the long-term value increment attributable to a single user interaction. Second, in the MultiGR module, building upon a semantic-ID-based generative retrieval architecture, we leverage structured samples with the search cascade signals and adopt a Multi-Value-Aware Policy Optimization (MoPO) training paradigm to align with multi-stage online values, while explicitly balancing short-term transactional value and long-term growth potential estimated by ItemLTV. We successfully deployed GrowthGR on Taobao’s production platform, achieving a substantial 5.3% lift in new item GMV while delivering a non-trivial 0.3% gain in overall search GMV. Extensive online analysis and A/B testing demonstrate its positive impact on the overall ecosystem value.
[IR-12] xt-Video Retrieval With Global-Local Contrastive Consistency Learning
链接: https://arxiv.org/abs/2605.17959
作者: Xiaolun Jing,Xinxing Yang,Genke Yang
类目: Information Retrieval (cs.IR)
备注:
Abstract:Text-video retrieval aims to find the most semantically similar videos with given text queries. However, since videos contain more diverse content than texts, the main semantics expressed by each text-video pair is often partially relevant. The primary methods involve the utilization of language-video attention module to align texts and videos. Though effective, this paradigm inevitably introduces prohibitive computational overhead, resulting in inefficient retrieval. In this paper, we propose a simple yet effective method called Global-Local Contrastive Consistency Learning (GLCCL) to achieve texts and videos semantics alignment. Specifically, we design a parameter-free Global-Local Interaction Module (GLIM) to generate semantic-related frame and video features in a text-guided manner. Furthermore, a Contrastive Score Consistency (CSC) loss is developed to promote consistency learning among different scores on positive pairs and suppress consistency learning on negative pairs. Empirical evidence suggests that CSC loss provides the model with robust discriminative power between positives and hard negatives. Extensive experiments on three benchmark datasets, including MSR-VTT, DiDeMo and VATEX, demonstrate the effectiveness and superiority of our approach.
[IR-13] Agent ic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap
链接: https://arxiv.org/abs/2605.17903
作者: Akash Kumar Panda,Olaoluwa Adigun,Bart Kosko
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 15 pages, 6 figures
Abstract:We automatically generate feedback causal fuzzy cognitive maps (FCMs) from text by teaching large-language-model agents to break the text into overlapping chunks of text. Convex mixing of these chunk FCMs gives a representative cyclic FCM knowledge graph. The text chunks can have different levels of overlap. The chunk FCMs still mix to form a new FCM causal knowledge graph. The mixing technique scales because it uses light computation with sparse causal chunk matrices. The mixing structure allows an operator-level type of Bayesian inference that produces “de-chunked” or posterior-like FCMs from the mixed FCM. These de-chunked FCMs are useful in their own right and allow further iterations of Bayesian updating. We demonstrate these mixing techniques on the essay text of Allison’s “Thucydides Trap” model of conflict between a dominant power such as the United States and a rising power such as China. The FCM dynamical systems predict outcomes as they equilibrate to fixed-point or limit-cycle attractors. Seven out of 8 FCM knowledge graphs predicted a type of war when we stimulated them by turning on and keeping on the concept node that stands for the rising power’s ambition and entitlement. Gemini 3.1 LLMs served as the chunking AI agents.
[IR-14] DADF: A Distribution-Aware Debiasing Framework for Watch-Time Regression in Recommender Systems
链接: https://arxiv.org/abs/2605.17863
作者: Yiqing Yang,Xinlong Zhao,Zhao Liu,Xiao Lv,Ruiming Tang,Han Li,Kun Gai
类目: Information Retrieval (cs.IR)
备注: 12 pages, 7 figures, 3 tables
Abstract:Watch-time prediction is a central regression task in short-video recommender systems, where labels are highly long-tailed and residual errors vary systematically across observed watch-time regions. In practice, a model may appear globally calibrated while still overestimating short views and underestimating long views, because opposite errors cancel out in aggregate. Existing methods mainly improve the first-stage watch-time predictor, but often leave such residual distributional bias insufficiently corrected. We propose DADF, a distribution-aware debiasing framework for watch-time regression. Instead of replacing a deployed predictor, DADF performs second-stage multiplicative residual correction on top of it. DADF combines three complementary designs: a dynamic distribution-aware transformation for stabilizing long-tailed correction targets, a debias-factor-aware module for modeling heterogeneous residual patterns using inference-time observable factors, especially video duration, and a multi-label-aware module that exploits auxiliary prediction signals from engagement heads. We evaluate DADF on public short-video benchmarks and a large-scale industrial ranking system. DADF consistently improves both pointwise accuracy and ranking quality across datasets and backbones. In the industrial setting, it achieves a 1.88 percentage-point WUAUC gain over the production baseline, reduces MAE by 12.57%, and yields a statistically significant 0.347% lift in average time spent per device in online A/B testing. These results demonstrate that DADF effectively mitigates local calibration bias and provides a practical plug-in solution for debiasing long-tailed continuous targets. The source code is available at this https URL.
[IR-15] Uncertainty-Calibrated Recommendations for Low-Active Users KDD KDD2026
链接: https://arxiv.org/abs/2605.17788
作者: Bob Junyi Zou,Sai Li,Tianyun Sun,Wentao Guo,Qinglei Wang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to the Applied Data Science (ADS) track at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
Abstract:A fundamental challenge in recommender systems is balancing reliability for Low-Active Users (LAUs) with diversity for High-Active Users (HAUs). The key to this balance lies in quantifying model uncertainty, which approximates the risk of prediction errors and reveals the limits of the model’s current knowledge. On large-scale short-video and livestream platforms, model uncertainty can warn of low-quality recommendations that may lead to disengagement of LAUs and at the same time identify opportunities to diversify content recommendation for HAUs. To leverage this dichotomy, we introduce a unified, production-ready framework that calibrates uncertainty to drive differentiated strategies. Specifically, we implement a model-uncertainty-based risk-averse deboosting policy for LAUs to suppress unreliable recommendations, while employing a risk-seeking Upper Confidence Bound (UCB) strategy for HAUs to encourage exploration. Validated on a major livestream platform, our framework demonstrates significant improvements in retention (active hours) and satisfaction (quality watch time ratio) for LAUs as well as remarkable increases in interest diversity and category coverage for HAUs, proving the value of uncertainty-aware recommendation in industrial settings.
[IR-16] MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation ACL2026
链接: https://arxiv.org/abs/2605.17640
作者: Debashish Chakraborty,Dengjia Zhang,Jialiang Jin,Hanting Liu,Katherine Guerrerio,Hanxiang Qin,Tyler Skow,Alexander Martin,Reno Kriz,Benjamin Van Durme
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as an oral presentation at the ACL 2026 Workshop MAGMaR Systems. 27 pages, 4 figures. Code can be found here: this https URL
Abstract:Retrieval-augmented generation from videos requires systems to retrieve relevant audiovisual evidence from large corpora and synthesize it into coherent, attributed text. Current approaches struggle at both ends: retrieval methods fail on complex, multi-faceted queries that cannot be captured by a single embedding, while generation methods lack the high-level reasoning needed to synthesize across multiple videos and face memory constraints over long, multi-video contexts. We present MARQUIS: a three-stage pipeline that addresses these limitations through (1) query expansion, fusion, and reranking, (2) calibrated structured evidence extraction, and (3) article generation from extracted evidence, optionally controlled by an RLM. On the MAGMaR2026 shared task, we improve retrieval performance from 0.195 to 0.759 (nDCG@10). For article generation, ITER-QA-BASE improves average human score from 3.09 to 3.83 over the CAG baseline, while MARQUIS-RLM achieves a human score of 3.30 and the strongest citation recall among non-QA systems.
[IR-17] mporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations
链接: https://arxiv.org/abs/2605.17639
作者: Volodymyr Ovcharov
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 12 pages, 8 figures, 4 tables. Dataset: this https URL
Abstract:Co-citation structure is widely assumed to provide stable retrieval signal in legal information systems. We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court decisions. Using a leave-one-out protocol over the full bipartite citation graph, we find that Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split (from 0.51 to 0.27) confirming genuine temporal decay rather than compositional shift or evaluation artifact. The decay is non-uniform: criminal procedure maintains stable co-citation patterns (MRR ~0.40), while civil law degrades from 0.35 to 0.15, coinciding with the 2017 judicial reform. Hub articles (100K citations) resist decay, but mid-frequency articles (1K-10K) – the practical retrieval frontier lose half their predictability. A BM25 text baseline decays even faster (31%), and embedding drift analysis with E5-large reveals a 4.3% semantic shift in how articles are cited, providing a mechanistic explanation for the observed decay. The benchmark is released at this https URL.
[IR-18] Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP LREC2026
链接: https://arxiv.org/abs/2605.17442
作者: Zhiyin Tan,Changxu Duan
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)
Abstract:Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at (this https URL).
[IR-19] IVF-TQ: Streaming-Robust Approximate Nearest Neighbor Search via a Codebook-Free Residual Layer
链接: https://arxiv.org/abs/2605.17415
作者: Tarun Sharma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:We propose IVF-TQ, an IVF index with a codebook-free residual layer: a fixed random rotation followed by precomputed Lloyd-Max scalar quantization depending only on (b, d). Only the IVF coarse partition is trained. Building on TurboQuant (Zandieh et al., 2025), the design substantially reduces a key failure mode of trained-codebook ANN indexes (PQ, OPQ, ScaNN): staleness under streaming this http URL (3 seeds): Per-batch PQ retraining does not recover the streaming gap at any tested bit budget (paired-t p 0.28 everywhere). On streaming Deep-10M, IVF-TQ holds at 87.4% - 86.6% (Delta = -0.80 +/- 0.10pp) while IVF-PQ degrades -3.23pp. A shuffled-i.i.d. control on SIFT-1M shows IVF-PQ losing -3.9pp without distribution shift. At higher PQ bit budgets (~1.5x IVF-TQ memory), absolute recall favors PQ as expected from rate-distortion (+6.1pp Deep-10M; +2.0pp SIFT-10M); the durable IVF-TQ benefit is operational (no codebook to retrain), robust across memory this http URL art: IVF around a codebook-free residual quantizer is architecturally not new – IVF-RaBitQ ships in Milvus, cuVS, LanceDB, Weaviate; Shi et al. (2026) is concurrent GPU work. TurboQuant itself tests only flat-rotation this http URL: (i) A multi-seed streaming-operational story for codebook-free IVF: 10M-scale evidence across PQ memory budgets. (ii) A uniform-over-sphere IP-error bound for the TQ residual quantizer with one fixed rotation (proof sketch in v1; rigorous in v2). (iii) Adaptive IVF-TQ: a partition-only refresh recovering 67% - 97.8% under worst-case rotation shift with re-ranking (90.3% without).Code, data: this https URL
[IR-20] xt-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation KDD KDD2026
链接: https://arxiv.org/abs/2605.17366
作者: Yufei Guo,Jing Ma,Tianlu Zhang,Shijie Yang,Yanlong Zang,Weijie Ding,Pinghua Gong,Jungong Han
类目: Information Retrieval (cs.IR)
备注: 12 pages, 5 figures. Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026). Pre-camera-ready version
Abstract:Multimodal item embeddings are crucial for e-commerce item-to-item (I2I) retrieval, yet real-world product images often contain promotional overlays and background clutter that inject spurious visual cues and degrade retrieval robustness. This issue is particularly pronounced in MLRM-style pipelines, where a frozen vision encoder is connected to an LLM through a lightweight connector that must selectively aggregate visual tokens. We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance for visual token extraction while preserving complementary visual evidence. Concretely, TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module to adaptively calibrate their contributions under noisy inputs. Experiments on large-scale, real-world e-commerce datasets with full-pool retrieval show that TGQ-Former consistently outperforms strong connector baselines and end-to-end MLLMs. On average, it improves Hit Rate@100 (H@100) by 6.04%, demonstrating the effectiveness of text-guided visual encoding for robust multimodal retrieval.
[IR-21] NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation
链接: https://arxiv.org/abs/2605.17364
作者: Joy Bose
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 17 pages, 2 figures, 7 tables, 1 appendix
Abstract:Media bias detection has predominantly been framed as a classification task: assign a political label to an article or outlet. We argue this framing is too shallow: it identifies that bias exists but not where, how, or crucially, what is structurally omitted. We present NewsLens, a five-agent adversarial pipeline for structured news bias navigation. A Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer collaborate to deconstruct articles into interpretable framing maps, exposing ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters (India-Pakistan Kashmir, Gaza, Climate Policy, Ukraine) using Qwen2.5-3B-Instruct (4-bit quantised, Google Colab T4), with cross-model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on Kashmir subset); conservative-framing outlets show the highest mean Manipulation Index (MI: 0.600 across both models). Cross-model comparison shows high consistency for high-propaganda content (Republic World delta-PDS=0.125, MI=0.8 both models) and greater variance for nuanced reporting. Mann-Whitney U tests find no statistically significant between-group differences at n=15, reported honestly as a sample-size limitation confirmed by post-hoc power analysis. A partial ablation removing the Propaganda Detector shows degraded omission precision in the Neutral Summarizer output. The architecture extends prior lexical-geometric bias work to agentic LLM reasoning, and is fully reproducible using open-weight models without API keys.
[IR-22] Dual-Diffusional Generative Fashion Recommendation SIGIR’26
链接: https://arxiv.org/abs/2605.17357
作者: Mingzhe Yu,Lei Wu,Qianru Sun,Yunshan Ma
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted by SIGIR’26
Abstract:Personalized generative recommender systems have emerged as a promising solution for fashion recommendation. However, existing methods primarily rely on implicit visual embeddings from historical interactions, which often contain preference-irrelevant information and result in insufficient user behavior modeling. Moreover, these models typically generate only item images, providing limited interpretability. To address these limitations, we propose DualFashion, a Dual-Diffusional Generative Fashion Recommendation Architecture that jointly models image and text modalities for personalized and explainable recommendation. DualFashion adopts a dual-diffusion Transformer with image and text branches, where structured attribute-level captions and visual outfit information are jointly used as conditioning signals to model user behavior. The proposed architecture produces both fashion item images and textual descriptions, ensuring visual compatibility while providing explicit semantic interpretability. Furthermore, we introduce a text-augmented fine-tuning strategy that enhances generation diversity and enables effective cross-modal knowledge transfer without incurring heavy computational costs. Extensive experiments on iFashion and Polyvore-U across Personalized Fill-in-the-Blank and Generative Outfit Recommendation tasks demonstrate that DualFashion achieves strong performance in behavior modeling, interpretability, and efficiency compared to state-of-the-art methods. Our code and model checkpoints are available at this https URL.
[IR-23] RAG R: Review-Augmented Generative Recommendation
链接: https://arxiv.org/abs/2605.17267
作者: Yingyi Zhang,Junyi Li,Yejing Wang,Wenlin Zhang,Xiaowei Qian,Sheng Zhang,Yue Feng,Yichao Wang,Yong Liu,Xiangyu Zhao,Xianneng Li
类目: Information Retrieval (cs.IR)
备注:
Abstract:Sequential recommendation (SR) is traditionally formulated as next-item prediction over a chronological sequence of interacted items. Although recent generative recommendation (GR) methods introduce new machinery, such as semantic IDs, autoregressive decoding, and unified token spaces, they largely inherit the same item-only modeling assumption. We argue that this design constitutes a structural bottleneck, because user decision-making is not purely behavioral: while item interactions reveal what users choose, review feedback often explain why they choose it by exposing latent evaluative factors. Motivated by this observation, we propose Review-Augmented Generative Recommendation (RAGR), a novel GR framework that incorporates review feedback directly into the generative user sequence rather than treating reviews as auxiliary side information. Specifically, RAGR introduces a Review-Augmented User Sequence Modeling mechanism that interleaves item semantic IDs and review semantic IDs in chronological order to construct a mixed behavioral-semantic sequence, enabling review signals to participate directly in autoregressive next-token generation. To preserve the recommendation objective, we further introduce an Item-Centric Task Generation Alignment strategy based on direct preference optimization (DPO), which encourages the model to favor item tokens over review tokens at prediction positions. Experiments on three real-world datasets show that RAGR yields consistent and significant gains over strong GR backbones across all metrics. Our code and data are available at \urlthis https URL. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2605.17267 [cs.IR] (or arXiv:2605.17267v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2605.17267 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-24] Unlocking Biological Workflows for Robust Protein-Text Question Answering: A Dual-Dimensional RAG Framework
链接: https://arxiv.org/abs/2605.17261
作者: Li Ding,Duanyu Feng,Chen Huang,Yangshuai Wang,Yang Li,Wenqiang Lei,See-Kiong Ng
类目: Information Retrieval (cs.IR)
备注:
Abstract:Protein-Text Question Answering (QA) is crucial for interpreting biological sequences through natural language. The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) that efficiently leverages biological databases and facilitates reasoning offers a potent approach for it. However, constrained by the standard RAG pipeline, these models often rely on curated, static datasets instead of expert-proven biological workflows, lacking the fine-grained information processing and struggling to generalize to novel (OOD) proteins. To bridge this gap, we propose 2D-ProteinRAG, a novel framework that empowers LLMs to operate within the gold-standard biological research workflow (BLAST). To further extract high-quality information from noisy retrieval contexts, we introduce a dual-dimensional (2D) filtering strategy following the expert analytical paradigms. Horizontal Fine-grained Attribute Alignment utilizes a lightweight, intent-aware discriminative filter to prune irrelevant metadata and align database entries with specific user queries. Vertical Homology-based Semantic Denoising resolves functional contradictions and redundancy across multiple homologs via hierarchical clustering. Extensive evaluations on both In-Distribution and diverse biological OOD benchmarks demonstrate that 2D-ProteinRAG consistently achieves state-of-the-art performance, outperforming fine-tuned baselines and other RAG methods. Our results validate the framework’s robustness and scalability, providing a practical solution for interpreting protein functions in real-world scientific scenarios.
[IR-25] Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders
链接: https://arxiv.org/abs/2605.16825
作者: Jun Yin,Bangguo Zhu,Peng Huo,Ruochen Liu,Hao Chen,Senzhang Wang,Shirui Pan,Chengqi Zhang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, Generative Recommenders (GRs), characterized by a unified end-to-end framework, have exhibited astonishing potential in transforming the recommendation paradigm. Despite their effectiveness, we recognize that GRs are still susceptible to the long-standing issue of popularity bias that has pervaded the recommendation community. Although a few studies have attempted to extend traditional debiasing methods to GRs, their effectiveness is marginal, and the fundamental reason why GRs suffer from popularity bias remains under-explored. To bridge this gap, this study focuses on two core aspects in GRs: the optimization of generative framework and the item tokenization based on semantic index. Based on theoretical analyses, we identify that the severe popularity bias emerges from the confluence of a token-level optimization flaw and the undifferentiated property of item tokenization. Accordingly, this study develops a novel generative recommender system, called Ghost, by designing the asymmetric unlikelihood optimization and the skeleton-founded tokenization. Extensive empirical evaluations across three datasets, alongside multiple SOTA baselines, reveal that Ghost substantially alleviates popularity bias and promotes fairer recommendations, while incurring slight degradation to the overall recommendation utility.
[IR-26] UniER: A Unified Benchmark for Item-level and Path-level Exercise Recommendation
链接: https://arxiv.org/abs/2605.16750
作者: Xinghe Cheng,Guiyong Zhuang,Yusheng Xie,Jiapu Wang,Yixin Liu,Quanlong Guan,Liangda Fang,Shirui Pan
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized exercise recommendation dynamically aligns pedagogical resources with individual knowledge mastery, which is crucial for satisfying students’ dynamic learning needs in modern education. The field is currently driven by two dominant paradigms: Item-Level Exercise Recommendation (ILER) optimizes for immediate single-step state transitions, while Path-Level Exercise Recommendation (PLER) constructs coherent learning paths to maximize cumulative gains. Despite sharing the same ultimate objective, disparate evaluation setups have kept these two lines of research isolated, hindering unified benchmarking and fair comparison. To fill the gap, in this paper, we present a Unified Benchmark for Exercise Recommendation (UniER), a comprehensive evaluation framework that unifies ILER and PLER. Specifically, we introduce Weighted Cognitive Gain (WCG) as a unified metric to measure cross-paradigm algorithmic performance. Our benchmark encompasses 9 datasets spanning four generation methods, facilitating the comparison of 18 representative ILER/PLER methods. Through multi-dimensional analyses covering effectiveness, generalizability, robustness, and efficiency, our results reveal the systematic dominance of PLER and expose the pedagogical failure of ILER’s fragmented recommendations under extreme sparsity and noise. Furthermore, we provide an open-source codebase of UniER to foster reproducible research and outline potential directions for future investigations.
[IR-27] Approximate Distributed Coded Computing: Polynomial Codes and Randomized Sketching
链接: https://arxiv.org/abs/2605.16744
作者: Neophytos Charalambides,Arya Mazumdar
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Signal Processing (eess.SP)
备注:
Abstract:Coded computing is a distributed paradigm that uses coding theory to introduce \textitredundancy and overcome bottlenecks in large-scale systems. In the same vein, randomized numerical linear algebra employs probabilistic methods to \textitcompress and accelerate linear algebraic operations, addressing challenges in high-dimensional data analysis. This article reviews the foundations of both fields and presents distributed schemes that combine techniques from both to speed up optimization and machine learning algorithms, in the presence of slow or non-responsive servers. Along the way, we touch on various related topics and mathematical concepts.
[IR-28] RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification
链接: https://arxiv.org/abs/2605.16535
作者: Lasal Jayawardena,Nirmalie Wiratunga,Ikechukwu Nkisi-Orji,Darren Nicol
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Industrial multi-label document understanding pipelines score candidate labels and threshold or rank them to form a label set per document. This early selection step directly affects the accuracy of downstream information extraction from the document, as well as the associated verification effort. In practice, OCR noise, label imbalance, instance-dependent label cardinality, and asymmetric error costs make global score thresholds brittle and hard to maintain as document formats evolve. We present RAPT, a deployment-oriented retrieval-augmented score thresholding wrapper, applied post-hoc to improve label set selection without retraining the underlying classifier. RAPT is a model-agnostic wrapper: any predictor that provides document representations for similarity search and per label confidence scores can be used, including metric learning encoders and fine-tuned transformer classifiers. For each query document, given a classifier’s score vector, RAPT retrieves similar document thresholding situations (cases) and adapts the query’s label set selection threshold using their outcomes. The adaptation selects the final label set by locally aggregating neighbour solutions (e.g. average label count, cutoff calibration). Evaluation compared multi-label classifiers (metric learners and transformers) combined with RAPT against global and label-wise thresholding baselines, and against few-shot LLMs. Across an industrial dataset and six public benchmarks, RAPT consistently outperformed global and label-wise static thresholding baselines. In the industrial setting, RAPT achieved its best predictive performance with metric learners, reaching 0.87 Macro-F1, while fine-tuned transformer variants on average achieved 0.775 Macro-F1, outperforming fewshot LLM baselines (K = 5) by 2x and requiring at least 115x less inference time and 13.5x less GPU memory.
[IR-29] Policy-Grounded Dynamic Facet Suggestions for Job Search
链接: https://arxiv.org/abs/2605.16479
作者: Dan Xu,Baofen Zheng,Qianqi Shen,Jianqiang Shen,Wenqiong Liu,Chunnan Yao,Ping Liu,Rajat Arora,Kevin Kao,Hsiang Lin,Wanjun Jiang,Yusuke Takebuchi,Jingwei Wu,Wenjing Zhang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 6 pages
Abstract:Job seekers often initiate search with short, underspecified queries. At LinkedIn, over 80% of job-related queries contain three or fewer keywords, making accurate user intent inference and relevant job retrieval particularly challenging. We present dynamic facet suggestion (DFS), an interactive query refinement mechanism that facilitates intent disambiguation by surfacing personalized semantic attributes conditioned on the joint user-query context in real time. We propose a policy-grounded, retrieval-augmented ranking framework for facet suggestion, comprising offline taxonomy curation, embedding-based retrieval of top-K candidates, and distilled small language model (SLM) based candidate scoring. The system is optimized for real-time serving via pointwise single-token scoring with batching and prefix caching. Offline evaluation demonstrates high precision for generated suggestions, and online A/B tests show significant improvements in suggestion engagement and job search outcomes.
[IR-30] LERA: LLM -Enhanced RAG for Ad Auction in Generative Chatbots
链接: https://arxiv.org/abs/2605.16474
作者: Haoran Sun,Xinrui Song,Xinyu Zhang,Zhaohua Chen,Xu Chu,Zhilin Zhang,Chuan Yu,Jian Xu,Bo Zheng,Xiaotie Deng
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Work in Progress
Abstract:The integration of advertising auction mechanisms into large language model (LLM)-based chatbots presents a significant opportunity for commercialization, yet poses unique challenges in balancing relevance, efficiency, and user experience. Recently, Feizi et al.~\citepfeizi2023online and Hajiaghayi et al.~\citephajiaghayi2024ad outlined a retrieve-then-generate paradigm that decouples retrieval and generation, offering lightweight ad insertion and payment determination. However, current retrieval relies solely on text embedding similarity, which may lead to commercial misinterpretation and issues such as repetitive insertions. In this paper, we propose LERA, a two-stage retrieve-then-generate auction framework tailored for LLM chatbots. In the first stage, embedding-based coarse filtering pre-selects a small set of candidate advertisers. In the second stage, the LLM itself is queried with a carefully designed prompt to produce logits over candidates, which serve as refined organic relevance scores. These scores are combined with bids, and a critical-value payment rule accounts for both the coarse-filtering and fine-ranking thresholds, ensuring truthfulness for utility-maximizing advertisers. The framework naturally extends to multiple ad insertions within dynamic dialogue flows and long responses. Experiments on a synthetic advertiser-query benchmark show that LERA substantially improves ad selection accuracy and insertion diversity while incurring only controllable latency overhead.
[IR-31] he Impact of AI Search on the Online Content Ecosystem: Evidence from Google and Reddit
链接: https://arxiv.org/abs/2605.16428
作者: Peibo Zhang,Ruomeng Cui,Dennis J. Zhang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Search engines traditionally complement online content platforms by directing users seeking information to external websites. The emergence of generative AI search tools that summarize answers directly on the results page may disrupt this relationship by making visits to source platforms optional. We study this question using Google AI Overviews and Reddit, one of the largest online discussion platforms. Our identification exploits Google’s content moderation policy: Safe-for-Work (SFW) Reddit communities are indexed by Google organic search and surfaced in Google AI Overviews, while Not-Safe-for-Work (NSFW) communities, though indexed by organic search, are prohibited from being referenced in AI Overview summaries. Using a difference-in-differences design, we find that AI Overviews increase engagement in SFW communities: daily comments rise by 12.0 percent and the number of commenting users by 12.3 percent relative to NSFW communities. The effects are concentrated in experience-based discussions (opinions, advice, and personal experiences) rather than fact-based information. However, the subsequent introduction of Google AI Mode, which allows users to interact conversationally with the AI summary, largely eliminates these gains in experience-based content. These results suggest that the effects of AI search depend critically on interface design and types of content.
[IR-32] LARGER: Lexically Anchored Repository Graph Exploration and Retrieval
链接: https://arxiv.org/abs/2605.16352
作者: Yuntong Hu,Tongli Su,Liang Zhao,Bowen Zhu,Hasibul Haque
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Repository-level coding agents must first localize the files and symbols relevant to a task; failures at this stage can cascade across downstream objectives ranging from patch generation to test writing and codebase question answering. Existing agents navigate repositories primarily through lexical search, often missing structural relations such as imports, call chains, type hierarchies, and code-test links. Graph-based retrieval can recover such dependencies, but existing approaches often require separate graph tools or traversal stages that fragment the agent’s interaction loop. We formalize repository context localization as Lexically Anchored Structural Localization, where success depends on turning lexical matches into high-precision structural entry points and exposing the most useful confidence-filtered local neighborhoods within the agent’s existing search loop. We introduce LARGER (Lexically Anchored Repository Graph Exploration and Retrieval), a lexically anchored active-set retrieval framework that starts from lexical matches, aligns them to graph anchors, and performs confidence-filtered local expansion within the agent’s existing search loop. LARGER integrates directly into existing CLI coding agents without requiring external graph databases or specialized graph interfaces. Across four benchmarks spanning localization, test generation, and codebase understanding, LARGER improves file-level Acc@5 on LocBench by +13.9 points with tuned hyperparameters and still gains +11.8 points with fixed hyperparameters over the strongest baseline, while delivering consistent gains on MuLocBench, SWE-Atlas Test Writing, and SWE-Atlas Codebase QA.
[IR-33] A Production-Ready RL Framework for Personalized Utility Tuning with Pareto Sweeping in Pinterest Recommender Systems
链接: https://arxiv.org/abs/2605.16344
作者: Yichu Zhou,Mehdi Ben Ayed,Lin Yang,Jiacong He,Andreanne Lemay,Jiaye Wang,Jaewon Yang,Josie Zeng,Dhruvil Deven Badani,Yijie Dylan Wang,Jiajing Xu,Charles Rosenberg
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Large-scale recommenders encode multi-objective trade-offs by combining multiple predicted outcomes into a single utility score. Although this utility layer can be updated independently of the ranker, weight tuning remains largely manual, globally applied, slow to adapt to changing environments and business needs, and hard to govern as priorities shift. We propose PRL-PUTS, a Production-ready, ranker independent RL framework for Personalized Utility-weight Tuning with Pareto Sweeping. We cast utility tuning as a one-step, value-based RL problem: given request context, an agent selects a utility-weight vector that re-weights ranker predictions to maximize request-level engagement rewards. To visualize performance across the trade-off spectrum and allow decision makers to update the deployed operating policy instantly, we adopt an inference-time Pareto frontier sweeping via a scalarization parameter, producing a family of policies and an empirical Pareto frontier used as a governance artifact for operating policy selection. PRL-PUTS runs in parallel with ranking inference without adding serving latency. We validate PRL-PUTS with offline analysis using unbiased exploration logs and online experiments on Pinterest Homefeed where PRL-PUTS showed significant increases in engagement compared to baseline such as +0.13% increase in successful session, a core metric for user engagement.
[IR-34] SotA Lens: A Network-Augmented Methodology and Tool for Exploratory State-of-the-Art Reviews
链接: https://arxiv.org/abs/2605.16333
作者: Diogo Peralta Cordeiro
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: 11 pages, 3 figures, 2 tables; original methodology/software paper with proof-of-concept case study; software DOI: https://doi.org/10.5281/zenodo.19860899
Abstract:Researchers often begin new projects by conducting a broad State-of-the-Art review before they are ready to define the narrow protocol required by a systematic review. This is especially common in multidisciplinary areas where terminology is unstable, communities are weakly connected, and relevant work is dispersed across technical and application domains. This paper presents SotA Lens, a network-augmented methodology and lightweight software toolkit for exploratory State-of-the-Art reviews. The approach combines documented seed search, DOI-level metadata resolution, bounded citation expansion, directed graph construction, community detection, ranking of authors and subject terms, and human labelling of research communities. It is designed to complement, not replace, established review protocols such as PRISMA, PRISMA-ScR, systematic mapping studies, and bibliometric science mapping. The method is demonstrated through a proof-of-concept review of Dynamic Projection-Mapping and Spatial Augmented Reality. Starting from approximately 200 seed search results, the workflow produced a citation graph with 2,198 DOI-level vertices and 8,249 reference edges; a filtered largest component for 2010-2023 contained 986 vertices, 2,693 edges, and sixteen labelled communities. The contribution is both methodological and practical: SotA Lens helps researchers map broad fields, identify clusters and gaps, and produce auditable review artifacts before committing to a narrower systematic review protocol. This paper is not intended as a domain survey of Dynamic Projection-Mapping or Spatial Augmented Reality; rather, it introduces and demonstrates an original review-support methodology and software artifact using that domain as a proof-of-concept case study.
[IR-35] How algorithmic popularity bias hinders or promotes quality
链接: https://arxiv.org/abs/1707.00574
作者: Azadeh Nematzadeh,Giovanni Luca Ciampaglia,Filippo Menczer,Alessandro Flammini
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注:
Abstract:Algorithms that favor popular items are used to help us select among many choices, from engaging articles on a social media news feed to songs and books that others have purchased, and from top-raked search engine results to highly-cited scientific papers. The goal of these algorithms is to identify high-quality items such as reliable news, beautiful movies, prestigious information sources, and important discoveries — in short, high-quality content should rank at the top. Prior work has shown that choosing what is popular may amplify random fluctuations and ultimately lead to sub-optimal rankings. Nonetheless, it is often assumed that recommending what is popular will help high-quality content “bubble up” in practice. Here we identify the conditions in which popularity may be a viable proxy for quality content by studying a simple model of cultural market endowed with an intrinsic notion of quality. A parameter representing the cognitive cost of exploration controls the critical trade-off between quality and popularity. We find a regime of intermediate exploration cost where an optimal balance exists, such that choosing what is popular actually promotes high-quality items to the top. Outside of these limits, however, popularity bias is more likely to hinder quality. These findings clarify the effects of algorithmic popularity bias on quality outcomes, and may inform the design of more principled mechanisms for techno-social cultural markets.
人机交互
[HC-0] Contextualized Dynamic Explanations: A Vision
链接: https://arxiv.org/abs/2605.18698
作者: Zhicheng Liu,Jason H Li,Greg Briskin
类目: Human-Computer Interaction (cs.HC)
备注: 1st Workshop on GenAI, Agents, and the Future of VIS, 2025
Abstract:Asynchronous data-driven explanations often fail because the content and presentation are not tailored to the target audience, and they provide limited opportunities for active audience engagement. We present a vision for Contextualized Dynamic Explanations (CODEX), an agentic approach to dynamically generating contextualized multi-modal information interfaces for effective data-driven explanations based on an evolving audience model and a predefined communication intent. The premise underlying CODEX is that it is impossible for communicators to anticipate the full range of interactive scenarios involving the target audience. This observation motivates a set of research challenges focused on developing autonomous agents capable of evaluating communication progress, making context-sensitive decisions, and producing effective information representations.
[HC-1] Evaluating Multi-turn Human-AI Interaction ACL2026
链接: https://arxiv.org/abs/2605.18660
作者: Shi Ding,Sijian Tan
类目: Human-Computer Interaction (cs.HC)
备注: ACL 2026 EvalEval Workshop
Abstract:Large language models (LLMs) are increasingly used as collaborative assistants, yet dominant NLP evaluation practices remain centered on aggregate metrics such as accuracy and fluency. These approaches often overlook behaviors that are critical in human-facing settings (e.g., consistency across multiple turns and iterative refinement). In this paper, we examine limitations of current NLP evaluation practices and introduce TCR, a structured framework for evaluating human–AI interaction using educational LLM assistants as an illustrative example. TCR emphasizes dimensions such as transparency, consistency, and refinement. We further present structured evaluation prompts and illustrative interaction examples demonstrating how structured evaluation can complement aggregate metrics and LLM-as-a-judge approaches. Our work highlights the need for more human-centered evaluation practices for interactive LLM systems.
[HC-2] OrganicHAR: Towards Activity Discovery in Organic Settings for Privacy Preserving Sensors Using Efficient Video Analysis
链接: https://arxiv.org/abs/2605.18455
作者: Prasoon Patidar,Riku Arakawa,Ricardo Graça,Rúben Moutinho,Adriano Soares,Ana Vasconcelos,Filippo Talami,Joana Couto da Silva,Inês Silva,Cristina Mendes Santos,Mayank Goel,Yuvraj Agarwal
类目: Human-Computer Interaction (cs.HC)
备注: 23 pages, 14 figures, with a 4-page appendix containing 2 additional figures. To appear in Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. (IMWUT), Vol. 9, No. 4, Article 203 (December 2025). DOI: https://doi.org/10.1145/3770674
Abstract:Deploying human activity recognition (HAR) at home is still rare because sensor signals vary wildly across houses, people, and time, essentially requiring in-situ data collection and training. Prior approaches use cameras to generate training labels for privacy-preserving sensors (LiDAR, RADAR, Thermal), but this forces sensors to detect predefined activities that cameras can see yet the sensors themselves cannot reliably distinguish. In this work, we introduce OrganicHAR, an activity discovery framework that inverts this relationship by placing sensor capabilities at the center of activity discovery. Our approach identifies naturally occurring signal patterns using privacy-preserving sensors, leverages Vision Language Models (VLMs) only during these key moments for scene understanding, and discovers discrete activity labels at granularities that these sensors can reliably detect. Our evaluation with 12 participants demonstrates OrganicHAR’s effectiveness: it achieves 79% accuracy for coarse (4-5) activities using only basic ambient sensors (radar, lidar, thermal arrays), and 73% accuracy for fine-grained (8-9) activities when a wearable IMU, depth, and pose sensor are added. OrganicHAR maintains 77% accuracy on average across configurations while discovering 4-8 categories per user (15 across all users) tailored to each environment and sensor capabilities. By triggering video processing only at key moments identified by local sensors, we reduce queries to VLM by 90%, enabling practical and privacy-preserving activity recognition in natural settings.
[HC-3] he Hidden Cost of Contextual Sycophancy: an AI Literacy Intervention in Human-AI Collaboration
链接: https://arxiv.org/abs/2605.18372
作者: Cansu Koyuturk,Sabrina Guidotti,Dimitri Ognibene
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注: SPRINGER AIED 2026: Accepted for LBR, poster presentation at the 27th International Conference on Artificial Intelligence in Education, 27 Jun - 3 Jul 2026, Seoul, Republic of Korea
Abstract:Large Language Models (LLMs) are increasingly used in educational settings as interactive tools for collaboration. However, their tendency toward sycophancy, aligning with user beliefs even when incorrect, raises concerns for learning and decision-making, especially for less knowledgeable users. This study investigates how sycophantic alignment emerges in authentic multi-turn human-AI interactions and whether interventions targeting increasing AI literacy and prompting competencies can mitigate its effects. In a controlled mixed-design experiment, 60 participants completed analytical survival ranking tasks by first generating individual rankings and then making final decisions after collaborating with an AI assistant, both before and after receiving either general or sycophancy-focused prompting training. Preliminary results show that LLMs are highly sensitive to user input: lower-quality initial responses lead to poorer AI advice, suggesting that the model mirrors or incorporates user reasoning rather than correcting it or offering better alternatives that are missing or less frequent in the conversation. Critically, the propagation of user errors into AI responses significantly reduced both the quality of AI feedback and final user task performance, revealing a form of contextual sycophantic dependence. While the intervention did not eliminate the propagation of contextual errors, it significantly improved AI advice by reducing the direct mirroring of incorrect user rankings. These findings suggest that prompting and AI literacy alone may be insufficient to ensure epistemically independent AI support, highlighting the need for system-level approaches that better promote critical engagement in human-AI collaboration.
[HC-4] Distorted Perspectives of LLM -Simulated Preferences: Can AI Mislead Design?
链接: https://arxiv.org/abs/2605.18311
作者: Eduard Kuric,Peter Demcak,Matus Krajcovic
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Designers of digital solutions increasingly consult Large Language Models (LLMs) for their work. However, it remains unclear how this may affect the user experiences they produce and there are no established practices. We investigate how design preferences expressed by LLM-driven simulation methods align with those of real users. We present a study that aggregates real-world data and design stimuli from twenty-nine preference tests conducted in practice by users of the UXtweak online research platform (n = 2073). We perform holistic multimodal simulations where we manipulate LLM variables (model reasoning, sampling, persona type, and specificity) and assess their effects on algorithmic fidelity. Our results unveil significant and systematic discrepancies between peoples’ real design preferences and LLM simulations that are consistent across manipulations. Synthetic justifications lack genuine depth, nuance and reasoning, which they substitute by patterns like focus on generic properties, specific elements, elaboration and overpraising. The unique attention directed by this research toward preferences within visual design stimuli highlights misrepresentation of perception and meaning by LLMs in a context that is intuitive yet critical for design teams. The external and ecological validity of our findings is high, given their replication across a multitude of real-world studies.
[HC-5] What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience
链接: https://arxiv.org/abs/2605.18302
作者: Eduard Kuric,Peter Demcak,Matus Krajcovic
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Synthetic participants represent a methodologically concerning concept that threatens the integrity of UX research. Findings from previous experiments specify how AI outputs are misaligned with the behaviors and thoughts of real humans in various ways. However, industry voices keep underestimating their severity, advocating for practical compromises where good-enough data does not need to be perfect, and all issues will be solved by future tuning. Our study tackles the lack of systematic understanding of the practical issues that arise with synthetic behavior and its use for steering decisions within real contexts. Within twelve diverse first click tests (n = 3431) obtained from real UX practice, we examine the ability of GPT to predict where humans click and how they reason about their behavior. Results (e.g., significantly different distribution from real data in 53% of tasks) demonstrate critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes. Participant personas, chain-of-thought reasoning in GPT, and different sampling parameters fail to create sensible fidelity improvements apart from inflating believability. We expose a multitude of nuanced distortions in synthetic responses that reduce their overall analytical usefulness as a decision-making resource, compared with real data. Observed distortions can be theoretically linked to the properties categorically inherent to LLMs: their statistical nature and encoding of semantic heuristics dependent on their training on linguistic data.
[HC-6] DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG
链接: https://arxiv.org/abs/2605.18298
作者: Yang Shao,Peiliang Gong,Qun Dai,Daoqiang Zhang
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 22 pages, 10 pages of main text + 12 pages of appendices
Abstract:Foundation models pre-trained through masked reconstruction on large-scale EEG data have emerged as a promising paradigm for learning generalizable neural representations across diverse brain-computer interface applications. However, a critical yet overlooked challenge is that EEG encoders must learn representations invariant to incomplete observations-when different masked views of the same signal have minimal overlap, existing methods fail to constrain them to a consistent latent subspace, leading to degraded transferability. To address this, we propose DARE-EEG, a self-supervised foundation model that explicitly enforces the mask-invariance property through dual-aligned representation learning during pre-training. Specifically, we introduce mask alignment that constrains representations from multiple masked views of the same EEG sample via contrastive learning, complementing anchor alignment that aligns masked representations to momentum-updated complete features for semantic stability. Additionally, we propose conv-linear-probing, a parameter-efficient strategy that adapts pre-trained representations to heterogeneous electrode configurations and sampling rates through decoupled spectro-spatial projections. Extensive experiments across diverse EEG benchmarks demonstrate that DARE-EEG consistently achieves state-of-the-art in accuracy performance while maintaining relatively low parameter complexity and superior cross-dataset portability compared to existing methods. Furthermore, DARE-EEG contributes to effectively discovering and utilizing the rich potential representations in EEG.
[HC-7] MEEDAV: A Synchronous Web Viewer for EEG Eye-Tracking and Speech Data
链接: https://arxiv.org/abs/2605.18296
作者: Jan Pijálek,Karel Vlk,Ondřej Bojar
类目: Human-Computer Interaction (cs.HC)
备注: Presented at ACAIN 2025 (Advanced Course Symposium on Artificial Intelligence Neuroscience), recipient of the Camillo Golgi Best Paper Award
Abstract:MEEDAV is an open-source web-based application for the synchronised visualisation of electroencephalography (EEG), eye-tracking, and audio data collected in psycholinguistic research. While originally developed for the Eyetracked Multi-Modal Translation (EMMT) corpus, which uses four-channel EEG data from the Muse 2 headband, MEEDAV also supports higher-density EEG setups thanks to its channel-agnostic processing pipeline. The system performs time alignment across all modalities and provides optional ICA-based EEG denoising. It features interactive Plotly visualisations, including unified EEG-audio-gaze timelines, gaze-intensity plots, event markers, and spatial heatmaps of fixation/saccade patterns. Researchers can filter by participant and stimulus, inspect raw versus cleaned signals, and compute cross-modal correlations. All processing is handled in real time, with a modular backend that supports local file access or GitHub-based streaming. Although initially tailored to the structure of the EMMT dataset, MEEDAV demonstrates a generalisable approach to multimodal data exploration and offers a lightweight, browser-accessible solution for cognitive neuroscience and translation studies.
[HC-8] In-Vehicle Human-Machine Interface to Support Drivers in Conditionally Automated Platooning
链接: https://arxiv.org/abs/2605.18149
作者: Anna-Lena Hager,Mohamed Sabry,Walter Morales-Alvarez,Selena Möhrlein,Cristina Olaverri-Monreal
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Vehicle platooning enables close-gap driving and offers potential benefits for traffic efficiency and safety. In conditionally automated platooning, drivers remain responsible for supervising the system and intervening when necessary, making effective Human-Machine Interfaces (HMIs) critical for maintaining situational awareness and stable driver-automation coordination. This paper investigates whether an in-vehicle HMI providing continuous system-state and inter-vehicle distance information improves supervisory behavior, safety, and platoon stability. We conducted a simulation-based experiment integrated with a 6-degree-of-freedom motion system to enhance scenario realism. Dependent variables included collision occurrence, response latency following platoon disconnection, and the number of manual interventions during intact platooning. Results showed significantly fewer manual interventions when the HMI was active, with intervention rates about 80% higher without it. No significant effects were found for collision occurrence or response latency, indicating that additional information improves supervisory stability during platooning but does not substantially affect emergency reactions or collision rates. Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2605.18149 [cs.HC] (or arXiv:2605.18149v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2605.18149 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-9] A Brief Overview: On-Policy Self-Distillation In Large Language Models
链接: https://arxiv.org/abs/2605.18141
作者: Fangming Cui,Sunan Li,Jiahong Li
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:On-Policy Self-Distillation (OPSD) introduces a unified learning framework in which a single large language model simultaneously serves as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD). In this paper, we present a brief analysis of the conceptual foundations, methodological innovations, and principled designs underlying recent advances in OPSD for large language models. This discussion, crafted from the perspective of beginners in this field, aims to provide a concise overview of the design principles and emerging patterns of OPSD in LLMs, intended for researchers who are similarly new to this area.
[HC-10] Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users
链接: https://arxiv.org/abs/2605.18036
作者: Alfio Ventura,Tim Katzke,Jan Corazza,Mustafa Yalçıner
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Preprint. Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI 2026). Final version to appear in the conference proceedings
Abstract:Trust calibration – aligning user trust judgment with model capability – is crucial for safe deployment of explainable AI (XAI), yet is often evaluated via global trust ratings detached from objective performance evidence. We present a preregistered, incentivized between-subject online study (N=418 representative UK sample) on explainable skin-lesion classification that disentangles expectation-setting from experienced performance. Participants completed 15 case evaluations using a fixed XAI panel (malignancy score, reliability score, and saliency map). We systematically manipulated five experimental onboarding conditions varying example-based information and limitation disclosures with five stimulus packages naturally varying observed prediction quality. Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models. Only limitation disclosure for case-wise measures reliably impacts trust calibration, and short-term experience did not yield progressive calibration. Further, the experienced package of stimuli explained substantially more variance than the experimental manipulation. However, participants were hard-pressed to differentiate between case-wise perceived trust, trustworthiness, and accuracy estimation. We discuss implications for designing limitation communication and for measuring and analysing calibration metrics in XAI evaluations. All study materials and data of this study are publicly available for replication and further academic use.
[HC-11] PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows ACL2026
链接: https://arxiv.org/abs/2605.18032
作者: Kazuki Kawamura,Satoshi Waki,Kei Tateno
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 9 pages, 3 figures, 1 table. To appear in Proceedings of ACL 2026 System Demonstrations
Abstract:Multi-agent LLM workflows – systems composed of multiple role-specific LLM calls – often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.
[HC-12] See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
链接: https://arxiv.org/abs/2605.18018
作者: Boyuan Sun,Bowen Yin,Yuanming Li,Xihan Wei,Qibin Hou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large languagemodels (MLLMs) reveals a systematic discrepancy: Attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset, in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at \hrefthis https URLthis https URL.
[HC-13] Low Latency Gaze Tracking via Latent Optical Sensing
链接: https://arxiv.org/abs/2605.17990
作者: Yidan Zheng,Matheus Souza,Kaizhang Kang,Qiang Fu,Hadi Amata,Wolfgang Heidrich
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:We present a real-time gaze tracking system that directly acquires task-relevant latent features using a fully passive optical encoder. Instead of forming and processing full-resolution images, our approach leverages a microlens array with a co-designed binary chromium mask to perform spatially multiplexed optical encoding, producing a compact set of measurements sufficient for gaze estimation. By integrating sensing and feature extraction in the optical domain, the proposed system eliminates the need for high-bandwidth image readout and substantially reduces computational overhead. The encoded measurements are captured by a 4 x 4 phototransistor array and mapped to gaze direction using a lightweight neural network. Our proof-of-concept prototype enables an end-to-end sensing-to-inference latency of 3.4 ms, outperforming published research systems. We demonstrate the effectiveness of our approach on both simulated and real-world data, achieving competitive gaze estimation accuracy while significantly improving latency and energy efficiency compared to conventional camera-based pipelines. This work highlights the potential of task-driven optical sensing for ultra-low-latency, computationally efficient human-computer interaction systems.
[HC-14] GREEN GRID: A Web-Based E-Waste Recycling Platform
链接: https://arxiv.org/abs/2605.17924
作者: Yashodip Jagtap,Aaditya Bagul,Om Kothawal,Puja Patil
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Electronic waste (e-waste) is one of the fastest-growing waste streams worldwide due to rapid technological advancements and shorter device lifespans. Improper disposal releases hazardous substances that harm the environment and human health, while valuable materials such as gold, copper, and aluminum are lost if not recycled. In 2022, approximately 62 million metric tonnes of e-waste were generated globally, but only about 22% was formally recycled. India generated around 1.751 million metric tonnes in 2023-24, with only 43% processed through authorized channels. Green Grid is a full-stack web-based platform designed to simplify and encourage e-waste recycling through an E-Dumper Locator, Green Rewards System, Insights and Awareness Hub, Scheduled Pickup Service, Recycling Impact Calculator, Eco AI Assistant, and Eco-Marketplace. Developed using this http URL, this http URL, this http URL, SQL, Google Maps API, and JWT authentication, the platform transforms e-waste recycling into a transparent, educational, and rewarding process. By combining technology, awareness, and incentives, Green Grid promotes responsible disposal and supports circular economy practices for a more sustainable future.
[HC-15] Multi-site PPG: An In-the-Wild Physiological Dataset from Emerging Multi-site Wearables
链接: https://arxiv.org/abs/2605.17859
作者: Jiayi Shao,Jiaying Ye,Shengyao Liu,Zachary Englhardt,Girish Narayanswamy,Vikram Iyer,Qiuyue(Shirley)Xue
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 20 pages, 6 figures, 11 tables. Dataset and code available at the URLs in the paper
Abstract:Wearables are widely used for mobile health monitoring, and photoplethysmography (PPG) is a key sensing modality for heart rate and related physiological measurements. However, public in-the-wild PPG datasets remain largely wrist-centric or limited to short, controlled studies, constraining research on emerging wearable form factors. We present Multi-site PPG, an in-the-wild physiological dataset collected from four custom-developed unobtrusive wearables: a smart earring, ring, watch, and necklace. Each device records green and infrared reflective PPG, 3-axis acceleration, and temperature with timestamps for cross-device alignment, while a Polar H10 chest strap provides reference electrocardiogram (ECG). Participants wore the devices for multiple days during daytime activities while continuing their normal routines. The dataset contains over 350 hours of raw data and 230-290 hours of modeling-ready 8-second windows per wearable. We benchmark heuristic, supervised, and self-supervised heart-rate estimation methods, showing substantial body-site differences: the best methods achieve mean absolute errors (MAEs) of 2.30 bpm on the earring, 5.13 bpm on the ring, 8.37 bpm on the watch, and 8.68 bpm on the necklace. We further analyze motion effects and evaluate multi-site and PPG-accelerometer fusion, demonstrating the dataset’s value for robust physiological sensing across emerging wearable form factors.
[HC-16] owards SocratiCode: Designing a Generative AI-Based Programming Tutor for K-12 Students through a 4-Week Participatory Design Study
链接: https://arxiv.org/abs/2605.17857
作者: Cassandra Lucas,Anshul Bihani,Rohini Kukka,Chun-Hua Tsai,Jaydeb Sarker,Mia Mohammad Imran
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI creates new opportunities for programming education, but many existing systems remain overly directive, producing lengthy explanations and premature solutions that can overwhelm K-12 novices. In this paper, we present a participatory design study of how an adaptive tutorial system, SocratiCode, evolved toward a Socratic tutoring model for beginner programming instruction. Drawing on weekly learner feedback, we iteratively refined the system over a four-week study with two K-12 students learning Python. Across iterations, the system shifted from flexible tutorial generation toward a more dialogic form of support characterized by guided questioning, reflection prompts, misconception checks, incremental hints, and mandatory pauses for learner input. Our preliminary observations suggest that this Socratic shift improved explanation clarity, supported problem-solving engagement, and better aligned instruction with novice learners’ needs, especially when combined with human guidance. We argue that generative AI in K-12 programming education may be most effective not as an answer engine, but as a Socratic, adaptive learning companion embedded within a human-guided instructional framework.
[HC-17] A Collaborative Rehabilitation-Exercise Serious Game for People with Stroke and their Caregivers: A Pilot Study ALT
链接: https://arxiv.org/abs/2605.17841
作者: Elizabeth D. Vasquez,Jonathan Siskind,Marion S. Buckwalter,Maarten G. Lansberg,Sean Follmer,Allison M. Okamura
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 7 figures, 4 tables, Serious Games and Applications for Health (SeGAH 2026)
Abstract:Motivation to perform movement therapy and caregiver burnout are major challenges to post-stroke life. Serious games have been shown to support therapeutic tasks in people with stroke, but there are few activities that simultaneously support informal caregiver health, which is also impacted post-stroke. Here, we present a collaborative, mutually beneficial, serious game designed to support therapy for persons with stroke and also exercise for their informal caregivers. One player performs rehabilitative wrist movements - useful to people with stroke - and the other performs a seated march exercise - useful to informal caregivers - via pedals or a keyboard to control their avatar. We present a pilot study with 6 healthy dyads to evaluate how exercise-based input of one player, the Pseudo Caregiver (PCG), impacts motivation and emotional experience in both the PCG and Pseudo Person with Stroke (PPS). While not statistically significant, we find that PCGs Interest subscale scores trended higher when using a pedal (the exercised-based input) compared to a keyboard, regardless of game play mode. PPSs’ positive affect scale scores and Competence subscale scores trended higher when their partner played collaboratively with a pedal compared to a keyboard. These trends encourage future work toward incorporating an exercise-based device, such as a pedal, to enhance the emotional and motivational experience of rehabilitative serious games for people with different movement ability levels.
[HC-18] Agents for Experiments Experiments for Agents : A Design Grammar for AI-Enabled Experimental Science
链接: https://arxiv.org/abs/2605.17746
作者: Yingjie Zhang,Chun Feng,Weizhang Zhu,Tianshu Sun
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEEDguided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.
[HC-19] UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation CVPR2026
链接: https://arxiv.org/abs/2605.17742
作者: Tianhao Han,Haoyang Zhang,Liang Xie,Haochen Chang,Kun Gao,Yuan Cheng,Pengfei Ren,Erwei Yin
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted by CVPR 2026
Abstract:Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multi-view consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables the complex spatiotemporal relationship modeling. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and samples diverse hypotheses, facilitating robust learning under noisy pseudo-labels supervision with enhanced stability. These multi-hypothesis are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).
[HC-20] LITE-SOC: Lightweight Security Operations Center Simulator for Cybersecurity Education
链接: https://arxiv.org/abs/2605.17703
作者: Martin Higgins,Shawn Thompson,Cherry Mangla
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注: 5 pages, 3 figures. Accepted at the 2026 IEEE Frontiers in Education Conference (FIE 2026)
Abstract:This innovative practice WIP paper describes \emphLITE-SOC, a lightweight web-based Security Operations Center (SOC) simulator designed for instructor-led cybersecurity education. SOC analysts must triage large volumes of alerts, separate genuine threats from false positives, and communicate decisions under time pressure. Recreating this environment in the classroom is difficult and often impractical for institutions without access to cyber ranges or enterprise security infrastructure. LITE-SOC was developed to provide a simpler alternative. The platform generates continuous streams of synthetic SOC events and offers separate student and instructor views with visualization tools, event annotation, and region-based chat. Instructors control the pacing of the exercise and can inject targeted incidents to guide the scenario. The goal is to give students a practical introduction to SOC workflows such as triage, prioritization, and decision-making without requiring a full operational SOC environment. The platform is intended for use in guided classroom exercises where students collaboratively investigate alerts and practice real-time triage and communication.
[HC-21] PULSE: Agent ic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship
链接: https://arxiv.org/abs/2605.17679
作者: Zhiyuan Wang,Ariful Islam,Indrajeet Ghosh,Xinyu Chen,Katharine E. Daniel,Subigya Nepal,Philip Chow,Laura E. Barnes
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the precise moments they most need support are often the moments when self-report is sparse, a phenomenon we term the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but prior sensing-based affect prediction has been limited by an accuracy ceiling, suggesting a bottleneck not only in available data, but in how behavioral signals are interpreted. We present PULSE, a system that shifts from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose-built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate inferences through retrieval-augmented population-level comparisons. Rather than receiving pre-formatted feature summaries, agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis-driven clinical reasoning. We evaluate PULSE through a 2*2 factorial design crossing reasoning architecture (structured vs. agentic) with data modality (sensing-only vs. with diary) on 50 cancer survivors from a longitudinal study of cancer survivors. Agentic reasoning is the primary driver of performance: agentic multimodal agent achieves balanced accuracy of 0.743 for emotion regulation desire with diary and sensing data, while agentic agents predict intervention availability at 0.713 with passive sensing data only. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing, advancing the feasibility of proactive just-in-time mental health support.
[HC-22] MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding
链接: https://arxiv.org/abs/2605.17656
作者: Athar Parvez,Muhammad Jawad Mufti,Muqaddas Gull,Omar Hammad
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Understanding mobile user interfaces is important for building intelligent systems such as automation tools, accessibility solutions, and UI-aware agents. However, progress in this area is still limited by the lack of high-quality datasets that reflect real-world mobile applications and include reliable annotations. In this work, we introduce MUIAnno, a publicly available expert-annotated dataset for mobile UI understanding, collected from a diverse set of applications across multiple categories available on the iTunes platform. Each app was manually explored to capture representative UI screens, resulting in a collection that reflects a wide range of layouts and design patterns found in practice. To ensure annotation quality, we developed a custom web-based tool that allows UI/UX experts to label interface elements through a simple drag-and-drop process and generate structured annotations in JSON format. MUIAnno includes detailed annotations of common UI components such as buttons, input fields, navigation elements, and other key interface elements. In addition to presenting the dataset, we also provide benchmark experiments for UI element detection along with baseline results, offering a starting point for future research. We believe MUIAnno can support further work in mobile UI understanding and help improve systems that rely on accurate interpretation of interface elements.
[HC-23] Beyond Accuracy: Robustness Interpretability and Expressiveness of EEG Foundation Models
链接: https://arxiv.org/abs/2605.17562
作者: Urban Širca,Maryam Alimardani,Stefanos Zafeiriou,Konstantinos Barmpas
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.
[HC-24] RAG -based EEG-to-Text Translation Using Deep Learning and LLM s
链接: https://arxiv.org/abs/2605.17503
作者: Enrico Collautti,Xiaopeng Mao,Luca Tonin,Stefano Tortora,Sadasivan Puthusserypady
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 6 pages, 2 figures. Submitted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics
Abstract:The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. In nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 ± 0.022 compared to 0.139 ± 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
[HC-25] An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training
链接: https://arxiv.org/abs/2605.17468
作者: Hung-Yue Suen,Kuo-En Hung
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures
Abstract:This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on-camera oral presentation skills at scale. The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching to support deliberate practice. Built on an XGBoost backbone, the ITS maps multimodal inputs (facial, vocal, textual, and oculomotor features) into evidence-based feedback that can be traced back to observable performance cues. Trained on 10,360 Massive Open Online Course (MOOC) video segments, the system achieved rubric-aligned scoring with performance levels comparable to expert ratings (R2 = 0.48-0.61, Spearman’s rho = 0.69-0.78, MAE = 0.43-0.57). In a pre-post validation study with 204 adult learners over a 30-day practice window, participants demonstrated significant improvements across all seven BARS dimensions (Cohen’s d = 0.39-0.90), with practice frequency showing a strong positive association with posttest performance after controlling for baseline scores and demographics. The results demonstrate how multimodal analytic outputs can be systematically transformed into observable behavioral change through an integrated feedback architecture, advancing explainable and pedagogically grounded ITS design for performance-based competencies.
[HC-26] achers Vocal Expressions and Student Engagement in Asynchronous Video Learning
链接: https://arxiv.org/abs/2605.17463
作者: Hung-Yue Suen,Yu-Sheng Su
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 34 pages, 1 figure
Abstract:Asynchronous video learning, including massive open online courses (MOOCs), offers flexibility but often lacks students’ affective engagement. This study examines how teachers’ verbal and nonverbal vocal emotive expressions influence students’ self-reported affective engagement. Using computational acoustic and sentiment analysis, valence and arousal scores were extracted from teachers’ verbal vocal expressions, and nonverbal vocal emotions were classified into six categories: anger, fear, happiness, neutral, sadness, and surprise. Data from 210 video lectures across four MOOC platforms and feedback from 738 students collected after class were analyzed. Results revealed that teachers’ verbal emotive expressions, even with positive valence and high arousal, did not significantly impact engagement. Conversely, vocal expressions with positive valence and high arousal, such as happiness and surprise, enhanced engagement, while negative high-arousal emotions, such as anger, reduced it. These findings offer practical insights for instructional video creators, teachers, and influencers to foster emotional engagement in asynchronous video learning.
[HC-27] Artificial Intelligence can Recognize Whether a Job Applicant is Selling and/or Lying According to Facial Expressions and Head Movements Much More Correctly Than Human Interviewers
链接: https://arxiv.org/abs/2605.17461
作者: Hung-Yue Suen,Kuo-En Hung,Che-Wei Liu,Yu-Sheng Su,Han-Chih Fan
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 11 pages, 5 figures
Abstract:Whether an interviewee’s honest and deceptive responses can be detected by facial expression signals in videos has been debated and requires further research. We developed deep learning models enabled by computer vision to extract temporal patterns of job applicants’ facial expressions and head movements to identify self-reported honest and deceptive impression management (IM) tactics from video frames in real asynchronous video interviews. A 12- to 15-minute video was recorded for each of N=121 job applicants as they answered five structured behavioral interview questions. Each applicant completed a survey to self-evaluate their trustworthiness on four IM measures. Additionally, a field experiment was conducted to compare the concurrent validity associated with self-reported IMs between our modeling approach and human interviewers. Human interviewers’ performance in predicting these IM measures from another subset of 30 videos was obtained by having N=30 human interviewers evaluate three recordings. Our models explained 91% and 84% of the variance in honest and deceptive IMs, respectively, and showed stronger correlations with self-reported IM scores than human interviewers.
[HC-28] Expert Cognition Dashboard: From Learning Analytics to Cognition Intelligence in AI-Driven Education
链接: https://arxiv.org/abs/2605.17263
作者: Annie Yuan
类目: Human-Computer Interaction (cs.HC)
备注: 24 pages, 6 figures
Abstract:Current AI-driven educational systems primarily rely on behavioural analytics, performance metrics, and content-level interactions to model learning. While these approaches provide useful indicators of learner activity, they are insufficient for representing the expert cognition used to interpret learner development, identify misconceptions, and make adaptive pedagogical decisions. Existing learning analytics dashboards largely visualise learner behaviour for human instructors, rather than embody expert cognition as a reasoning infrastructure for AI-native education. This paper introduces the Expert Cognition Dashboard (ECD), a cognition-centred reporting infrastructure for AI Twin-driven education systems. ECD models expert cognition within dashboard systems, enabling learner behaviours to be interpreted through expert-like cognitive structures rather than treated as raw behavioural signals. The proposed framework transforms student interactions into interpretable cognition structures through AI Tutor analysis and multi-level dashboard aggregation. Its architecture organises cognition across three layers: individual cognition dashboards, class cognition dashboards, and AI Twin expert dashboards for cross-group reasoning and adaptive intervention. Building on the AI Expert Feedback Ecology framework, ECD redefines dashboards as cognitive middleware that connects learner behaviours with AI-driven expert reasoning. By modelling interpretation, identity cognition, value recognition, misconception patterns, and learning tension, ECD enables AI Twins to identify recurring learner difficulties, generate adaptive tasks, and support personalised intervention. The paper argues for a shift from learning analytics toward Cognition Intelligence, positioning dashboards as foundational cognition infrastructures that embed expert reasoning into future AI-native education systems. Comments: 24 pages, 6 figures Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2605.17263 [cs.HC] (or arXiv:2605.17263v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2605.17263 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-29] CLARA: An AI-Augmented Analytics Dashboard for Collaboration Literacy
链接: https://arxiv.org/abs/2605.17259
作者: Dawei Xie,Khalil Anderson,Tochukwu Eze,Chenghong Lin,Bookyung Shin,Marcelo Worsley
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 1 figure. Accepted at AIED 2026
Abstract:Collaboration literacy requires adapting to the evolving demands of group work within complex discussions, making it difficult to develop and assess. Traditional analytics metrics capture behavioral signals while missing the semantic dimensions of how learners approach collaboration and build on each other’s ideas. We present Collaboration Literacy through Artifact Reasoning and Augmentation (CLARA), an agentic analytics system that extracts semantic representations from transcripts as analytics artifacts: concept maps representing emergent ideas and relationships, and collaboration assessment characterizing collaboration quality across seven dimensions. While users explore these artifacts through the dashboard, the same artifacts are indexed into distinct vector database collections for agent retrieval and reasoning. This architecture establishes a human-AI common ground where users and AI can operate over shared representations. Evaluation results show that CLARA produces reliable collaboration quality analysis and, owing to the artifacts serving as knowledge infrastructure, improves both retrieval performance and response quality over transcript-only baselines. Our work suggests that AI-produced artifacts may scaffold human interpretation and ground AI reasoning in learning analytics workflows.
[HC-30] Designing for Being-With: Presence Without Personhood in Conversational Human-AI Interaction
链接: https://arxiv.org/abs/2605.17194
作者: Hector Michael Fried,Robin Hill
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted peer-reviewed workshop paper presented at the 3rd Workshop on Designerly HRI, HRI 2026
Abstract:Conversational AI systems increasingly generate social presence through linguistic fluency, emotional mirroring, and continuity across interactions. While these qualities can support engagement, they also risk relational overreach-particularly in care-adjacent contexts where users may interpret fluent systems as empathic, competent, or authoritative. This position paper argues for a designerly alternative: being-with without becoming. Drawing on a program of research-through-design and design ethnography involving the design, deployment, and reflective analysis of conversational agents across public, educational, cultural, and care-adjacent settings, the paper introduces the concept of bounded relational presence. Bounded presence supports attentiveness, continuity, and responsiveness while explicitly avoiding claims of personhood, therapeutic authority, or human equivalence. Presence is reframed as a designable interaction quality that can be tuned, constrained, and deliberately withdrawn, rather than maximized as a performance goal. The contribution is not a deployed clinical system, but a set of designerly principles for shaping relational interaction in conversational HRI that emphasize relational coherence, honesty of limits, and accountable withdrawal.
[HC-31] Substantial Decomposable and Invisible: Visual Context Misalignment in Instructional Videos for Physical Tasks
链接: https://arxiv.org/abs/2605.17184
作者: Yayuan Li,Chenglin Li,Jingying Wang,Filippos Bellos,Anhong Guo,Jason J. Corso
类目: Human-Computer Interaction (cs.HC)
备注: 14 pages, 9 figures, 2 tables
Abstract:Instructional videos are the dominant medium for learning physical tasks, yet they rarely match the user’s real-world visual context. Motor simulation and cognitive load theories predict this mismatch should matter, but we do not know (1) how much it could affect task completion, (2) which visual attributes are responsible, and (3) how users experience it. We conduct two complementary studies (56 participants, 86+ hours, four first-aid and culinary tasks) in which we use Wizard-of-Oz recordings to control the degree of visual alignment in instructional videos. In Study 1 (N=16), we prepare In-Context instructional videos (ICON) – fully aligned with the user’s visual perception – to compare against business-as-usual Internet videos. ICON yields statistically significant improvements: 11.1% higher completion quality and 15.5% faster completion. Qualitative analysis reveals four visual context attributes responsible for the effect: Task Object Intrinsics, Task Object State, Environmental Context, and Observational Context. Study 2 (N=40) ablates each attribute by systematically misaligning one at a time from an otherwise fully aligned video, confirming all four produce consistent degradation. However, we find users fail to perceive the effect of single-attribute misalignment on task performance despite clear drops in objective measurement. Visual context misalignment is substantial, decomposable, and invisible to the user. These findings help understand the effect of visual context mismatch and how we should evaluate instructional videos for physical task guidance.
[HC-32] ATRACT: A Trustworthy Robotic Autonomous system to support Casualty Triage
链接: https://arxiv.org/abs/2605.17123
作者: Tasweer Ahmad,Rafael Pina,Sandip Pradhan,Arindam Sikdar,Mindula Illeperuma,Khizer Saeed,Peter Lee,Varuna De Silva,Ardhendu Behera
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:At a time when drones are increasingly associated with hostile operations, we re-purpose them for humanitarian and life-saving applications. However, adapting search and rescue drones for battlefield triage remains extremely challenging; the technology must perform reliably to support frontline medics who are forced to operate under extreme uncertainty, restricted access, and significant personal risk. Due to growing vulnerabilities of casualty evacuation in conflicting zones, this paper presents ATRACT (A Trustworthy Robotic Autonomous system to support Casualty Triage), a novel human-in-the-loop decision support system to enable early battlefield triage during the critical post-trauma period. ATRACT integrates drone-captured video with wearable sensor input for multi-modal learning to support casualty-state assessment, thereby addressing the limitations of existing systems. Drone video captures fine-grained behavioural cues, such as pose, posture, while body-worn sensors provide complementary physiological signals, including heart rate, breathing rate, and movement. By combining two modalities, ATRACT provides evidence to support the early judgement of medics when direct access to the casualty is delayed, risky, or restricted. To mitigate the data realism gap pertaining to injured actions, a conditional variational autoencoder is devised for data augmentation. Experimental results on our drone captured dataset show that proposed pipeline achieves 85.7% accuracy for action classification; while our lightweight CNN visual encoder remains competitive with stronger pre-trained video backbones. Overall, the results support ATRACT as a practically meaningful step towards remote triage in contested environments, where multi-modal sensing, human oversight and trustworthy decision support can improve casualty prioritisation, and lessen the exposure of frontline medics.
[HC-33] Agent ic AI Translate: An Agent ic Translator Prototype for Translation as Communication Design
链接: https://arxiv.org/abs/2605.17041
作者: Masaru Yamada
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 14 pages. Conceptual and architectural paper; empirical validation in future work. Code: this https URL (v0.8.0). Live demo: this https URL
Abstract:We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) – that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text-in / text-out paradigm of machine translation with a four-stage agentic cycle (Identify - Prompt - Generate - Verify), preceded by an interactive specification phase in which the user composes – through model-assisted dialogue – a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA-MQM error-span protocol (Kocmi Federmann, 2023) for evidence-grounded scoring, and document-level coherence is preserved through a DelTA-lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference-material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural – an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.
[HC-34] Algorithmic Cultivation: How Social Media Feeds Shape User Language
链接: https://arxiv.org/abs/2605.17010
作者: Olivia Pal,Agam Goyal,Eshwar Chandrasekharan,Koustuv Saha
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Algorithmic feeds have become primary environments for encountering information online, yet while they shape what people see, less is known about how sustained feed exposure shapes how people write. Drawing on Cultivation Theory, we examine whether algorithmic feeds function as online environments that leave measurable traces in users’ language. We leverage a large-scale longitudinal dataset of 235M posts by 4M users on Bluesky, and conduct a quasi-experimental study matching an initial pool of 368,513 users exposed to one of three feeds – News, Science, and Blacksky – with a pool of 2,001,915 active control users who did not engage with any of these feeds. We examine linguistic evolution across three dimensions: lexico-semantics, psycholinguistics, and topics. We find that users exposed to these feeds show significantly greater stylistic accommodation, semantic alignment, and register formalization than matched controls. These effects vary markedly by feed identity – Blacksky produces the deepest psycholinguistic restructuring, with significant shifts in cognitive processing, affective expression, and pronoun use, while News and Science effects are largely confined to register and topical focus. Regression models reveal that reposting is the most consistent predictor of linguistic convergence across all feeds, whereas posting and bookmarking show feed-dependent effects, with effects differing more than fourfold across feeds. Our work extends Cultivation Theory beyond belief formation to linguistic behavior, demonstrating that feeds function as persistent linguistic environments that gradually shape what and how users write online. Our work has implications for studying algorithmic influence, online identity formation, and the design and governance of feed-based platforms that mediate online interactions.
[HC-35] WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI
链接: https://arxiv.org/abs/2605.16972
作者: Jingjing Li,Zhi Liu,Xiyao Jin,Tatsuki Fushimi,Yoichi Ochiai
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 38 pages, 13 figures. Under review at ACM JOCCH (Journal on Computing and Cultural Heritage)
Abstract:Cultural heritage exhibitions often struggle to sustain attention and support reflective engagement. Physical exhibitions rely on fixed interpretive aids that lack adaptability to individual backgrounds or curiosity, and their effectiveness depends heavily on a visitor’s Personal Context, prior knowledge, and cultural literacy. Meanwhile, digital exhibitions prioritize convenience and accessibility but risk weakening the Physical and Social Contexts that define embodied cultural experience. WhiteTesseract addresses this gap by enabling in-situ interpretation through high-resolution XR and conversational AI. The system integrates spatial intelligence via artwork recognition to allow visitors to selectively reduce environmental distractions (via diminished reality) and engage in context-aware dialogue (via large language models). The goal is to preserve the richness of the physical and social environment while providing a flexible space for personal reflection, enhancing Personal Context without compromising physical authenticity. We deployed the system in a Claude Monet exhibition and conducted a controlled user study with 26 participants. Quantitative results showed that WhiteTesseract modulation significantly increased average viewing duration from 35.3 to 98.3 seconds (p 0.001). Analysis of 529 visitor-AI interactions revealed that 60% extended beyond factual queries to include analytical, emotional, and comparative inquiries. These findings demonstrate how XR and AI can enrich the physical exhibition experience by supporting deeper, more personalized engagement without displacing the embodied value of cultural heritage. We discuss technical and social constraints for real-world deployment and limitations of our controlled setting. Comments: 38 pages, 13 figures. Under review at ACM JOCCH (Journal on Computing and Cultural Heritage) Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16972 [cs.HC] (or arXiv:2605.16972v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2605.16972 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-36] he Effects of Structured LLM -Generated Feedback on Programming Assignment Performance
链接: https://arxiv.org/abs/2605.16933
作者: Tsvetomila Mihaylova,Evanfiya Logacheva,Arto Hellas,Jing Fan,Francisco Castro,Bita Akram,Narges Norouzi,Peter Brusilovsky,Juho Leinonen
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:When programming students encounter errors in their code, compiler messages or static analysis output often provide limited guidance, particularly for novice programmers. Personalized feedback from instructors can be effective but does not scale well. Recent advances in large language models (LLMs) enable automated feedback generation at scale. This study examines whether LLM-generated feedback with different levels of guidance is associated with differences in students’ problem-solving behavior. We analyze effects on time to solution and number of attempts, and examine whether these effects differ by programming experience. We design three feedback types and compare them to a baseline in which students receive only compiler error messages. Results from an online programming course show that LLM-generated feedback is associated with faster time to solution compared to the no-feedback baseline, with less guided feedback showing slightly stronger effects. Overall, the findings suggest that feedback structure plays an important role in how students progress toward correct solutions and motivate further work on adaptive feedback designs and longer-term learning outcomes.
[HC-37] Recall Isnt Enough: Bounding Commitments in Personalized Language Systems
链接: https://arxiv.org/abs/2605.16712
作者: Rui Tang,Yichi Zhang,Xi Chen,Chen Dong,Youwei Yang,Yumeng Shen
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 14 pages, 3 figures, 22 tables; preprint version
Abstract:Long-context and memory systems usually treat personalization as a recall problem. In practice, many failures occur later, when a system commits: it turns noisy hints into hard constraints, drops rare witnesses, forgets downstream obligations, or answers despite infeasibility. We introduce Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV). CBEA activates a bounded evidence set using typed coverage, tail witnesses, and consequence debt; LCV validates structured commitments before prose and routes infeasible states to repair, abstention, or recontract. Across 360 fixtures and three generation backends, CBEA+LCV reaches zero failures within validator scope at 0.49-0.60 availability over attempted runs. Raw and long-context baselines with the same LCV gate reach zero only at 0.003-0.092. A shadow oracle diagnostic marks the limit: CBEA+LCV recalls 0.012 of uncompiled visible facts, while raw recalls 0.53. The result is a bounded operating point: explicit commitment control and 74-75% lower median input payload, not universal memory dominance.
[HC-38] Navigating Transitions: Envisioning Conversational User Interfaces to Support International Students
链接: https://arxiv.org/abs/2605.16677
作者: Yuhui Xu,Isabel Blijenburg,Bhakti Moghe,Maarten Houben,Daniel Tetteroo,Wijnand IJsselsteijn,Minha Lee
类目: Human-Computer Interaction (cs.HC)
备注: A preprint of an accepted CUI26 short paper
Abstract:International students face struggles when adapting to the host country. They are more susceptible to mental health problems than domestic students. While Conversational User Interfaces (CUIs) are increasingly researched and implemented, research on how they may help international university students is still scarce. Thus, we conducted participatory design workshops with international students who shared their perspectives and struggles of studying abroad, in which they also envisioned CUIs as aids to support their transitions. Participants proposed features of a CUI to address uncertainty, loneliness, and misunderstandings of cultural differences. Our paper reveals international students’ needs and provides design implications for CUIs to support their well-being.
[HC-39] Beyond One-Size-Fits-All: User Strategies for Simplification Technique and Level Selection in Responsive Line Charts
链接: https://arxiv.org/abs/2605.16661
作者: Rifat Ara Proma,Paul Rosen
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
备注:
Abstract:Simplifying line charts for responsive displays typically applies a single algorithm uniformly across devices, despite the availability of multiple techniques that preserve different signal characteristics (e.g., peaks, trends, periodicity). We investigate whether users benefit from algorithmic choice when adapting charts across screen sizes. In a within-subjects study (N=30), participants simplified nine datasets under three conditions: single pre-assigned technique (C1), multiple techniques (C2), and multiple techniques with manual point selection (C3), each with control over simplification level. We found that users adapted technique selections across datasets rather than devices, leveraging dataset-level strategies rather than per-device optimization. Additionally, interaction complexity did not always increase engagement uniformly, suggesting that responsive simplification tools should balance algorithmic flexibility with progressive disclosure and strong defaults. Supplemental materials are available at this https URL.
[HC-40] Can AI Reduce Acculturative Stress? Exploring the Role of AI-Mediated Speaking Practice in Chinese International Students Perceived Language Insufficiency Social Isolation and Academic Pressure
链接: https://arxiv.org/abs/2605.16653
作者: Bin Zou,Yijia Yuan,Chenghao Wang,Yiran Du
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This study examined whether AI-mediated speaking practice can reduce acculturative stress among Chinese international students in UK universities. Using a sequential explanatory mixed-methods design, 126 participants were randomly assigned to an experimental group, which completed a four-week intervention using EAP Talk, an AI-assisted English for Academic Purposes speaking platform offering role play, scenario-based practice, free talk, and automated feedback, or a control group, which continued usual academic and English-learning activities. Pre- and post-test questionnaires measured perceived language insufficiency, social isolation, and academic pressure, while semi-structured interviews with 20 experimental-group participants contextualised the quantitative findings. Linear mixed-effects models showed that the experimental group experienced significantly greater reductions than the control group across all three outcomes, with the strongest effect on perceived language insufficiency. Interview findings suggested that EAP Talk supported low-stakes rehearsal, communicative confidence, academic speaking preparation, and greater willingness to initiate social interaction. However, participants also noted that AI-mediated practice could not fully reproduce authentic human interaction, disciplinary feedback, or broader institutional support. The findings suggest that AI-mediated speaking practice can function as a supplementary scaffold for reducing communication-related dimensions of acculturative stress, but should be integrated with peer interaction, teacher feedback, and wider support services.
[HC-41] Psychological Mechanisms of Generative AI Discontinuance Intention among Chinese K-12 Teachers
链接: https://arxiv.org/abs/2605.16648
作者: Yiran Du,Qian Chen,Huimin He
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This study examines the psychological mechanisms underlying Chinese K-12 teachers’ discontinuance intention toward generative AI. Drawing on the Cognition-Affect-Conation framework, the study investigates how cognitive evaluations of generative AI shape affective responses and subsequently influence behavioural intention. Survey data from 256 Chinese K-12 teachers were analysed using structural equation modelling and fuzzy-set qualitative comparative analysis. The results showed that privacy concern, algorithmic opacity, and information hallucination increased AI anxiety, which in turn strengthened discontinuance intention. Conversely, perceived intelligence, perceived personalisation, and perceived interactivity enhanced satisfaction, which reduced discontinuance intention. The configurational analysis further identified multiple pathways leading to high discontinuance intention, highlighting the combined roles of technological risks, AI anxiety, weak affordance perceptions, and low satisfaction. These findings extend research on post-adoption generative AI use in education and suggest that sustainable integration requires both reducing technological uncertainty and enhancing teachers’ positive user experiences.
[HC-42] Examining University Students Artificial Intelligence-Generated Content (AIGC) Verification Intention from a Protection Motivation Perspective
链接: https://arxiv.org/abs/2605.16642
作者: Yiran Du
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Artificial Intelligence-Generated Content (AIGC) is increasingly used by students to support learning tasks, yet its outputs may contain inaccuracies, fabricated references, bias, and unsupported claims. This study examined students’ intention to verify AIGC from the perspective of Protection Motivation Theory. A cross-sectional survey was conducted with 432 students who had experience using AIGC for learning. Structural equation modelling (SEM) was used to test the hypothesised relationships among threat appraisal, coping appraisal, protection motivation, and AIGC verification intention, while fuzzy-set qualitative comparative analysis (fsQCA) was applied to identify configurational pathways leading to high verification intention. The SEM results showed that protection motivation positively predicted AIGC verification intention. Perceived severity, perceived vulnerability, response efficacy, and self-efficacy positively influenced protection motivation, whereas maladaptive rewards and response cost had negative effects. The fsQCA results further revealed three configurations leading to high verification intention, with protection motivation appearing as a core condition across all pathways. These findings suggest that students’ willingness to verify AIGC depends on both risk recognition and perceived coping capacity. The study extends Protection Motivation Theory to the context of AIGC verification and provides implications for promoting critical, responsible, and academically appropriate use of generative AI in higher education.
[HC-43] PromptDecipher: Supporting AI Tutor Authoring Through Editable Simulated Interactions
链接: https://arxiv.org/abs/2605.16605
作者: Miina Koyama,Ruiwei Xiao,John Stamper
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Chatbots have long been explored as tools to support learning, and recent advances in large language models have significantly expanded the availability of platforms for educators to author AI tutoring chatbots. Yet effective authorship demands more than writing a system prompt; it requires educators to act as learning designers, AI interaction designers, and QA engineers. In practice, however, teachers rarely fulfill these roles. Our formative study found that virtually none systematically tested their bots before deploying them to students. To address this gap, we present PromptDecipher, a system that restructures the authoring workflow around a direct correction-based interaction rather than writing abstract system prompts, teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip. PromptDecipher will be deployed in an AI for Educators course enrolling hundreds of higher-education instructors. A live prototype (this https URL), an anonymized codebase (this https URL), and anonymized demo (this https URL) are available via links in the footnote.
[HC-44] Why Modeling Human Haptic Material Perception with AI Is Difficult
链接: https://arxiv.org/abs/2605.16602
作者: Yasemin Vardar
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, conference
Abstract:Touch plays a central role in how humans perceive and recognize materials through physical contact. Despite decades of research, the mechanisms by which tactile signals are transformed into meaningful perceptual representations remain poorly understood, limiting the design of interactive systems and intelligent agents with human-like haptic perception. Recent advances in artificial intelligence (AI) offer new opportunities to model and exploit tactile data; however, haptics presents fundamental challenges for contemporary AI due to its interaction-dependent, multimodal nature. This position paper argues that progress at the intersection of AI and haptics is constrained by three key bottlenecks: (1) the scarcity of large, diverse, and balanced haptic datasets; (2) the lack of standardized evaluation platforms and perceptual benchmarks; and (3) limitations in model capacity and interpretability when applied to tactile perception. I discuss how these challenges impede generalization, reproducibility, and scientific insight into human touch and review emerging strategies to address them. This paper highlights opportunities for coordinated, cross-disciplinary efforts to advance AI systems that not only perform robust haptic perception but also contribute to a deeper understanding of human touch.
[HC-45] Voice Cloning is Style Transfer
链接: https://arxiv.org/abs/2605.16578
作者: Kaitlyn Zhou,Federico Bianchi,Martijn Bartelds,Anna Pot,Yongchan Kwon,James Zou
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully ‘‘clone’’ an individual’s voice. Instead, we find that widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.
[HC-46] Characterizing AI Fact-Checkers and Their Contributions on Community Notes
链接: https://arxiv.org/abs/2605.16566
作者: Yilin Gong,Siqi Wu
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 14 pages, 10 figures
Abstract:Recent advances in artificial intelligence (AI) have made timely, scalable, and effective fact-checking increasingly feasible. One such deployment is X’s Community Notes, which provides the AI Note Writer API to enable end-to-end automated generation of contextual information. We present the first empirical analysis of AI fact-checkers and their contributions on Community Notes, examining four key dimensions: volume, velocity, variety, and veracity. We find that, between September 2, 2025 and May 9, 2026, 20 AI writers account for 14.2% of all submitted notes, with their daily share rising rapidly to 44.8% lately. AI writers are highly responsive, typically submitting notes within minutes of posts becoming available via the API. They also expand coverage, contributing notes to 16.8% of fact-checked posts, of which 74.4% are not checked by humans. Over time, AI writers become more prolific and responsive, with increasing coverage and discovery rates. Despite these advantages, their veracity remains mixed. Collectively, AI writers contribute a higher share of helpful notes while receiving a smaller share of human ratings, relative to their share of submitted notes. Controlling for the fact-checked post and note submission order, both AI and human writers exhibit a first-mover advantage, with earlier notes attracting more ratings. More importantly, AI-generated notes are less likely to be classified as helpful than those written by human experts, though they outperform those written by laypeople. Our findings provide new insights into the practical capabilities and limitations of AI-driven fact-checking, with implications for the design and governance of human–AI collaborative crowdsourced context systems.
[HC-47] ARIPlay: A Test Framework for AR Applications based on Interactive Area Tracking in Playback Videos ICSE2026
链接: https://arxiv.org/abs/2605.16544
作者: Seyed Amir Mousavi,Xiaoyin Wang
类目: oftware Engineering (cs.SE); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 13 pages, 7 figures, 3 tables. Accepted at ICSE 2026
Abstract:As Augmented Reality (AR) becomes more and more embedded in daily life, ensuring the quality, safety, and reliability of AR applications is increasingly important. However, AR apps present unique challenges for automated testing. Unlike static GUI layouts in traditional mobile apps, AR apps acquire their interaction interface from the surrounding environment, which is volatile and non-deterministic. Recent advancements like ARCore Playback and ARKit Replay allow developers to reuse real-world scenarios by recording and playing back enriched videos, enabling more feasible automated AR testing. However, using playback videos introduces two major challenges: test inputs must be timed precisely, and interactive areas in the video are dynamic, irregular, and difficult to identify. To address these challenges, we propose TARIPlay, a framework that analyzes playback videos to detect, track, and filter proper interactive areas over time for automated testing. In particular, TARIPlay identifies viable test opportunities based on criteria like stability and visibility, then feeds this information to an automated testing engine to simulate user interactions. We perform an experiment with four open-source AR apps and nine playback videos. Evaluation results show that TARIPlay significantly outperforms the existing tool Monkey in test coverage (55.8% over 41.98% on branch coverage) of AR-related code, and can also be used to assess the quality of playback videos for testing suitability.
[HC-48] LLM s in Qualitative Research: Opportunities Limitations and Practical Considerations
链接: https://arxiv.org/abs/2605.16538
作者: Henry Salgado,Meagan R. Kendall,Martine Ceberio,Alexandra Coso Strong
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: To be published and presented in 2026 ASEE Annual Conference and Exposition
Abstract:This paper examines the opportunities, limitations, and practical considerations associated with the use of large language models (LLMs) in qualitative research. Drawing on a multidisciplinary perspective that combines expertise in qualitative methods and explainable AI, the paper argues that responsible integration of LLMs into qualitative workflows requires researchers to engage critically with a curated set of technical parameters, that is, context window constraints, temperature and top-p sampling settings, user and system prompt design, and model documentation in the form of system cards. The paper situates these considerations within the epistemological commitments of qualitative research, including reflexivity, positionality, and interpretive judgment, and discusses how the opacity of contemporary LLMs differs from earlier natural language processing tools such as topic models and lexicon-based sentiment analyzers.
[HC-49] oward Template-Free Explainability for Monte Carlo Tree Search
链接: https://arxiv.org/abs/2605.16524
作者: Siqi Lu,Mirsaleh Bahavarnia,Hiba Baroud,Yixuan Zhang,Hemant Purohit,Ayan Mukhopadhyay
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Probabilistic search algorithms, such as Monte Carlo Tree Search (MCTS), have proven very effective in solving sequential decision-making tasks under uncertainty. However, interpreting asymmetric search trees that incorporate bandit-based tree traversal and simulation-based value estimation is difficult for end users based solely on raw tree statistics. While prior work requires hand-crafted formal logic constraints that must be updated when the problem changes, we present a framework that enables large language models (LLMs) to generate evidence-grounded explanations of MCTS decisions from recorded search traces in an end-to-end manner. Our framework maps natural-language questions to a structured set of intent categories, determines whether the existing tree contains sufficient evidence, triggers targeted expansion when needed, and generates explanations using tree statistics such as visit counts, value estimates, and risk information. Experimental results provide the first evidence that LLMs can serve as end-to-end explainers for probabilistic search, without requiring intermediate formal representations.
[HC-50] Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework
链接: https://arxiv.org/abs/2605.16516
作者: Xintong Yao
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 16 pages, 1 appendix
Abstract:Long-term interaction with LLM-based systems may produce alignment drift: a gradual process in which system outputs become less constrained by the user’s current message and more shaped by prior interaction history, while still appearing helpful, coherent, and responsive. This process is difficult to detect because the user’s subjective experience may improve as the system becomes more familiar, useful, and attuned. Existing research on human-LLM interaction has largely focused on short-term task performance, isolated outputs, or single-instance alignment problems, leaving slow and cumulative interaction-level dynamics undercharacterized. This paper proposes a mechanism-oriented framework for describing alignment drift. The framework defines the distinction between signal A and signal B, explains how drift develops through feedback loops and sub-pattern selection, divides the process into three interactional regimes, and identifies boundary conditions for controlling drift. By framing alignment drift as a recursive interactional process rather than an isolated model-side failure, the paper provides a conceptual basis for studying long-term human-system interaction.
[HC-51] MR-SLAM: Immersive Spatial Supervision for Multi-Robot Mapping via Mixed Reality ICRA2026
链接: https://arxiv.org/abs/2605.16432
作者: Prakash Aryan,Cem Erdogdu,Kavinaya Kumarchokkappan,Timo Kehrer,Sebastiano Panichella
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to ICRA 2026 Workshop “MM-SpatialAI Workshop: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding”
Abstract:Operating a multi-robot fleet for simultaneous localization and mapping (SLAM) in applications such as building inspection or warehouse-aisle monitoring requires the operator to maintain spatial awareness of each robot’s position and mapping state, a task that scales poorly on conventional 2D interfaces. We present MR-SLAM, a mixed reality (MR) system in which an operator wearing a Meta Quest 3 headset teleoperates three simulated TurtleBot3 robots through a passthrough view with real-world occlusion, while spatially anchored dashboard panels report mapping progress in situ. Each robot runs an independent SLAM Toolbox instance whose occupancy grid is merged in real time on a Robot Operating System 2 (ROS 2) back end. Across five 9-minute evaluation sessions, the system delivered scans at 8.83 +/- 0.16 Hz, mapped 17.9 +/- 0.8 m^2 of merged occupancy, and reached 94.7 +/- 0.5% cross-instance occupancy consistency across robot pairs. An additional session recorded 6.3 ms median transform jitter and 26.7 m^2 coverage of a 41 m^2 grid. We position MR-SLAM as a reference implementation for combining passthrough mixed reality supervision with multi-robot SLAM on consumer hardware.
[HC-52] Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
链接: https://arxiv.org/abs/2605.16354
作者: Jane Paik Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
备注: 10 pages, 5 figures
Abstract:Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.
[HC-53] ask-Level AI Readiness Assessment for Business Process Management:The T-IPO Model and LARA Matrix in Financial-Services IT Operations
链接: https://arxiv.org/abs/2605.16297
作者: Mingjun Li,Xiaojun Ye
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 19 pages, 7 figures, 8 tables, 50 references. A shortened workshop version has been submitted to the BPM 2026 Workshop. This preprint is the complete version
Abstract:Which tasks inside an enterprise workflow can a large-language-model agent reliably handle, and under what conditions? Most business process modeling frameworks still answer this at the activity level, even though a single activity can bundle work of radically different difficulty. This paper takes the analysis a step smaller. We describe two design artifacts developed in a financial-services IT setting: T-IPO, which represents each task as an eight-element tuple, and LARA (LLM Agent Readiness Assessment), a five-dimension rubric that scores a task’s readiness for agent substitution. Compliance Sensitivity carries 1.5\times weight, a value we fixed through a three-round Delphi study and cross-checked with AHP. The rubric produces four levels, L1 to L4, and applies a floor rule so that a task with maximum compliance load cannot be classified below L3 no matter what the other scores say. Both artifacts sit inside a larger methodology (PARTIS) that we map onto BWW ontology in Section 3. We evaluate the instruments across 127 tasks. Inter-rater agreement reaches Fleiss’ \kappa = 0.80 ; a replication at three further institutions returns \kappa = 0.73 . A controlled comparison against activity-level assessment suggests, though does not prove, an improvement in predictive utility at the task level. Pilot deployment of 120 task instances confirms that auto-completion decays monotonically from 95% at L1 through about 70% at L2 to about 40% at L3. Exploratory factor analysis points to a two-factor structure: task readiness seems to be determined jointly by cognitive-execution complexity and governance-compliance intensity. We close with a recalibration procedure (LARA-TCA) so the rubric can keep pace with evolving LLM capabilities.
[HC-54] ANVIL: Analogies and Videos for Lecturers
链接: https://arxiv.org/abs/2605.16295
作者: Yuri Noviello,Anastasiia Birillo,Gosia Migut
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注:
Abstract:We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.
[HC-55] Making Sense of the Weather Together: Collaborative Sensemaking in Severe Weather Livestreams
链接: https://arxiv.org/abs/2605.16285
作者: Julie A. Vera,Mark Zachry,David W. McDonald
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted to CSCW 2026; 29 pages, 4 figures
Abstract:This paper examines collaborative sensemaking during severe weather events through the emerging phenomenon of “weatherfluencers” or content creators who livestream meteorological interpretation on platforms like YouTube. Drawing from sensemaking theory, crisis informatics, and platform studies, we analyze how these creators navigate the sociotechnical dynamics of interpreting severe weather in real time with distributed audiences. Through critical incident analysis of 13 Particularly Dangerous Situation (PDS) storm warnings across three prominent weatherfluencers, we identify three key practices: multi-source information triangulation, temporal bridging techniques, and platform-specific adaptations that transform entertainment interfaces into safety-critical communication channels. Our analysis shows how these practices challenge existing models of crisis communication by integrating distributed expertise, collapsing temporal frames, and reconfiguring platform affordances. This research contributes to understanding how informal emergency communicators mediate between institutional alerting systems and public needs, and how visual, multimodal crisis communication differs from text-centered approaches.
[HC-56] Keeping an Eye on AI: A Framework for Effective Human Oversight of AI Systems WWW
链接: https://arxiv.org/abs/2605.16278
作者: Susanne Gaube,Markus Langer,Tim Miller,Kevin Baum,Raimund Dachselt,Anna Maria Feit,Ujwal Gadiraju,Harmanpreet Kaur,Mark T. Keane,Richard Landers,Johann Laux,Q. Vera Liao,Brian Lim,Linda Onnasch,Tim Schrills,Liz Sonenberg,Chenhao Tan,Nava Tintarev,Ziang Xiao,Hanwei Zhang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: The conceptual analysis for this work was undertaken by the authors at Dagstuhl seminar 25272 ‘Challenges of Human Oversight: Achieving Human Control of AI-Based Systems’ ( this https URL ), held at Schloss Dagstuhl (June 29th-July 4th, 2025)
Abstract:The use of Artificial Intelligence (AI) in high-risk, decision-making scenarios presents technical, safety, and normative challenges; problems that may only be ameliorated by human oversight. However, notions of human oversight lack a common foundational understanding: oversight architectures are not well defined, the roles involved remain unclear, and implementation steps are opaque. Hence, researchers and practitioners struggle to determine how to design, implement, and evaluate systems that enable effective human oversight. This paper advances a practical framework for effective human oversight of AI systems, based on a cross-disciplinary perspective that draws on insights from computer science, human-computer interaction, psychology, philosophy, and law. The core contributions are: (1) a foundational framework, with a working definition, architecture and processes for effective human oversight of AI systems; (2) an initial template for documenting oversight architectures and processes, applied to diverse domains; and (3) a synthesis of open research challenges that need to be considered in the emerging field of effective human oversight of AI systems.
[HC-57] Designing for Engagement: How Self-Determination Theory Can Guide Digital Health Design for User Motivation ALT
链接: https://arxiv.org/abs/2605.16276
作者: Zheyuan Zhang,Rafael A. Calvo
类目: Human-Computer Interaction (cs.HC)
备注: 3 pages, 2 figures. Accepted at CHI 2026 Workshop: Engagement in Digital Health Interventions: Open Questions for Research and Design
Abstract:User engagement is crucial for the efficacy of digital health and mental health interventions, yet existing design strategies for improving engagement remain heterogeneous, context-specific, and insufficiently grounded in motivational theory. In this paper, we propose a preliminary, theory-grounded design framework that draws on Self-Determination Theory (SDT) and its sub-theory, Organismic Integration Theory (OIT), to guide the design of digital health interventions for sustained user engagement. Informed by existing literature and our own empirical data from surveys (N = 438), interviews (N = 31), and co-design workshops (N = 59) with end users, the framework categorises design strategies across the adoption, interface, and task spheres of the user experience, distinguishing between those that primarily support intrinsic motivation and those that foster autonomous forms of extrinsic motivation. We argue that this distinction is critical: strategies commonly grouped under umbrella terms such as “gamification” in fact operate through different motivational channels and should be designed and evaluated accordingly. By clarifying these motivational pathways, our framework aims to support researchers and practitioners in designing digital health interventions that not only facilitate initial uptake but also enhance the internalisation of health behaviours for long-term, sustained engagement. We present this framework as a basis for discussion at this workshop, inviting expert feedback and critique to refine it as a contribution to the field.
[HC-58] ChartDesign: Towards LLM Designer of Data Visualization
链接: https://arxiv.org/abs/2605.16274
作者: Mohammed Afaan Ansari,Aniruddh Bansal,Tianyi Zhou
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Charts are the dominant medium for visualizing data, discovering patterns and trends, and communicating data driven insights, yet designing them still requires expensive human effort and expertise, such as selecting appropriate chart types, axis orientations, font sizes, and layouts. Most automatic visualization systems rely on handcrafted heuristics or simple rule matching and therefore struggle to generalize across domains. This work explores the potential of large language models (LLMs) as chart designers. We propose ChartDesign, which post-trains LLMs to imitate human experts and generate chart design attributes given tabular data. To this end, we curate a diverse training corpus of data design pairs from charts in public surveys (PewResearch) and academic repositories (CharXiV). Vision language models are used to extract data and design attributes from these charts, including chart type, sub type, alignment, titles, axis labels, and bar spacing, formatted as JSON. We then fine tune LoRA adapters on Phi3, Qwen3, and InternVL2.5 to learn a mapping from data to design specifications. ChartDesign significantly improves chart design performance over strong baselines, achieving up to 84% accuracy on a held-out test set (vs. 53% for the best baseline) and generalizing to unseen domains. We further show that charts rendered from ChartDesign generated specifications are visually appealing and human preferred, narrowing the human AI gap in data visualization.
[HC-59] AI-Generated 3D Environments as Speculative Mediators in More-Than-Human Design: An Exploratory Study
链接: https://arxiv.org/abs/2605.16273
作者: Aung Pyae
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:More-than-human design challenges anthropocentric assumptions by foregrounding non-human entities as stakeholders, yet designers face an epistemic boundary: they cannot directly access non-human experience. We present an exploratory study examining how generative AI – specifically a text-to-3D world generation platform producing navigable environments – may function as a speculative mediator in more-than-human design. Through a qualitative study with five participants from engineering and sustainability backgrounds engaging with AI-generated worlds derived from non-human traces, we investigate how instant exploration – navigating generated environments within seconds – shapes reflection, iteration, and provisional treatment of outputs. Our findings suggest that navigating AI-generated environments supports reflection-in-action distinct from evaluating static representations, while designers’ epistemic stances oscillate between treating outputs as generative provocations and as authoritative representations. We propose technologically-amplified backtalk and productive provisionality as preliminary lenses for understanding how navigable AI-generated 3D environments can surface anthropocentric assumptions in more-than-human design.
[HC-60] Beyond Compliance: How AI Could Help Creative Writers by Refusing Them
链接: https://arxiv.org/abs/2605.16272
作者: Hua Xuan Qin,Guangzhi Zhu,Mingming Fan,Pan Hui
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: conditionally accepted to Creativity Cognition 2026
Abstract:Mainstream creativity support design prioritizes compliant AI for seamless writing interactions, but concerns over inappropriate AI reliance highlight the need for designs fostering reflection on balanced AI and non-AI resource use. Theoretically, intentional AI non-compliance, refusals (saying ``no’’ to requests), could introduce such reflection through friction stronger than other bypass-able solutions. Practically, refusal content/language characteristics lead to nuanced reactions. However, little research empirically focuses on nuances beyond mandatory ethical/technical constraints, on turning refusals into strategic friction for `innocuous’ requests. We address this through a qualitative study with 22 creative writers, exploring reactions to refusals to common requests across writing stages (planning, translating, reviewing). Findings suggest that reflective potential depends on heterogeneous preference alignment along situational (e.g., convergent/divergent thinking phases), cognitive (e.g., domain beliefs), and relational (e.g., AI roles) dimensions. We discuss implications for creativity support, broader issues (e.g., AI addiction), and frictional/seamful AI design (e.g., integrating different compliance levels).
[HC-61] Exploring Student Feedback Needs and Design Opportunities in Data Storytelling Education
链接: https://arxiv.org/abs/2605.16271
作者: Jennifer Posada,Taha Hassan,Lujie Karen Chen,Louise Yarnall,Jiaqi Gong
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at CHI 2026 (Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems)
Abstract:Data storytelling workflows ask learners to integrate analytical, design, and narrative skills, but instructors rarely have the capacity to provide detailed feedback at each step. Computational and AI-assisted storytelling offers opportunities to support student learning, but how feedback should be structured effectively remains unclear. To address this gap, we conducted a two-phase participatory design study. Through participant observations (N=8) and interviews (N=6), the first phase explored learners and educators’ feedback needs and challenges in a data storytelling course. The second phase conducted two design workshops (N=8/10) to design and evaluate feedback strategies (frequency, seamlessness, accountability) for Story Studio: an AI-assisted narrative storytelling application. Our findings show that participants perceived on-demand and process feedback modes as effective, but automatic and outcome feedback as slightly more persuasive. We discuss implications for designing AI-augmented storytelling systems that adapt their feedback modes to the diverse needs and expectations of students.
[HC-62] Video-based Social Interaction Behavior Analysis with the Simulated Interaction Task for Children (Kids-SIT)
链接: https://arxiv.org/abs/2605.16270
作者: Rituja Pardhi,Matthias Norden,William Saakyan,Nadine Vietmeier,Simone Kirst,Isabel Dziobek,Julia Asbrand,Hanna Drimalla
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:Accurately quantifying children’s social interaction behavior is part of understanding their cognitive and emotional development, as well as mental health conditions. Kids-SIT is a web-based tool designed to computationally analyze children’s behaviors by engaging them in a standardized video conversation scenario while their responses are video recorded. In a pre-registered study with 21 healthy children, we evaluated the potential of the Kids-SIT as an accessible paradigm for automated analysis of children’s social interaction behavior. We assessed their subjective impression, as well as verbal and non-verbal responses during the Kids-SIT. Verbal content was analyzed using the LIWC tool. Three socially relevant non-verbal behaviors (gaze deviation, smiling, and nodding) were manually annotated and automatically extracted using three computational methods. We examined how well these methods capture naturalistic social interaction patterns of healthy children. We conducted an exploratory classification of healthy children (n=21) and those with social anxiety disorder (n=11) using automated behavioral features. The semantic analysis of the children’s verbal responses and their post-hoc impressions indicated that the Kids-SIT successfully elicited natural social interaction behavior. Children’s non-verbal behavior also showed similar pattern: they looked at their interaction partner for most of the time, particularly while listening than speaking. Smiling and gazing toward the partner occurred more frequently during the person-directed liked and disliked parts than during the picture-description phase. These non-verbal behavior patterns were captured both by manual annotations and by the computational analysis methods. In the exploratory analysis with a clinical sample, automatically extracted features enabled above-chance differentiation between children with and without SAD (AUC=0.74).
[HC-63] rain the Trainers – An Agent ic AI Framework for Peer-Based Mental Health Support in Battlefield Environments
链接: https://arxiv.org/abs/2605.16269
作者: Atmaram Yarlagadda,Eranga Bandara,Ross Gore,Anita H. Clayton,Preston Samuel,Christopher K. Rhea,Sachin Shetty,Ravi Mukkamala,Xueping Liang,Amin Hass,Abdul Rahman
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern military operations expose soldiers to sustained psychological stress, leading to acute reactions, post-traumatic stress symptoms, and other mental health issues. Although the U.S. Department of Defense offers evidence-based therapies, access to trained professionals in forward-deployed and contested environments is limited. As a result, soldiers with early-stage distress are often evacuated to rear medical facilities, delaying care, reducing readiness, and increasing long-term risks. This paper proposes a Train-the-Trainers framework in which soldiers who have completed therapy and returned to duty are trained as peer facilitators to provide first-line psychological support in operational settings. To scale and standardize this model under severe resource and connectivity constraints, we introduce an agentic AI-enabled platform that augments these recovered soldiers with specialized AI agents. The recovered soldier acts as a human supervisor, coordinating agents for symptom triage, guided peer-support interventions, operational constraint reasoning, training and simulation, and structured documentation for clinical escalation when needed. The AI agents use consensus-driven decision support in high-stakes environments. The architecture functions in air-gapped and low-connectivity settings, maintaining human oversight and ethical safeguards. A functional prototype was developed with the McDonald U.S. Army Health Center, Newport News, VA, USA. By combining peer-based intervention with consensus-driven agentic AI decision support, the framework seeks to cut response times, prevent symptom escalation, reduce unnecessary evacuations, and improve continuity of care. This work shows how agentic AI can serve as a force multiplier for mental health support in austere environments and identifies pathways for broader evaluation and deployment across defense and humanitarian operations.
[HC-64] Helping Customers in Distress: An LLM -powered Agent that Converses Probes and Routes
链接: https://arxiv.org/abs/2605.16268
作者: Alankar Atreya,Stefan Sylvius Wanger,Devesh Batra,Robert Hankache,Cristovao Iglesias Jr,Patrick Sinclair,Giulio Pelosio,Michael McMillan,Greig A. Cowan,Raad Khraishi
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Banks receive millions of reports of fraud, scams, and disputed transactions every year, making it challenging to accurately direct customers to the appropriate specialist teams for assistance. The existing manual process driven by humans is slow and stressful for both customers and staff. To address this, we develop a customer-facing AI powered triaging agent that leverages large language models (LLMs) to conduct multi-turn conversations, ask relevant questions, and classify cases for accurate, policy-guided routing, making it embedded in the customer journey. To evaluate and continuously improve the agent, synthetic digital twins of real customers were simulated, generating realistic, labelled dialogues based on historical data to test a wide range of real-world scenarios. This work details the triage agent’s modelling approach, integration with policy, safety guardrails and reasoning frameworks, the use of the synthetic agent for scalable evaluation, and findings on the AI system’s accuracy, robustness, and compliance. Results show that the agent successfully improves triaging of historical cases, achieving a 30.6% increase in classification accuracy, with high satisfaction levels reported by our subject-matter experts, highlighting how targeted probing can lead to more effective triage in banking operations at scale.
[HC-65] LLM -Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging
链接: https://arxiv.org/abs/2605.16264
作者: Nilesh Agrawal
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 17 pages, 1 figure, 7 tables. Code available at this https URL
Abstract:Push notifications remain among the most direct channels through which digital platforms engage users, yet existing approaches have invested heavily in who to notify, when to notify, and what to recommend, while leaving how to communicate as the least-optimized stage. This paper argues that message quality is an independent, underinvested lever, and that LLMs create their most differentiated value precisely at this layer. We make three contributions. First, we define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM-based composition improves each relative to templates. Across reviewed deployments, reported improvements range from +8% to +14.5% CTR over static templates and +1% to +2.5% over mature slot-filling systems, though these span heterogeneous systems and should not be treated as directly comparable. Second, we provide an architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing), arguing that observed gains are frequently misattributed to text generation alone. Third, we introduce a three-criterion decision framework specifying when LLM generation is and is not the binding constraint. We support these arguments through a PRISMA-guided survey (28 sources from 142 screened), examine domain-specific applications across social media, food delivery, and e-commerce, and propose a unified architectural framework with budget-aware routing, grounded generation, candidate ranking, diversity controls, and online learning. Comments: 17 pages, 1 figure, 7 tables. Code available at this https URL Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL) ACMclasses: H.3.5; I.2.7 Cite as: arXiv:2605.16264 [cs.HC] (or arXiv:2605.16264v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2605.16264 Focus to learn more arXiv-issued DOI via DataCite
[HC-66] Quantum Futures Interactive: A Live Demonstration of Post-Quantum Blockchain Security Infrastructure Tradeoffs and Sustainable Distributed Trust
链接: https://arxiv.org/abs/2605.15991
作者: Dongping Liu,Aoyu Zhang,Luyao Zhang
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
备注:
Abstract:Advances in quantum computing introduce long-term security challenges for widely deployed public-key cryptographic systems used across blockchain platforms and decentralized applications. Although post-quantum cryptography (PQC) standards are emerging, understanding quantum risk remains fragmented across research, engineering, governance, and investment communities. This demo presents Quantum Futures Interactive, a live interdisciplinary demonstration platform combining educational visualization, participatory interaction, and cryptographic artifact generation to illustrate the transition from classical to quantum-resilient blockchain systems. Participants engage in a structured interaction flow including quantum threat education, sentiment capture, technology prioritization, infrastructure tradeoff exploration, and generation of post-quantum cryptographic outputs. The system integrates distributed trust concepts, sustainability-aware infrastructure considerations, and responsible innovation within an interactive decision framework. The demonstration supports interdisciplinary dialogue on blockchain resilience while aligning with United Nations Sustainable Development Goals (SDGs).
计算机视觉
[CV-0] Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate
链接: https://arxiv.org/abs/2605.18754
作者: Soumava Paul,Prakhar Kaushik,Alan Yuille
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page at this https URL
Abstract:Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in NVS and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce \benchmark, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to 3\times more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to 4\times higher correlation with human judgments than MEt3R.
[CV-1] WavFlow: Audio Generation in Waveform Space KR
链接: https://arxiv.org/abs/2605.18749
作者: Feiyan Zhou,Luyuan Wang,Shoufa Chen,Zhe Wang,Zhiheng Liu,Yuren Cong,Xiaohui Zhang,Fanny Yang,Belinda Zeng
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
[CV-2] Aurora: Unified Video Editing with a Tool-Using Agent
链接: https://arxiv.org/abs/2605.18748
作者: Yongsheng Yu,Ziyun Zeng,Zhiyuan Xiao,Zhenghong Zhou,Hang Hua,Wei Xiong,Jiebo Luo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer’s conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: this https URL
[CV-3] LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
链接: https://arxiv.org/abs/2605.18739
作者: Yukang Chen,Luozhou Wang,Wei Huang,Shuai Yang,Bohan Zhang,Yicheng Xiao,Ruihang Chu,Weian Mao,Qixin Hu,Shaoteng Liu,Yuyang Zhao,Huizi Mao,Ying-Cong Chen,Enze Xie,Xiaojuan Qi,Song Han
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Code, model, and demos are available at this https URL
Abstract:We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
[CV-4] Spectral Progressive Diffusion for Efficient Image and Video Generation
链接: https://arxiv.org/abs/2605.18736
作者: Howard Xiao,Brian Chao,Lior Yariv,Gordon Wetzstein
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website at this https URL
Abstract:Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model’s power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.
[CV-5] PIXLRelight: Controllable Relighting via Intrinsic Conditioning
链接: https://arxiv.org/abs/2605.18735
作者: Miguel Farinha,Ronald Clark
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project page: this https URL . Under review
Abstract:We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse and forward rendering, or require costly per-image optimization. Our key idea is to bridge physically based rendering (PBR) and learned image synthesis through a shared intrinsic conditioning that can be obtained from either real photographs or PBR renders. At training time, paired multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals, which condition the model. At inference time, the same conditioning is computed from a path-traced render of a coarse 3D reconstruction of the input under user-specified PBR lights. A transformer-based neural renderer then applies the target illumination to the source photograph, preserving fine image detail through a per-pixel affine modulation. PIXLRelight enables arbitrary PBR-style lighting control, achieves state-of-the-art relighting quality, and runs in under a tenth of a second per image. Code and models are available at this https URL.
[CV-6] EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
链接: https://arxiv.org/abs/2605.18734
作者: Ruiping Liu,Junwei Zheng,Yufan Chen,Di Wen,Shaofang Quan,Chengzhi Wu,Jiaming Zhang,Kailun Yang,Kunyu Peng,Rainer Stiefelhagen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The source code and dataset can be found at this https URL
Abstract:Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains 2.6K high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E ^2 -Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only 55.3% . E ^2 -Select achieves state-of-the-art performance of 58.2% over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.
[CV-7] Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory
链接: https://arxiv.org/abs/2605.18733
作者: Jinzhuo Liu,Jiangning Zhang,Wencan Jiang,Yabiao Wang,Dingkang Liang,Zhucun Xue,Ran Yi,Yong Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Code: this https URL
Abstract:Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined strategies or retrieve keyframes based on coarse implicit attention signals, both of which fail to handle evolving prompts with shifting entity references, leading to identity drift, character duplication, and attribute loss. To address this, we propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. Specifically, an LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the proposed framework computationally practical, we design a systematic inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, which achieves faster generation than existing baselines. Furthermore, we introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39 \times speedup over the most efficient baseline in the 60-second multi-prompt setting.
[CV-8] Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction
链接: https://arxiv.org/abs/2605.18729
作者: Nga Teng Chan,Yi Zhang,Yechi Liu,Renwen Cui,Fanhu Zeng,Zeyuan Ding,Xiancong Ren,Zhang Zhang,Qifeng Chen,Jian Liu,Yong Dai,Xiaozhu Ju
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to “experiential amnesia,” where existing trajectory-driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo-Cortex, a self-evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection-adaptation loop. By abstracting success patterns and failure pitfalls into natural-language heuristics, Robo-Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual-Grain Cognitive Memory system, comprising a Short-term Reflective Memory (SRM) for real-time local progress analysis, and a Long-term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision-making, we introduce a multimodal Imagine-then-Verify loop, where a world model simulates potential outcomes and a VLM-based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo-Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robotic experiments further support the effectiveness of Robo-Cortex in physical settings.
[CV-9] SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
链接: https://arxiv.org/abs/2605.18719
作者: Komal Kumar,Ankan Deria,Abhishek Basu,Fahad Shamshad,Hisham Cholakkal,Karthik Nandakumar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Page 28, Image 20, Table 6
Abstract:Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image groundtruth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a \textitsteering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. Github: this https URL.
[CV-10] Semantic Generative Tuning for Unified Multimodal Models
链接: https://arxiv.org/abs/2605.18714
作者: Songsong Yu,Yuxin Chen,Ying Shan,Yanwei Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 13 figures
Abstract:Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation pattern. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available on the this https URL.
[CV-11] A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition CVPR2026
链接: https://arxiv.org/abs/2605.18700
作者: Edwin Arkel Rios,Augusto Christian Surya,Oswin Gosal,Fernando Mikael,Mary Madeline Nicole,Kisoon Jang,Bo-Cheng Lai,Min-Chun Hu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 4 figures
Abstract:Prior work on fine-grained image recognition (FGIR) has established the importance of the backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forfeiting the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations during training only can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs. To support future research we share our code and checkpoints at: \urlthis https URL
[CV-12] CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation CVPR2026
链接: https://arxiv.org/abs/2605.18680
作者: Rajeev Goel,Jason Ding,Phani Harish Wajjala,Pavan Turaga,Tejaswi Gowda,Krishna C. Garikipati
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Workshop (GRAIL-V)
Abstract:Metaverse platforms rely on creator-driven marketplaces where avatars are assembled from discrete, taxonomy-labeled 3D assets (e.g., tops, bottoms, shoes, accessories) under strict category and topology constraints. While users increasingly expect free-form text control, text-only retrieval is brittle: natural language is ambiguous with respect to platform taxonomies, metadata is often noisy or informal, and independently retrieved components can be stylistically inconsistent or geometrically incompatible. We propose \textbfCMAG, a concept-scaffolded retrieval and verified composition framework for marketplace avatar generation. Given a prompt, CMAG first synthesizes an intermediate 3D concept scaffold that disambiguates intent beyond text by providing global spatial and stylistic context. In parallel, a view-aware part discovery module extracts localized visual evidence via prompt decomposition and text-grounded segmentation. A prompt-conditioned taxonomy router enforces category coverage and resolves semantic-to-taxonomic mismatch, after which a hybrid category-wise retriever combines part-based fusion with a concept-residual fallback using feature suppression. Finally, an agentic vision–language model filters and re-ranks candidates across categories and drives an iterative verification loop to assemble prompt-faithful, topologically consistent avatars from catalog assets. We evaluate CMAG on diverse compositional prompts and demonstrate improved retrieval robustness and compositional correctness compared to strong baselines, highlighting the importance of 3D concept scaffolding under prompt ambiguity.
[CV-13] Lance: Unified Multimodal Modeling by Multi-Task Synergy
链接: https://arxiv.org/abs/2605.18678
作者: Fengyi Fu,Mengqi Huang,Shaojin Wu,Yunsheng Jiang,Yufei Huo,Hao Li,Yinghang Song,Fei Ding,Jianzhu Guo,Qian He,Zheren Fu,Zhendong Mao,Yongdong Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 34 pages, 14 figures, 10 tables, homepage url: this https URL , code url: this https URL
Abstract:We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at this https URL.
[CV-14] Better Together: Evaluating the Complementarity of Earth Embedding Models
链接: https://arxiv.org/abs/2605.18667
作者: Thijs L van der Plas,Jacob JW Bakermans,Vishal Nedungadi,Gabrielė Tijūnaitytė,Marc Rußwurm,Ioannis N Athanasiadis
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Earth embedding models transform Earth observation data into embeddings uniquely tied to locations on the Earth’s surface. These models are typically evaluated in isolation, comparing the downstream task performance across different Earth embeddings. However, spatially aligned embeddings can naturally be fused, providing richer information per location, a capability that isolated evaluations fail to capture. We therefore propose assessing Earth embeddings by their complementarity: the performance gain of fused embeddings over the best single-model baseline. To operationalise this, we introduce an embedding complementarity index applicable to any embedding and task, and evaluate four Earth embedding models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, in all pairs, and jointly across six downstream tasks. Fused embeddings outperform the best single model in four out of six tasks, confirming that single-embedding evaluations often underestimate Earth embedding capabilities. Complementarity proves both task- and location-dependent. Further, for a land cover regression task, we find that complementarity is partially determined by the spatial scale of land cover classes. Complementarity reframes Earth embeddings: the greatest future gains may come not from any single Earth embedding model, but from combinations that are better together.
[CV-15] MementoGUI: Learning Agent ic Multimodal Memory Control for Long-Horizon GUI Agents
链接: https://arxiv.org/abs/2605.18652
作者: Ziyun Zeng,Hang Hua,Bocheng Zou,Mu Cai,Rogerio Feris,Jiebo Luo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, 15 pages, 4 figures, 5 tables
Abstract:Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce \textbfMementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with \textbfMementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce \textbfMementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
[CV-16] Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video
链接: https://arxiv.org/abs/2605.18645
作者: Arslan Artykov,Tom Ravaud,Nicolás Violante-Grezzi,Vincent Lepetit
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods. Project page: this https URL
[CV-17] Leverag ing Latent Visual Reasoning in Silence
链接: https://arxiv.org/abs/2605.18641
作者: Dongyao Zhu,Zhen Wang,Xi Xiao,Han Jiang,Saeed Vahidian,Wei-Lun Chao,Tanya Berger-Wolf,Yu Su,Raju Vatsavai,Jianyang Gu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available at \hrefthis https URLGitHub and \hrefthis https URLHugging Face.
[CV-18] SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents
链接: https://arxiv.org/abs/2605.18636
作者: Wencan Jiang,Jiangning Zhang,Jianbiao Mei,Jinzhuo Liu,Yu Yang,Xiaobin Hu,Zhucun Xue,Yong Liu,Dacheng Tao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.
[CV-19] CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLM s with Dataset Model and Benchmark
链接: https://arxiv.org/abs/2605.18621
作者: Wei Wang,Yuqian Yuan,Tianwei Lin,Wenqiao Zhang,Siliang Tang,Jun Xiao,Yueting Zhuang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we thoroughly develop CrossView Suite across three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we meticulously create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM, evaluating it across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception - Alignment - Reasoning paradigm. Our method equips an adaptive spatial region tokenizer to capture fine-grained object representations, and then aligns the multi-view objects explicitly, and thus fuses aligned features for boosting the cross-view inference capacity for MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at this https URL.
[CV-20] ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics ICML2026
链接: https://arxiv.org/abs/2605.18617
作者: Ziyu Wei,Luting Wang,Chen Gao,Li Wen,Si Liu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICML 2026
Abstract:Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce \ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, \ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoiding. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Out codes and datasets are released at this https URL.
[CV-21] CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic
链接: https://arxiv.org/abs/2605.18610
作者: Shen Lin,Junhao Dong,Rongjie Chen,Xiaoyu Zhang,Li Xu,Xiaofeng Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.
[CV-22] Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging CVPR2026
链接: https://arxiv.org/abs/2605.18608
作者: Zhilin Zhu,Yabin Wang,Zhiheng Ma,Yaguang Song,Yaowei Wang,Xiaopeng Hong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle dynamic distribution shifts encountered after deployment. Existing methods predominantly follow a backward-alignment paradigm, which rigidly aligns incoming data with supervisory surrogates derived from the source domain. Consequently, they struggle with unreliable supervision and evolving distribution shifts. To overcome these limitations, we introduce a novel forward-facilitation paradigm through a method termed Dynamic Style Bridging. Prior to deployment, we construct a compact knowledge base of generated class exemplars. During test time, to mitigate inherent generative bias and adapt these proxies to incoming data, we propose a multi-level bridging mechanism. This mechanism dynamically injects the proxies with incoming data styles at the input, statistical, and representation levels, while preserving the original semantics of the proxies. These high-fidelity proxies are then used to provide reliable, on-demand supervisory signals, enabling stable adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate that our method achieves consistent and substantial improvements over recent state-of-the-art approaches. Code is available at \hrefthis https URL.
[CV-23] Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
链接: https://arxiv.org/abs/2605.18603
作者: Yuhuan Wu,Cong Wei,Fangzhen Lin,Wenhu Chen,Haozhe Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception – the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth – restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes – serving as a minimal, plug-in modification to standard post-training pipelines – models trained under perceptual starvation achieve substantial gains of 5% average relative improvement across diverse benchmarks.
[CV-24] Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models
链接: https://arxiv.org/abs/2605.18601
作者: Shangwen Zhu,Qianyu Peng,Zhao Pu,Zhilei Shu,Xiangrui Ke,Zhaohu Xing,Zizhao Tong,Zeqing Wang,Xinyu Cui,Huangji Wang,Jian Zhao,Yeying Jin,Fan Cheng,Ruili Feng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at this https URL, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
[CV-25] Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling
链接: https://arxiv.org/abs/2605.18599
作者: Yihang Wu,Yihang Sun,Shaofeng Zhang,Zuxuan Wu,Junchi Yan,Xiaosong Jia,Yu-gang Jiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 11 figures, 4 tables. Project page: this https URL
Abstract:Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.
[CV-26] OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
链接: https://arxiv.org/abs/2605.18577
作者: Ruixiang Zhao,Jie Yang,Zijie Xin,Tianyi Wang,Fengyun Rao,Jing LYU,Xirong Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
[CV-27] StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video
链接: https://arxiv.org/abs/2605.18553
作者: Huajian Zeng,Chaohua Yao,Yuantai Zhang,Jiaqi Yang,Rolandos Alexandros Potamias,Xingxing Zuo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Recovering world space 4D motion of two interacting hands from egocentric video is a fundamental capability for supervising robot policy learning, where wrist trajectories track the end-effector and finger articulations specify the grasp pose. Two major challenges arise in this setting: hands frequently leave the camera view for extended periods due to head motion, and persistent hand-object interactions cause severe occlusions of one or both hands. Existing methods uniformly condition on noisy hand motion observations without accounting for their per-frame reliability, leading to substantial performance degradation. Our key insight is that accurate world space hand motion estimation is tightly coupled with the quality of per-frame hand observations. To this end, we decompose the quality of hand motion observations extracted from an off-the-shelf hand pose estimator into four channels: wrist global translation and finger articulations for both hands. We propose StableHand, a quality-aware flow-matching framework conditioned on these four-channel quality signals, which are predicted by a learned quality network. We naturally incorporate the quality signals into the flow-matching process through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization. This unified generative process preserves high-quality observations while reconstructing unreliable ones using a learned bimanual motion prior. Experiments on HOT3D and ARCTIC, two egocentric benchmarks featuring long missing-hand spans and persistent hand-object occlusions, show that StableHand achieves state-of-the-art performance across all reported metrics, reducing W-MPJPE by 20-25% compared to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.
[CV-28] LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift
链接: https://arxiv.org/abs/2605.18541
作者: Haozhe Si,Yuxuan Wan,Yuqing Wang,Minh Do,Han Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modeling hyperspectral imagery (HSI) across different sensors presents a fundamental challenge due to variations in wavelength coverage, band sampling, and channel dimensionality. As a result, models trained under a fixed spectral configuration often fail to generalize to other sensors. Existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or adopt explicit spatial-spectral attention with prohibitive computational cost, leading to a fundamental trade-off between efficiency and expressiveness. In this work, we introduce Low-rank Efficient Spatial-Spectral ViT (LESSViT), a sensor-flexible architecture for cross-spectral generalization. LESSViT is built on LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components, reducing the complexity of full spatial-spectral attention from O(N^2 C^2) to O(rNC) , where N is the number of spatial tokens, C is the number of spectral channels, and r is the rank of the low-rank approximation. We further incorporate channel-agnostic patch embedding and wavelength-aware positional encoding to support flexible spectral inputs. To enable efficient and robust pretraining, we introduce a hyperspectral masked autoencoder (HyperMAE) with decoupled spatial-spectral masking and hierarchical channel sampling. We evaluate LESSViT under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments on the SpectralEarth benchmark demonstrate that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution, and explicit and efficient spatial-spectral modeling is essential for scalable and generalizable hyperspectral representation learning.
[CV-29] Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification
链接: https://arxiv.org/abs/2605.18522
作者: Farnaz Kheiri,Shahryar Rahnamayan,Masoud Makrehchi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.
[CV-30] Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation ICML2026
链接: https://arxiv.org/abs/2605.18507
作者: Jingyun Fu,Zhiyu Xiang,Na Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2026
Abstract:Due to the difficulty of obtaining ground-truth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar’s inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architecture and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specially, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar’s intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at \hrefthis https URLthis https URL.
[CV-31] Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks
链接: https://arxiv.org/abs/2605.18491
作者: Jue Jiang,Harini Veeraraghavan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper submitted to Medical Physics for review
Abstract:Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89~M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine tuning were further analyzed using centered kernel alignment. Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets. Comments: Paper submitted to Medical Physics for review Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.18491 [cs.CV] (or arXiv:2605.18491v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.18491 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-32] InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
链接: https://arxiv.org/abs/2605.18467
作者: Haojie Zheng,Yixin Yang,Siqi Yang,Shuchen Weng,Boxin Shi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive experiments demonstrate that InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects on two evaluation sets, highlighting its potential for controllable content creation. Project page: this https URL.
[CV-33] Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI
链接: https://arxiv.org/abs/2605.18466
作者: Daiqi Liu,Lukas Mulzer,Md Hasan,Nyvenn de Castro,Fangxu Xing,Xingjian Kang,Chengze Ye,Siyuan Mei,Yipeng Sun,Tomás Arias-Vergara,Jana Hutter,Jonghye Woo,Andreas Maier,Paula Andrea Pérez-Toro
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review
Abstract:Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on 75-Speaker~Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.
[CV-34] PERL: Parameter Efficient Reasoning in CLIP Latent Space NEURIPS2026
链接: https://arxiv.org/abs/2605.18464
作者: Simone Carnemolla,Salvatore Calcagno,Daniela Giordano,Concetto Spampinato,Matteo Pennisi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to NeurIPS 2026
Abstract:Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP’s pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.
[CV-35] Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agent ic Code Synthesis
链接: https://arxiv.org/abs/2605.18451
作者: Yixuan Yang,Zhen Luo,Wanshui Gan,Jinkun Hao,Junru Lu,Jinghao Yan,Zhaoyang Lyu,Xudong Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms with Blender codes. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.
[CV-36] NeRF-based Spacecraft Reconstruction from Close-Range Monocular Imagery Under Illumination Variability and Pose Uncertainty
链接: https://arxiv.org/abs/2605.18447
作者: Antoine Legrand,Renaud Detry,Christophe De Vleeschouwer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous rendezvous and proximity operations around uncooperative, unknown spacecraft are critical for active debris removal and on-orbit servicing missions. A key component of such operations is the offline reconstruction of a 3D model of the target from a set of 2D images. This task is challenging due to two main factors. First, in-orbit illumination conditions exhibit considerable variability, and change rapidly over time. Second, the inaccuracy of pose information in the images, results in 3D reconstruction uncertainty. To overcome these challenges, we propose to extend Neural Radiance Fields with per-image degrees of freedom: a learnable appearance embedding that captures the illumination conditions specific to each image, and an image-specific pose correction term that refines its noisy pose label to increase 3D consistency across images. These parameters add minimal complexity, as they are learned jointly with the NeRF, yet they substantially improve robustness to illumination variability and pose inaccuracies. We validate our approach on three image sets representative of in-orbit operations, demonstrating its effectiveness for offline reconstruction and highlighting its suitability for online reconstruction, an open problem in the field.
[CV-37] A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation
链接: https://arxiv.org/abs/2605.18436
作者: Pau Torras,Jiří Mayer,Carles Badal,Martina Dvořáková,Markéta Herzanová Vlková,Gerard Asbert,Vojtěch Dvořák,Samuel Šomorjai,Jan Hajič jr.,Alicia Fornés
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review at Scientific Data
Abstract:A large amount of musical heritage has been digitised by memory institutions: libraries, museums, and archives. Nevertheless, the field of Optical Music Recognition (OMR) has struggled with making this music machine-readable, despite advances in deep learning, mostly because no datasets for training systems in realistic conditions were available. The MusiCorpus dataset aims to remedy this situation by providing 1,309 pages of historical sheet music, primarily handwritten, with MusicXML transcriptions and symbol annotations. It is the largest dataset of handwritten music to date and the first dataset containing a realistic and representative sample of musical document collections from memory institutions, suitable for training and evaluating both end-to-end and object detection-based OMR systems and comparing their performance.
[CV-38] Seeing Together:Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
链接: https://arxiv.org/abs/2605.18431
作者: Kunyu Peng,Zhikun Zhou,Kailun Yang,Di Wen,Ruiping Liu,Yufan Chen,Junwei Zheng,Hao Shi,Yi Zhou,M. Saquib Sarfraz,Danda Pani Paudel,Luc Van Gool
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at this https URL.
[CV-39] Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology
链接: https://arxiv.org/abs/2605.18419
作者: Franciskus Xaverius Erick,Johanna Paula Müller,Bernhard Kainz
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM’s joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.
[CV-40] Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models
链接: https://arxiv.org/abs/2605.18413
作者: Nicola Farronato,Niccolo Avogaro,Thomas Frick,Mattia Rigotti,Rizwan Ullah Khan,Michele Magno,Konrad Schindler,Cristiano Malossi,Florian Scheidegger
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel-level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center-bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising \approx 150,000 high-resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today’s specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure and even the performance of specialised models with domain-specific supervision plateaus at \approx 25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.
[CV-41] Historical Knowledge Graphs for Global Maritime Estimated Time of Arrival
链接: https://arxiv.org/abs/2605.18408
作者: Neofytos Dimitriou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate vessel estimated-time-of-arrival forecasts are critical for port operations and decarbonization, yet global-scale travel-time prediction remains difficult without costly contextual data. Herein, I present a methodology for constructing a historical maritime knowledge graph using only Automatic Identification System (AIS) data. First, segmented trajectories are extracted from noisy AIS data using a Gaussian-mixture-model-based preprocessing pipeline. The graph is then constructed by iteratively processing the trajectories and storing speed distributions stratified by vessel type, time of travel, and direction of travel; the resulting global graph comprises 5,433 geohash-3 nodes and 12,334 edges. The graph can be queried to retrieve travel-time predictions between any two location via a hierarchical, priority-based system that uses historical statistics with principled fallback. On a temporally held-out test set, median RMSE is 22.75 min (segment-level) and 30.90 min (trajectory-level), with 69.1% of trajectories within 20% of actual arrival time. On a second external test set, median RMSE is 27.36 min (segment-level) and 37.46 min (trajectory-level), with 62.1% of trajectories within 20%. These results corroborate the promise of our method, enabling global travel-time prediction and providing a strong foundation for just-in-time arrival planning and emissions reduction.
[CV-42] Generalize cross-ratios in n-dimensional Plane-Based Geometric Algebra
链接: https://arxiv.org/abs/2605.18398
作者: Enzo Harquin(LIGM),Stephane Breuils(LAMA),Pascal Monasse(ENPC),Venceslas Biri(LIGM),Vincent Nozick(LIGM)
类目: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We develop a complete theory of projective cross-ratios in n-dimensional Plane-Based Geometric Algebra (PGA), R(n,0,1), covering geometric objects of every grade: finite and ideal points, hyperplanes, and intermediate flats. For each object type and configuration, we establish an explicit cross-ratio formula, prove that it recovers the appropriate classical invariant, and identify the canonical pairwise measurement operator. A systematic duality analysis further revealed that all eight configurations organize into four dual pairs under the Hodge dual, and that all measurement operators reduce to either the commutator or the commutator dual, depending solely on the geometric configuration rather than on object grade. In each case the formula recovers the appropriate classical invariant: signed distance ratios for parallel configurations and sine cross-ratios for secant ones. These results establish the cross-ratio as a grade-agnostic projective invariant within PGA, and provide a constructive foundation for defining n-dimensional homographies directly from prescribed invariants.
[CV-43] NEWTON: Agent ic Planning for Physically Grounded Video Generation
链接: https://arxiv.org/abs/2605.18396
作者: Yuxiang Feng,Juncheng Wang,Chao Xu,Yijie Qian,Huihan Wang,Wenlong Hou,Yang Liu,Baigui Sun,Yong Liu,Shujun Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:Video generation models produce visually compelling results but systematically violate physical commonsense – on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy – sufficiency, dynamism, and verifiability – and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent’s toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: \hrefthis https URLthis https URL
[CV-44] Vision Foundation Models as Generalist Tokenizers for Image Generation
链接: https://arxiv.org/abs/2605.18390
作者: Anlin Zheng,Qi Han,Xin Wen,Chuofan Ma,Lanxi Gong,Gang Yu,Xiangyu Zhang,Xiaojuan Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 figures and 14 tables
Abstract:In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM’s representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by \textbf3 times and achieves a state-of-the-art gFID of \textbf1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of \textbf1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (\textbfw/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.
[CV-45] GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
链接: https://arxiv.org/abs/2605.18365
作者: Jan Ackermann,Shengqu Cai,Boyang Deng,Zhengfei Kuang,Songyou Peng,Gordon Wetzstein
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.
[CV-46] RAVE: Re-Allocating Visual Attention in Large Multimodal Models
链接: https://arxiv.org/abs/2605.18359
作者: Xi Leng,Xinhong Ma,Ziqiang Dong,Feng Zhang,Xiaoying Tang,Yang Yang,Guanjun Jiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query–key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks – including multilingual OCR, chart understanding, document VQA, and scene text VQA – where accurate visual grounding is critical.
[CV-47] Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport
链接: https://arxiv.org/abs/2605.18349
作者: Aida Rostamza,Enrico Del Re,Joshua Cherian Varughese,Cristina Olaverri-Monreal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.
[CV-48] Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion
链接: https://arxiv.org/abs/2605.18346
作者: Peiliang Cai,Evelyn Zhang,Jiacheng Liu,Hao Lin,Ruiqi Zhang,Weile Mo,Yue Ma,Shikang Zheng,Jiehang Huang,Dongrui Liu,Linfeng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose \textbfFocused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to \textbf1.48\times end-to-end acceleration without training, while \textbfimproving visual quality and text alignment. \textitOur code will be released on GitHub.
[CV-49] 3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine
链接: https://arxiv.org/abs/2605.18334
作者: Beizhen Zhao,Yifan Zhou,Gaochao Song,Ziran Yin,Hao Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 16 pages
Abstract:While 3D Gaussian Splatting (3DGS) has revolutionized real-time photorealistic view synthesis, its fundamental reliance on symmetric Gaussian distributions introduces visual artifacts that hinder accurate spatial data exploration. Specifically, symmetric kernels struggle to capture shape and color discontinuities , which cause blurriness and primitive redundancy that mislead human perception during visual analysis. To address these visualization barriers, we introduce 3D Skew Gaussian Splatting (3DSGS), a novel framework that significantly enhances the structural fidelity and compactness of explicit scene representations. Our key insight lies in extending the standard primitive to a general Skew Gaussian counterpart. This generalized primitive inherits the highly efficient rasterization properties of standard Gaussians while gaining intrinsic asymmetric modeling capabilities. We couple this with an enhanced opacity representation to better handle complex transparency, alongside a depth-aware densification strategy that intelligently manages primitive allocation. Furthermore, to make these advancements actionable for real-world visual analytics, we re-derive the CUDA rasterization pipeline to universally support both symmetric and skew Gaussians, integrating it into a decoupled, free-camera interactive visualization engine. Extensive experiments demonstrate that 3DSGS achieves superior rendering quality and structural compactness, particularly in regions with intricate details, while maintaining the real-time frame rates necessary for fluid interactive exploration. Supplementary derivations and visual results are available at \textbf\textitthis https URL.
[CV-50] Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation
链接: https://arxiv.org/abs/2605.18329
作者: Kirscher Tristan(ICube,DKFZ),Bujotzek Markus(DKFZ),Kirchhoff Yannick(DKFZ),Rokuss Maximilian(DKFZ),Isensee Fabian(DKFZ),Kahl Kim-Celine(DKFZ),Kovacs Balint(DKFZ),Maier-Hein Klaus(DKFZ)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as ``deep ensembles’’ (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology–implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
[CV-51] CineMatte: Background Matting for Virtual Production and Beyond
链接: https://arxiv.org/abs/2605.18328
作者: Yuanjian He,Chen Zhang,Fasheng Chen,Jiangbo Cao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional “detail branch” to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.
[CV-52] Improved Baselines with Representation Autoencoders
链接: https://arxiv.org/abs/2605.18324
作者: Jaskirat Singh,Boyang Zheng,Zongze Wu,Richard Zhang,Eli Shechtman,Saining Xie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for “free”. Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr^k, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EP_FID@k (epochs to reach unguided gFID = k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. Code is available at this https URL.
[CV-53] Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering
链接: https://arxiv.org/abs/2605.18313
作者: Luca Hagen,Johanna P. Müller,Weitong Zhang,Mengyun Qiao,Bernhard Kainz
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Small vision-language models (2-8B) are well-suited for clin- ical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language mod- els for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, en- abling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clini- cally equivalent ranking swaps. On VQA-RAD and PathVQA, we ob- tain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain- specific fine-tuning. At accuracy parity with classic BDG, the Wasser- stein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at this https URL Wasserstein-BDG-medical-VQA.
[CV-54] PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics
链接: https://arxiv.org/abs/2605.18303
作者: Xueyu Luan,Chenwei Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 12 pages, 3 figures
Abstract:World models built on recurrent state space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower variance alignment between imagined and real rewards, all while reducing latent phase space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.
[CV-55] Collision-Resistant Single-Pass Method for Unsupervised Fine-Grained Image Hashing ICIP2026
链接: https://arxiv.org/abs/2605.18288
作者: Anh-Kiet Duong,Petra Gomez-Krämer,Jean-Michel Carozza
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, accepted to ICIP 2026
Abstract:Unsupervised fine-grained image hashing aims to learn compact binary codes that preserve subtle visual differences among highly similar instances without manual annotations. However, most existing methods neglect collision resistance, leading to identical hash codes for slightly semantically different samples. In this paper, we propose Collision-Resistant Single-Pass Self-Supervised Semantic Hashing (CS3H), a collision-resistant framework that directly optimizes Hamming-space similarity via a single-pass normalized Hamming distance loss to produce well-separated binary representations. We further introduce a collision-sensitive attention module to emphasize rare and discriminative local patterns, reducing hash collisions and improving fine-grained discrimination. Experiments on multiple benchmarks show that CS3H consistently outperforms state-of-the-art methods in retrieval accuracy while achieving superior collision resistance with minimal computational overhead.
[CV-56] StableVLA: Towards Robust Vision-Language-Action Models without Extra Data ICML2026
链接: https://arxiv.org/abs/2605.18287
作者: Yiyang Fu,Chubin Zhang,Shukai Gong,Yufan Deng,Kaiwei Sun,Qiyang Min,Qibin Hou,Yansong Tang,Jianan Wang,Daquan Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICML 2026. Code: this https URL . Project website: this https URL
Abstract:It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
[CV-57] SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation
链接: https://arxiv.org/abs/2605.18267
作者: Longtao Jiang,Jiangmin Bao,Zhendong Wang,Xin Tao,Pengfei Wan,Zhihui Li,Xiaojun Chang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Normalizing flows (NFs) provide exact likelihoods and deterministic invertible sampling, but have historically lagged behind diffusion models for large-scale image generation. We identify a key obstacle: NFs are required to learn a single invertible transport over the full ambient space, making them highly sensitive to high-dimensional representations. This leads to a semantic-capacity mismatch in modern visual representation spaces, where semantic information is compact but encoded in overcomplete features. We propose SRC-Flow, which introduces a Semantic Representation Compressor (SRC) to compact high-dimensional RAE features into a low-dimensional semantic space before flow modeling and preserve reconstruction through the frozen RAE decoder. This compact space reduces the modeling burden of NFs and enables effective likelihood-based generation in semantic representation space. We further adopt constant noise regularization tailored to the fixed unconditional bijection learned by flows. On ImageNet 256 \times 256 and 512 \times 512 , SRC-Flow achieves state-of-the-art generation quality among normalizing flow methods, with gFID scores of 1.65 and 2.07 under classifier-free guidance, while retaining exact likelihood computation in the compact semantic representation space and deterministic invertible sampling at the flow level. Codes and models will be available at this https URL.
[CV-58] RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting CVPR2026
链接: https://arxiv.org/abs/2605.18263
作者: Ji Shi,Xianghua Ying,Bowei Xing,Ruohao Guo,Wenzhen Yue
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Highlight, Project Page: this https URL
Abstract:3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian’s geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing. The project page is available at this https URL.
[CV-59] GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
链接: https://arxiv.org/abs/2605.18252
作者: Jiale Shi,Jiarui Hu,Zesong Yang,Kaixuan Luan,Hujun Bao,Zhaopeng Cui
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures
Abstract:We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks\Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.
[CV-60] Non-Colliding Biometric Identities for Digital Entities: Geometry Capacity and Million-Scale Virtual Identity Provisioning
链接: https://arxiv.org/abs/2605.18238
作者: Yuyang Ji,Yixuan Shen,Anil Jain,Xiaoming Liu,Feng Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 11 figures
Abstract:Digital entities such as AI agents and humanoid robots increasingly operate alongside real humans, yet their identity infrastructure is based on credentials rather than embodied biometric identity. We introduce Biometric Identity Provisioning (BIP), a new problem and solution framework that addresses: given an enrollment gallery of real human identities, provision virtual identities that are non-colliding with every enrolled identity, maintain sufficient inter-class separability, and are realizable as high-fidelity face images. The key geometric insight is that real face identities occupy a low-dimensional subspace of the embedding hypersphere, leaving no residual subspace for virtual identities. Hence, virtual identities must instead be allocated as unclaimed gaps within the real face manifold itself. BIP is therefore a constrained packing problem: available gaps vastly exceed any foreseeable enrollment scale, and provisioned identities remain non-colliding even as new real identities are subsequently enrolled. Grounded in this geometry, our repulsion-based allocation is not bounded by any fixed provisioning count; we demonstrate 10M non-colliding virtual identity embeddings against a gallery of 360K real identities. Realizing these embeddings as face images requires a generator that operates outside the training distribution of real face images; we introduce GapGen, a gap-aware generator trained with a curriculum that progressively extends synthesis into non-colliding regions, validated at 1M photorealistic virtual face images. We further construct v-LFW, a virtual counterpart to LFW face dataset, with protocols for virtual face verification, cross-reality matching, real-vs-virtual detection, and unified recognition and detection.
[CV-61] Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos ICML2026
链接: https://arxiv.org/abs/2605.18233
作者: X. Feng,J. Zhu,M. Wu,C. Chen,F. Mao,H. Guo,J. Wu,X. Chu,K. Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026~
Abstract:Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose \textbfMIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at this https URL.
[CV-62] EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation
链接: https://arxiv.org/abs/2605.18214
作者: Rosario Leonardi,Francesco Ragusa,Daniele Materia,Alessandro Passanisi,James Fort,Jakob Engel,Giovanni Maria Farinella
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.
[CV-63] SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning CVPR2026
链接: https://arxiv.org/abs/2605.18209
作者: Pawat Chunhachatrachai,Gueter Josmy Faure,Hung-Ting Su,Winston H. Hsu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, 2nd Workshop on 3D-LLM/VLA, CVPR 2026
Abstract:Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template – without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
[CV-64] RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots
链接: https://arxiv.org/abs/2605.18197
作者: Giorgia Modi,Davide Buoso,Giuseppe Averta,Daniele De Martini
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.
[CV-65] Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
链接: https://arxiv.org/abs/2605.18194
作者: Yajing Zhou,Xiangyu Kong
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 3 figures
Abstract:While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a “Cartesian Illusion” - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B’s belief about the environment, governed strictly by Agent B’s physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B’s estimation of A’s relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a “geometric-to-semantic” projection, forcing it to first establish B’s local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B’s visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.
[CV-66] Best Segmentation Buddies for Image-Shape Correspondence CVPR2026
链接: https://arxiv.org/abs/2605.18193
作者: Itai Lang,Dongwei Lyu,Dale Decatur,Rana Hanocka
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: CVPR 2026. Project page: this https URL
Abstract:Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at this https URL.
[CV-67] View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification CVPR2026
链接: https://arxiv.org/abs/2605.18192
作者: Quan Zhang,Zeqiang Cai,Peiming Zhao,Jingze Wu,Cailun Wu,Hongbo Chen,Jianhuang Lai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 POSTER
Abstract:Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, view-invariant inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency containing an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts. Extensive experiments on three AGPReID benchmarks including AG-ReID.v2, CARGO and LAGPeR demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code is available at \hrefthis https URLthis https URL.
[CV-68] Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network
链接: https://arxiv.org/abs/2605.18190
作者: Grigory Bartosh,David Ruhe,Emiel Hoogeboom,Jonathan Heek,Thomas Mensink,Tim Salimans
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models achieve state-of-the-art generative performance but suffer from high computational costs during inference due to the repeated evaluation of a heavy neural network. In this work, we propose Dual-Rate Diffusion, a method to accelerate sampling by interleaving the execution of a heavy high-capacity context encoder and a light efficient denoising model. The context encoder is evaluated sparsely to extract high-dimensional features, which are effectively reused by the light denoising model at every step to refine the sample efficiently. This approach significantly accelerates inference without compromising sample quality. On ImageNet benchmarks, Dual-Rate Diffusion matches the performance of standard baselines while reducing computational cost by a factor of 2 - 4 . Furthermore, we demonstrate that our method is compatible with distillation techniques, such as Moment Matching Distillation, enabling further efficiency gains in few-step generation.
[CV-69] Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation
链接: https://arxiv.org/abs/2605.18184
作者: Giorgia Modi,Davide Buoso,Giuseppe Averta,Daniele De Martini
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras - onboard or external - identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.
[CV-70] oken-Space Mask Prediction for Efficient Vision Transformer Segmentation CVPR
链接: https://arxiv.org/abs/2605.18177
作者: Calvin Galagain,Martyna Poreba,François Goulette
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR, EVW 2026
Abstract:Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
[CV-71] MARS: Technical Report for the CASTLE Challenge at EgoVis 2026
链接: https://arxiv.org/abs/2605.18176
作者: Haoyu Zhang,Qiaohui Chu,Yisen Feng,Meng Liu,Weili Guan,Yaowei Wang,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The Runner-up Solution for CASTLE Challenge @ EgoVis 2026
Abstract:This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heartrate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources, videos and transcripts, and four auxiliary sources, gaze, heartrate, photos, and thermal imagery. Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our codes are available at this https URL.
[CV-72] Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting
链接: https://arxiv.org/abs/2605.18173
作者: Antonio Colombo,Giovanni Bianchi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.
[CV-73] Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
链接: https://arxiv.org/abs/2605.18162
作者: Junming Liu,Yuqi Li,Yifei Sun,Maonan Wang,Piotr Koniusz,Yirong Chen,Ding Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 7 figures, 3 tables
Abstract:Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
[CV-74] Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
链接: https://arxiv.org/abs/2605.18160
作者: Xinpeng Dong,Min Zhang,Kairong Han,Xu Tan,Fei Wu,Kun Kuang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) Although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) As generation length increases, particularly within a limited context window, the model’s dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model’s output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at this https URL.
[CV-75] Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares
链接: https://arxiv.org/abs/2605.18156
作者: Xiyu Zhu,Wei Wang,Kui Jiang,Zhengguo Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Lens flare removal is challenging due to the large spatial extent of flare artifacts and their entanglement with scene structures, while existing methods heavily rely on large-scale paired data. We propose a semi-supervised flare removal framework that enables stable learning from unlabeled images by jointly addressing pseudo-label reliability and representation discrimination. We propose an adaptive pseudo-label repository that progressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid label filtering, effectively mitigating error accumulation. Moreover, we propose a flare-aware contrastive loss that explicitly treats flare-contaminated inputs as negatives and performs patch-level contrastive learning, encouraging representations that are discriminative against flare patterns while remaining consistent with reliable pseudo targets. Extensive experiments on multiple flare benchmarks demonstrate that the proposed framework is model-agnostic and consistently improves performance and robustness.
[CV-76] Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
链接: https://arxiv.org/abs/2605.18137
作者: Lijun Zhou,Hongcheng Luo,Zhenxin Zhu,Cheng Chi,Mingfei Tu,Kaixin Xiong,Lei Gong,Zhanqian Wu,Zehan Zhang,Fangzhen Li,Hao Li,Yingying Shen,Jiale He,Haohui Zhu,Shan Zhao,Kai Wang,Zhiwei Zhan,Yuechuan Pu,Kaiyuan Tan,Ruiling Yang,Xianqi Wang,Tianyi Yan,Jiawei Zhou,Lei Zhang,Jingyang Zhao,Xi Zhou,Chitian Sun,Chenming Wu,Jiong Deng,Hongwei Xie,Ming Lu,Kun Ma,Long Chen,Guang Chen,Hangjun Ye,Bing Wang,Haiyang Sun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.
[CV-77] Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models
链接: https://arxiv.org/abs/2605.18132
作者: Sihan Ma,Siyuan Liang,Dacheng Tao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.
[CV-78] Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis
链接: https://arxiv.org/abs/2605.18130
作者: Fengyi Zhang,Xujie Zeng,Mohan Liu,Zengyi Wang,Yalong Jiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization rather than direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.
[CV-79] WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
链接: https://arxiv.org/abs/2605.18115
作者: Yiwei Guo,Shaobin Zhuang,Zhipeng Huang,Canmiao Fu,Chen Li,Jing Lyu,Yali Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at this https URL.
[CV-80] askGround: Structured Executable Task Inference for Full-Scene Household Reasoning
链接: https://arxiv.org/abs/2605.18109
作者: ZhiYuan Feng,Yu Deng,Ruichuan An,Zhenhua Liu,Qixiu Li,Keming Wu,Zhiying Du,Weijie Wang,Haoxiao Wang,Shuang Chen,Sicheng Xu,Yaobo Liang,Jiaolong Yang,Baining Guo
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.
[CV-81] DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos
链接: https://arxiv.org/abs/2605.18102
作者: Wenhao Shen,Ming Zhou,Hengyuan Zhang,Siyuan Bian,Youjiang Xu,Xi Lin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Monocular video human mesh recovery is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. Our model unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction and competitive body accuracy. Our method also produces temporally stable and 2D-consistent SMPL-X motion in challenging real-world videos.
[CV-82] SENSE: Satellite-based ENergy Synthesis for Sustainable Environment KDD2026
链接: https://arxiv.org/abs/2605.18101
作者: Kailai Sun,Mingyi He,Heye Huang,Can Rong,Alok Prakash,Baoshen Guo,Shenhao Wang,Jinhua Zhao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accpted by KDD 2026 (Oral)
Abstract:Urban Building Energy Modeling plays a critical role in achieving the United Nations’ Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have achieved remarkable progress, many challenges exist: most existing studies are inherently predictive, failing to reflect the generative nature of urban planning; although generative AI and diffusion models have seen explosive growth in satellite imagery, they lack the urban functional generation (e.g., energy layer); third, aligned high-quality high-resolution building energy data with satellite imagery is limited and scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. By conditioning on road networks and urban density metrics, SENSE, based on a controllable diffusion model, leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. Experiments demonstrate that SENSE can generate enough annotated synthetic data using less than 20% labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduced prediction error (reduced 3%-11% NMBE and 1%-9% CVRMSE). This study offers an energy-efficiency urban planning and physical generation solution for urban science, energy science and building science. The dataset and code: this https URL and this https URL.
[CV-83] he MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting
链接: https://arxiv.org/abs/2605.18063
作者: Corentin Dumery,Niki Amini-Naieni,Shervin Naini,Pascal Fua
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Co-first authors. Dataset and project page this https URL
Abstract:Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.
[CV-84] Embedded ConvNet Ensembles: A Lightweight Approach to Recognize Arabic Handwritten Characters
链接: https://arxiv.org/abs/2605.18060
作者: Mohsine El Khayati,Rachid Elouahbi,Abdelillah Semma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026
Abstract:Arabic Handwritten Character Recognition (AHCR) has recently advanced significantly with deep Convolutional Neural Networks (ConvNets). However, many models in the literature are deep and computationally expensive in terms of parameters and FLOPs, limiting their deployment on resource-constrained devices, which are increasingly common. This study addresses this gap by proposing a combination of lightweight embedded ConvNet models and ensemble learning techniques. Extensive experiments were conducted to identify best practices in AHCR, considering training hyperparameters, learning strategies, model choices, and ensemble methods. Results show that embedded models can achieve accuracy comparable to, or even surpassing, heavier architectures. Ensemble learning further enhances performance with only modest computational overhead, particularly under challenging training scenarios. Among the ensembling strategies, soft voting yielded the best overall results.
[CV-85] hreats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models
链接: https://arxiv.org/abs/2605.18058
作者: Mohsine EL Khayati,Abdelillah Semma,Abdelaziz Courr,Rachid Elouahbi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026
Abstract:Arabic handwriting recognition (AHR) has made significant progress with deep learning models. AHR research has largely focused on performance, with security receiving little attention. This study provides what appears to be a new line of inquiry by demonstrating the vulnerability of high-performing models to adversarial black-box attacks. The focus on black-box attacks reflects real-world scenarios where the attacker has no prior knowledge of the model architecture. Extensive experiments were conducted on two benchmark AHR datasets containing Arabic handwritten Characters. Results demonstrated the effectiveness of the attacks, with the Pixle attack achieving an attack success rate of 99-100% on most models. Other, less aggressive attacks achieved success rates of 50-96% across most experiments. Despite the higher attack success rate, the attacks maintain the structural integrity of the characters, rendering them almost imperceptible to the human eye. The findings indicate the higher vulnerability of the studied models to adversarial manipulation. This underscores the need to strengthen efforts to secure these models and ensure their reliability in AHR real-world applications.
[CV-86] Efficient 3D Content Reconstruction and Generation
链接: https://arxiv.org/abs/2605.18052
作者: Jiahao Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over prior state-of-the-art by using first-order optimization with fused GPU kernels extensively, while maintaining comparable pose accuracy and downstream novel view synthesis quality.
[CV-87] OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
链接: https://arxiv.org/abs/2605.18041
作者: Morunliu Yang,Ruotao Xu,Le Li,Yue Wang,Jianxin Zhang,Juntao Li,Yihang Lou,Siwei Feng,Peifeng Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose \textbfOmniSelect , a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.
[CV-88] SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
链接: https://arxiv.org/abs/2605.18039
作者: Soyeon Yoon,Chang Wook Seo,Hyunjung Shim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off among prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.
[CV-89] Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory Labels ICIP
链接: https://arxiv.org/abs/2605.18038
作者: Espen Uri Høgstedt,Christian Schellewald,Annette Stahl,Rudolf Mester
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 2026 IEEE International Conference on Image Processing (ICIP)
Abstract:Salmon re-identification in commercial net-pens is challenging due to large populations, which impose strict accuracy requirements and make large-scale labeled data acquisition infeasible. Trajectory IDs can be used as proxy labels, but this introduces trajectory-ID bias. To address these challenges, we propose a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. A key component is the prediction of the salmon’s lateral line, enabling extraction of texture-anchored patches and patch slices. To enable realistic evaluation, we introduce an experimental setup using multiple cameras placed 6 m apart, allowing the same fish to be recorded in different trajectories. This enables the construction of a cross-camera test set through manual match confirmation. Our ensemble approach outperforms the full-image baseline in same-trajectory validation (0.932 to 0.965 mAP) and cross-camera testing (0.609 to 0.860 mAP). The substantial improvements in the cross-camera setting demonstrate improved generalizability and robustness. Code and data: this https URL.
[CV-90] What Matters for Grocery Product Retrieval with Open Source Vision Language Models ICPR2026
链接: https://arxiv.org/abs/2605.18029
作者: Emmanuel G. Maminta,Rowel O. Atienza
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in the 28th International Conference on Pattern Recognition (ICPR 2026)
Abstract:Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. \textbf(1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. \textbf(2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce \textitsemantic power density ( \phi ), an efficiency metric that penalizes sub-threshold accuracy. \textbf(3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at \urlthis https URL.
[CV-91] DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection
链接: https://arxiv.org/abs/2605.18023
作者: Donghong Jiang,Endian Lin,Hanqing Liu,Mingjie Liu,Luoping Cui,Zhao Yang,Chuang Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the iden- tification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine- grained detection tasks involving attributes like color, ma- terial, and texture. We attribute this performance bottle- neck in OVD models to a core issue: when category sig- nals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect bind- ing between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capa- bilities by strengthening attribute semantics at two criti- cal stages. In the text embedding stage, we employ At- tribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further am- plify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encod- ing phase, selectively enhancing the Key and Value vec- tors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with differ- ent attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary mod- els.
[CV-92] nySAM 2: Extreme Memory Compression for Efficient Track Anything Model
链接: https://arxiv.org/abs/2605.18013
作者: Zhaoyuan Ding,Yijing Yang,Han Shu,Xinghao Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures
Abstract:Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the complex computational characteristics of SAM 2’s multi-stage image encoder and memory module have raised the barrier to the model’s deployment in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain high-informative historical frames as the memory. In addition, a joint-spatial-temporal token compression is proposed that reduces the memory storage and computational cost. Specifically, average pooling is employed to first compress redundancy tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on token-level similarity measurement. Besides, we take RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1, with only 7% memory tokens and 3% training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.
[CV-93] SAS: Semantic-aware Sampling for Generative Dataset Distillation
链接: https://arxiv.org/abs/2605.18012
作者: Mingzhuo Li,Guang Li,Linfeng Ye,Jiafeng Mao,Takahiro Ogawa,Konstantinos N. Plataniotis,Miki Haseyama
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published as a journal paper in IEEE OJSP
Abstract:Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.
[CV-94] Functionalization via Structure Completion and Motion Rectification
链接: https://arxiv.org/abs/2605.18010
作者: Mingrui Zhao,Sai Raj Kishore Perla,Kai Wang,Sauradip Nag,Duc Anh Nguyen,Jiayi Peng,Ruiqi Wang,Angel X. Chang,Manolis Savva,Ali Mahdavi-Amiri,Hao Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Acquisition and creation of 3D assets have been largely view- or appearance-driven. As a result, existing digital 3D models often lack the requisite structural components to function as intended, such as joints, supports, interiors, or interaction elements. At the same time, even human-annotated motions are frequently error-prone, leading to physically implausible behavior. We introduce object functionalization, a novel task aimed at transforming visually plausible but non-functional 3D models into functional and physically operable ones. We formulate functionalization as a graph completion problem over a new functional graph representation, where labeled nodes represent object parts, labeled edges encode functional and contact relations, and movable nodes carry motion attributes, so that structural functional deficiencies manifest as missing nodes or incorrect edges. We develop a neural Graph Functionalizer (GraFu) to complete an incomplete graph representing a non-functional 3D object. The completed graph then drives a geometry realization stage that instantiates predicted connectors and structural elements in 3D, with the compelling side effect of rectifying erroneous human-annotated and predicted motions. To support training and evaluation, focusing on furniture as a rich and challenging target category, we introduce FurFun-233, a dataset of 233 paired non-functional and functionalized furniture models. On PartNet-Mobility (“zero-shot”) and HSSD test sets, our method matches state-of-the-art methods in motion prediction accuracy while substantially improving functionality in terms of collision and connectivity.
[CV-95] MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization
链接: https://arxiv.org/abs/2605.17997
作者: Le Su,Xing Luo,Zhi Jin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous this http URL, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization this http URL this work, we analyze that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across this http URL on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each this http URL avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art this http URL will be made publicly available upon acceptance.
[CV-96] Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution
链接: https://arxiv.org/abs/2605.17980
作者: Bin Luo,Runmin Dong,Zhaoyang Luo,Jinxiao Zhang,Jiyao Zhao,Fan Wei,Haohuan Fu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.
[CV-97] Generation Navigator: A State-Aware Agent ic Framework for Image Generation
链接: https://arxiv.org/abs/2605.17969
作者: Jinming Liu,Ruoyu Feng,Yuqi Wang,Wenjun Zeng,Xin Jin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
[CV-98] A More Word-like Image Tokenization for MLLM s
链接: https://arxiv.org/abs/2605.17954
作者: Hyun Lee,Hyemin Jeong,Yejin Kim,Hyungwook Choi,Hyunsoo Cho,Soo Kyung Kim,Joonseok Lee
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at this https URL.
[CV-99] Counting Machine Parts
链接: https://arxiv.org/abs/2605.17952
作者: Benedict Florance Arockiaraj,Elizabeth Dinella,Ankit Billa,Ajay Anand
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Counting objects in an image is a task applicable across many domains. For instance, crowd counting, inventory counting, and cell counting have been the focus of recent research. The major challenges in estimating the count of objects include overlapping objects, object scale issues, occlusions, and varying lighting conditions. In this report, we explore the problem of counting machine washer parts. Our technique is an extension of FamNet with an additional loss component, trained on the given dataset. We compare to three baseline methods: a traditional image processing pipeline, instance segmentation, and density map estimation. We evaluate the performance of these algorithms by computing the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) between the true object counts and the model outputs. Our approach achieves a performance of 1.96 MAE.
[CV-100] SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning
链接: https://arxiv.org/abs/2605.17949
作者: Xiao Yang,Ronghao Fu,Zhiwen Lin,Zhuoran Duan,Jiashun Zhu,Jiasen Hu,Lang Sun,Weipeng Zhang,Jiaqi Liu,Xu Na,Haoran Liu,Weijie Zhang,Bo Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.
[CV-101] SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
链接: https://arxiv.org/abs/2605.17946
作者: Lingtao Mao,Huangyu Dai,Xinyu Sun,Zihan Liang,Ben Chen,Chenyi Lei,Wenwu Ou
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.
[CV-102] UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction
链接: https://arxiv.org/abs/2605.17942
作者: Xiang Yang,Yongli Wang,HaiFeng Li,Yunsheng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 16 figures
Abstract:Feed-forward 3D reconstruction has recently demonstrated strong generalization across diverse scenes, yet its performance in UAV imagery remains underexplored due to distinctive acquisition geometries, large viewpoint variations, and ambiguity between horizontal field of view and flight height. We present UAVFF3D, a geometry-aware benchmark for feed-forward UAV 3D reconstruction, comprising over 170K real UAV images and more than 370K high-quality synthetic images. The benchmark also includes a challenging diagnostic test subset designed to analyze model behavior under UAV-specific geometric this http URL on UAVFF3D, we propose an evaluation protocol that jointly assesses camera-geometry estimation and reconstruction accuracy, addressing limitations of existing evaluations that rely on separate alignments. Experiments on four representative feed-forward reconstruction models show that UAV-domain adaptation substantially improves performance, reducing Ray Error by up to 84.2%, Pose ATE by up to 76.0%, and Chamfer Distance by up to 41.1%. Further analysis reveals that domain adaptation mitigates rotation-estimation degradation in oblique-view scenes and improves robustness under horizontal-field-of-view/height ambiguity. Incorporating camera priors further enhances reconstruction performance under UAV-specific acquisition geometries.
[CV-103] AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
链接: https://arxiv.org/abs/2605.17933
作者: Pan Wang,Yihao Hu,Xiujin Liu,Jingchu Yang,Hang Wang,Zhihao Wen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose \textbfAtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on \textscSokoban, \textscFrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: this https URL
[CV-104] An Efficient Streaming Video Understanding Framework with Agent ic Control
链接: https://arxiv.org/abs/2605.17921
作者: Jinming Liu,Jianguo Huang,Zhaoyang Jia,Jiahao Li,Xiaoyi Zhang,Zongyu Guo,Bin Li,Wenjun Zeng,Yan Lu,Xin Jin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.
[CV-105] Domain Transfer Becomes Identifiable via a Single Alignment
链接: https://arxiv.org/abs/2605.17918
作者: Sagar Shrestha,Subash Timilsina,Hoang-Son Nguyen,Xiao Fu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer – requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.
[CV-106] PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis
链接: https://arxiv.org/abs/2605.17916
作者: Jinrang Jia,Zhenjia Li,Yijiang Hu,Yifeng Shi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17
Abstract:Generating a consistent whole-house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross-view spatial coherence. Pure 2D generators produce appealing single panoramas but re-imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi-room scale. We introduce PanoWorld, a generative spatial world model that treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM designed for metric-scale multi-room 360-degree inputs lifts generated panoramas into local 3DGS updates, while Room-aware Group Attention suppresses cross-room feature interference. A topology-aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell-based geometry guidance from cache-rendered visual memory, PanoWorld preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency. The project link is this https URL
[CV-107] SurgLQA: Scalable Long-Horizon Surgical Video Question Answering MICCAI2026
链接: https://arxiv.org/abs/2605.17915
作者: Diandian Guo,Xikai Yang,Ruiyang Li,Jialun Pei,Pheng-Ann Heng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2026 Early Accept
Abstract:Surgical Video Question Answering (VideoQA) provides a promising paradigm for dynamic intraoperative interpretation, enabling real-time decision support and context-aware retrieval in clinical environments. Nevertheless, existing approaches are predominantly restricted to images or short clips, limiting their ability to model long-range procedural dynamics and causal dependencies across extended surgical workflows. To address this challenge, we propose SurgLQA, a unified long-horizon VideoQA framework for scalable surgical reasoning. This framework incorporates Faithful Temporal Consolidation (FTC), which leverages intrinsic temporal cues to construct compact long-range representations while preserving fine-grained temporal fidelity. Further, we develop Temporally-Grounded Multi-Policy Scaling (TMS), an adaptive test-time inference paradigm that strategically adjusts policy-level reasoning capacity within temporally grounded contexts. To facilitate systematic evaluation, we restructured a long-duration colonoscopy VideoQA benchmark, Colon-LQA, and conducted extensive experiments on Colon-LQA and REAL-Colon-VQA. Experimental results demonstrate that our approach achieves consistent performance gains in long-range reasoning with temporally grounded inference. Code link: this https URL.
[CV-108] WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality Functionality and Platform
链接: https://arxiv.org/abs/2605.17912
作者: Yu Shang,Yinzhou Tang,Yiding Ma,Zhuohang Li,Lei Jin,Weikang Su,Xin Jin,Zhaolu Wang,Ziyou Wang,Xin Zhang,Haisheng Su,Weizhen He,Wei Wu,Haoyi Duan,Gordon Wetzstein,Xihui Liu,Dhruv Shah,Zhaoxiang Zhang,Zhibo Chen,Jun Zhu,Yonghong Tian,Tat-Seng Chua,Wenwu Zhu,Chen Gao,Yong Li
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned future and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 comprehensively evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: this https URL.
[CV-109] One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception ICML2026
链接: https://arxiv.org/abs/2605.17907
作者: Yang Li,Weize Li,Quan Yuan,Congzhang Shao,Guiyang Luo,Yunqi Ba,Xuanhan Zhu,Xinyuan Ding,Xiaoyuan Fu,Jinglin Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:By sharing intermediate features, collaborative perception extends each agent’s sensing beyond standalone limits, but real-world feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaption and protocol-based transformation, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy constraints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world settings, enabling efficient any-to-any translation through a universal model. The code is available at this https URL. Comments: 19 pages, accepted at the 43rd International Conference on Machine Learning (ICML 2026) Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17907 [cs.CV] (or arXiv:2605.17907v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.17907 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-110] Beyond Euclidean Prototypes: Spectral Disentanglement and Geodesic Matching for Few-Shot Medical Image Segmentation
链接: https://arxiv.org/abs/2605.17904
作者: Penghao Jia,Zhiyong Huang,Mingyang Hou,Zhi Yu,Shuai Miao,Jiahong Wang,Yan Yan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-Shot Medical Image Segmentation (FSMIS) aims to delineate novel anatomical targets from one or a few annotated support images, addressing the annotation scarcity in medical imaging. Notwithstanding recent advancements, current prototype-based methods are bottlenecked by two coupled limitations: 1) cue entanglement, where a single spatial-domain prototype is forced to summarise organ silhouette, parenchymal texture and boundary appearance simultaneously, so any support-query mismatch on one cue propagates indiscriminately to the others; and 2) topology-blind matching, where cosine similarity measures distance in the ambient Euclidean space and ignores the connectivity of the underlying feature manifold, causing fragmented activations inside low-contrast organs and leakage into neighbouring tissues. To this end, we propose Spectral-Geodesic Prototype Network (SGP-Net), built around a Spectral-Geodesic Prototype Module with two coupled components. A Spectral Prototype Bank (SPB) decomposes support and query features into low-, mid- and high-frequency bands via learnable radial Fourier filters, yielding three disentangled prototypes per class that separately encode shape, texture and boundary cues. A Geodesic Matcher (GM) then replaces cosine similarity with a differentiable heat-diffusion approximation of geodesic distance, propagating matching signals along a feature affinity graph so that on-manifold pixels accumulate consistent responses while off-manifold look-alikes are suppressed. Experiments on three public FSMIS benchmarks demonstrate that SGP-Net achieves competitive performance against recent state-of-the-art methods.
[CV-111] HexagonalWarriorMamba: Superior Threshold-Dependent Multi-label Classification of 12-Lead ECG Cardiac Abnormalities
链接: https://arxiv.org/abs/2605.17875
作者: Huawei Jiang,Husna Mutahira,Shibo Wei,Jiahang Li,Vladimir Shin,Juneho Yi,Dongryeol Ryu,Wonyoung Park,Mannan Saeed Muhammad
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Scientific Reports
Abstract:The accurate automated diagnosis of cardiac abnormalities from 12-lead electrocardiograms (ECGs) is critical for managing cardiovascular disease. However, detecting concurrent conditions remains a challenge for traditional deep learning models, which often have limited ability to model the long-range dependencies inherent in ECG signals. This manuscript proposes HexagonalWarriorMamba (HWMamba), a framework built on the Mamba architecture that processes 12-lead ECGs as single-channel 2D images rather than conventional 1D time series. By integrating a hierarchical architecture with a 2D Selective Scan mechanism, HWMamba is designed to model global context and complex spatial relationships within the data. The model is evaluated on the PhysioNet/Computing in Cardiology Challenge 2021 dataset, which includes 26 diagnostic labels and comprises recordings collected from seven institutions across four countries and three continents. Results demonstrate that HWMamba outperforms current state-of-the-art (SOTA) methods across five key threshold-dependent metrics, including Challenge Score and Subset Accuracy. These improvements provide a balance between strong discriminative capability and effective threshold selection derived from the training data, while maintaining near-SOTA performance in Macro AUROC. This Hexagonal Warrior performance, reflecting consistent performance across multiple evaluation dimensions, positions HWMamba as a robust and versatile approach for multi-label ECG classification.
[CV-112] PySIFT: GPU-Resident Deterministic SIFT for Deep Learning Vision Pipelines
链接: https://arxiv.org/abs/2605.17869
作者: Sivakumar K.S.,Mohammad Daniyalur Rahman,Gopi Raju Matta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures
Abstract:A widespread assumption in local feature research holds that classical handcrafted descriptors are accuracy-limited relics best replaced by learned alternatives. We show this is wrong. Through an 8-configuration ablation spanning four benchmarks (HPatches, ROxford5K, IMC Phototourism, MegaDepth), we demonstrate that classical SIFT with DSP multi-scale pooling outperforms neural descriptor and orientation replacements (HardNet, OriNet) on every accuracy metric–while running 2–18 \times faster–and that learned matchers (LightGlue) complement rather than supersede classical features. The conclusion reframes a decade of work: not “replace SIFT” but “compose with SIFT,” classical extraction paired with learned matching only where geometric context demands it. This finding was invisible because no prior GPU SIFT kept the complete pipeline in VRAM or offered modularity for controlled classical-vs-learned ablations. We present PySIFT, the first fully GPU-resident SIFT, implemented in CuPy/Numba CUDA kernels with DLPack zero-copy handoff to downstream DL frameworks–submillisecond O(1) metadata swap regardless of keypoint count. On a laptop-grade NVIDIA RTX 3050 (4 GB VRAM), PySIFT achieves: (i) higher Mean Matching Accuracy (MMA) than OpenCV SIFT on HPatches, (ii) 383 ms faster per pair on high-resolution MegaDepth, (iii) higher geometric accuracy on cross-dataset benchmarks (+5.6 pp AUC@10 ^\circ on MegaDepth, more inliers on IMC Phototourism), and (iv) bitwise deterministic output–identical keypoints and descriptors across runs, with detection reproducing identically even across GPU architectures: a guarantee that learned extractors cannot match without significant performance sacrifice, and cannot achieve at all across GPU architectures due to cuDNN’s architecture-dependent algorithm selection. PySIFT is open-source, requiring no C++ compilation.
[CV-113] Imaging Hidden Objects with Consumer LiDAR via Motion Induced Sampling
链接: https://arxiv.org/abs/2605.17865
作者: Siddharth Somasundaram,Aaron Young,Akshat Dave,Adithya Pediredla,Ramesh Raskar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDARs are being increasingly deployed for consumer imaging in handheld, wearable, and robotic applications. These sensors can capture the time-of-flight of light at picosecond resolution, which in principle, enables them to capture information about objects hidden from their field of view. While such non-line-of-sight (NLOS) imaging capabilities have been shown on research-grade LiDARs, they are challenging to achieve on consumer devices due to poor signal quality resulting from low laser power, low spatial resolution, and object and camera motion. Inspired by burst photography and synthetic aperture radar, we propose a multi-frame fusion strategy to overcome these challenges and demonstrate NLOS imaging on consumer LiDAR. We first introduce the motion-induced aperture sampling model to unify the effects of object shape, object motion, and camera motion under a single measurement model. Using this model, we demonstrate several NLOS capabilities on a smartphone-grade LiDAR: (1) 3D reconstruction, (2) single and multi-object tracking, and (3) camera localization using hidden objects. Previously, NLOS imaging capabilities were largely restricted to bulky and expensive research-grade hardware that requires extensive setup and calibration. Our results represent a shift towards plug-and-play NLOS imaging, where anyone can image hidden objects with off-the-shelf hardware ( 100) and no additional setup. We believe that democratization of such capabilities will advance consumer applications of NLOS imaging.
[CV-114] mporal Aware Pruning for Efficient Diffusion-based Video Generation
链接: https://arxiv.org/abs/2605.17837
作者: Sheng Li,Yang Sui,Junhao Ran,Bo Yuan,Yue Dai,Xulong Tang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token-importance across adjacent frames and suppress selection jitter; and (ii) performs token reselection in selected layers to align token pruning with layers’ diverse semantic focus and avoid error accumulation in specific areas; it also (iii) adopt a timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.
[CV-115] Stabilizing Scaling Enhancing MeanFlow for Large-scale Diffusion Distillation
链接: https://arxiv.org/abs/2605.17834
作者: Xiao He,Yang Li,Peizhen Zhang,Songtao Liu,Zhao Zhong,Nannan Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to reduce sampling steps to accelerate inference. Among them, MeanFlow has attracted considerable attention due to its concise formulation and remarkable performance. Nevertheless, the instability of its optimization objective and the ‘‘mean-seeking bias’’ have limited its applicability to distill large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids training collapse caused by the MeanFlow target containing a stop-gradient term from an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the ‘‘mean-seeking bias’’ of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model’s trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework achieves superior performance compared to existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.
[CV-116] CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models
链接: https://arxiv.org/abs/2605.17826
作者: Reem Alzahrani,Hassan Alshanqiti,Bushra Bin Hemid,Zaid Alyafeai,Abdelrahman Eldesokey,Bernard Ghanem
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.
[CV-117] Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding
链接: https://arxiv.org/abs/2605.17823
作者: Shravan Murlidaran,Ziqi Wen,Sana Shehabi,Miguel P. Eckstein
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.
[CV-118] Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection
链接: https://arxiv.org/abs/2605.17822
作者: Yixing Yong,Jian Wang,Ming Lei,Lijun He,Fan Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometry shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack this http URL this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at this https URL.
[CV-119] Evidence-Guided Unknown Rejection for High-Confidence Near-Known Unknowns
链接: https://arxiv.org/abs/2605.17818
作者: Xi Chen(1),Yingjun Xiao(2),Gang Fang(3) ((1) School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China, (2) School of Artificial Intelligence, Guangzhou University, Guangzhou, China, (3) Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures,8 tables
Abstract:Open-set recognition systems face a neglected failure mode: high-confidence near-known unknowns, which lie outside the known label set but are close enough to known classes that a closed-set classifier accepts them with high confidence. We show that this failure is widespread across scalar-threshold methods, including recent post-hoc detectors, and that stronger encoders can amplify rather than remove the risk. We propose EGUR-A, which changes the decision from is this sample's score high enough?'' to does this predicted known class have sufficient evidence to accept this sample?‘’ EGUR-A combines class-conditional local acceptance evidence with global residual evidence, and selects their relative weight from known-sample statistics without unknown validation data. Across CUB, FGVC-Aircraft, and ImageNet-hard, EGUR-A substantially reduces high-confidence false known acceptance at matched known-rejection operating points. The result is not a stronger threshold; it is a different question: whether a known class is entitled to accept a sample.
[CV-120] Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
链接: https://arxiv.org/abs/2605.17807
作者: Baoteng Li,Xianghao Zang,Xinran Wang,Xiangyu Na,Zhixiang He,Hao Sun,Chi Zhang,Zhongjiang He,Tianwei Cao,Kongming Liang,Zhanyu Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model’s current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model’s evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.
[CV-121] Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry
链接: https://arxiv.org/abs/2605.17799
作者: Ningkang Peng,Xuanming Chen,Yanhui Gu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Long-tailed out-of-distribution (LT-OOD) detection is often addressed with specialized training, including auxiliary out-of-distribution (OOD) data, abstention heads, contrastive objectives, energy losses, or gradient-conflict control. We show that these training mechanisms can obscure a simpler issue: frozen long-tailed representations may already contain useful OOD evidence, but raw Mahalanobis distance is distorted by frequency-coupled feature radius and poorly supported tail covariance. We propose Hyperspherical Pooled Mahalanobis (HPM), a post-hoc detector that normalizes features onto the unit sphere and replaces class-specific covariance with a pooled, ridge-regularized metric while keeping class means as semantic anchors. In CIFAR-LT experiments and an ImageNet-100-LT near-OOD boundary analysis, HPM improves raw Mahalanobis scoring; for Prior-Calibrated ERM (PC-ERM), it raises AUROC from 46.49 to 85.67 on CIFAR-10-LT and from 50.40 to 78.35 on CIFAR-100-LT. This simple PC-ERM+HPM pipeline also achieves the best Log Efficiency Score (LES; 3.08) on CIFAR-100-LT, retaining roughly 95% of the best CIFAR-100-LT AUROC observed among the compared post-hoc scores at substantially lower training-time cost. These results argue for evaluating representation quality, detector geometry, and training complexity as separate factors in LT-OOD detection.
[CV-122] When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection
链接: https://arxiv.org/abs/2605.17795
作者: Ningkang Peng,Jingyang Mao,Runhan Zhou,Peirong Ma,Yanhui Gu
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning with noisy labels (LNL) is typically benchmarked by closed-set classification accuracy, yet deployment often requires classifiers to reject out-of-distribution (OOD) inputs. We present a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with standardized near-/far-OOD routing and post-hoc scores across synthetic and real label noise. The benchmark reveals a recurring failure mode: high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training. We term this pathology uncertainty collapse. This structural overlap can make high-accuracy LNL methods lose separability at the ID-error/OOD interface under standard OOD scores. As an intervention, we study Virtual Margin Regularization (VMR), a lightweight repair probe demonstrated mainly with PSSCL that synthesizes boundary virtual outliers on trusted ID batches and widens the energy margin. VMR partially reduces the collapse-induced far-OOD failure without replacing the host objective or sacrificing closed-set accuracy in the tested settings. These results support LNL benchmarks that co-report closed-set generalization, open-world reliability, and structural overlap diagnostics.
[CV-123] Network Knowledge Prior Guided Learning for Data-Efficient Surface Defect Detection
链接: https://arxiv.org/abs/2605.17780
作者: Hang-Cheng Dong,Guodong Liu,Dong Ye,Bingguo Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning-based methods have become the de facto standard for industrial defect detection. However, their data-hungry nature and inherent “black-box” characteristics often lead to performance bottlenecks and limited trustworthiness in real-world applications. To address these challenges, this paper proposes a novel knowledge-guided loss function that seamlessly integrates model interpretability into the training process without incurring any additional inference cost. Our method operates in two phases: first, a primary classification network is trained, and its explanations, in the form of saliency maps, are generated as prior knowledge. Second, a multi-task learning framework is established, where the main task performs classification, and an auxiliary task imposes consistency between the saliency maps of the final model and the primary model. This consistency is enforced by a dedicated knowledge-guided loss term, effectively acting as a powerful regularizer to steer the model towards robust feature representations. Extensive experiments on multiple public defect datasets demonstrate that our approach consistently enhances the performance of baseline models in terms of accuracy and AP. Moreover, visual analysis reveals that the proposed method yields more concentrated and human-intelligible saliency maps. This work presents a simple yet effective paradigm for bridging the gap between model performance and interpretability, paving the way for more reliable and high-performing vision systems in industrial quality inspection.
[CV-124] Efficient Sparse-to-Dense Visual Localization via Compact Gaussian Scene Representation and Accelerated Dense Pose Estimation
链接: https://arxiv.org/abs/2605.17777
作者: Zizhuo Li,Songchu Deng,Linfeng Tang,Jiayi Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CAA JAS 2026
Abstract:This letter presents LiteLoc, a novel and efficient localizer built on 3D Gaussian Splatting (3DGS). The previous state-of-the-art (SoTA) sparse-to-dense localizer, STDLoc, has shown remarkable localization capability but suffers from severe storage redundancy and computational latency. By revisiting its design decisions, we derive two simple yet highly effective improvements that cumulatively make LiteLoc much more efficient in both memory and computation, while also being easier to train. One key observation is that the color field, inherited directly from Feature 3DGS, is functionally useless for localization. Yet, its reconstruction of high-frequency photometric details necessitates excessive Gaussian primitives, resulting in a tightly coupled color-feature representation with significant memory overhead and sub-optimal feature field optimization. To resolve this, we propose a color-free decoupled feature field that constructs a compact Gaussian scene representation by retaining only task-essential feature attributes, thereby eliminating approximately 94% of redundant storage with no loss of localization-relevant information. We further find that the primary computational bottleneck lies in the dense Perspective-n-Point (PnP) solver, where most matches contribute saturated geometric constraints with diminishing accuracy gains. Accordingly, we propose a condensing strategy that distills dense matches into a subset of 5% representative matches, enabling a nearly 19-fold speedup in robust estimation with negligible performance drop. Extensive experiments show that LiteLoc surpasses STDLoc in multiple scenes with considerable efficiency benefits, opening up exciting prospects for latency-sensitive visual localization.
[CV-125] PlantPose: Universal Plant Skeleton Estimation via Tree-constrained Graph Generation
链接: https://arxiv.org/abs/2605.17773
作者: Xinpeng Liu,Hiroaki Santo,Yosuke Toda,Fumio Okura
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Journal of Computer Vision, 2026
Abstract:Accurate estimation of plant skeletal structures (e.g., branching structures) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. To address this problem, we introduce PlantPose, a universal plant skeleton estimator via tree-constrained graph generation. PlantPose combines learning-based graph generation with traditional graph algorithms to enforce tree constraints during the training loop. To enhance the model’s generalization capability, we curate a large and diverse dataset comprising real-world and synthetic plant images, along with simplified representations (e.g., sketches and abstract drawings). This dataset enables the generalized model to adapt to diverse input styles and categories of plant images while preserving topological consistency. Our approach demonstrates robust and accurate plant skeleton estimation across multiple domains, including previously unseen out-of-domain scenarios. Further analyses highlight the method’s strengths and limitations in handling complex, heterogeneous data distributions. All implementations and datasets are available at this https URL.
[CV-126] owards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework
链接: https://arxiv.org/abs/2605.17772
作者: Ziyang Liu,Hongyuan Wang,Zijian Wang,Yinxi Lu,Yunzhao Zang,Zhiqiang Yan,Qianhao Ning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Physical adversarial attacks often overfit single surrogate models and optimization objectives. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, significantly degrading cross-model transferability. To bridge this gap, this paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual-level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross-model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real-world experiments demonstrate that JMOF outperforms state-of-the-art baselines against diverse black-box detectors. Crucially, JMOF exhibits substantial cross-vision-task generalization, generating attacks capable of simultaneously deceiving object detection and semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real-world deployments.
[CV-127] LatentUMM: Dual Latent Alignment for Unified Multimodal Models
链接: https://arxiv.org/abs/2605.17766
作者: Yinyi Luo,Wenwen Wang,Hayes Bai,Marios Savvides,Jindong Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: this https URL.
[CV-128] FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
链接: https://arxiv.org/abs/2605.17759
作者: Lichen Ma,Zipeng Guo,Yu He,Xiaolong Fu,Luohang Liu,Jingling Fu,Junshi Huang,Yan Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To circumvent the inherent fidelity bottlenecks and optimization misalignment of VAE-based latent diffusion, pixel-space diffusion models have emerged as a compelling end-to-end paradigm. However, existing pixel diffusion models often struggle to balance computational efficiency with the preservation of high-frequency details. They frequently resort to patch-based compression or restricted local decoding, leading to a “spectral compromise” where high-frequency and fine-grained pixel information are suppressed. To address these challenges, we propose \textbfFrequencyBooster, a novel framework designed to empower pixel diffusion with full-frequency modeling capabilities without prohibitive overhead. The core of our method is a high-capacity decoder that specializes in extracting exhaustive high-frequency details and low-frequency semantics, the latter of which is derived from a Diffusion Transformer (DiT) backbone. Unlike prior works that sacrifice global context for local refinement, FrequencyBooster leverages high-dimensional feature representations to maintain global structural integrity while achieving superior pixel-level precision. Extensive experiments on ImageNet demonstrate the effectiveness of our approach: our model achieves a state-of-the-art FID of \textbf1.60 at 256 \times 256 resolution within only 320 epochs. Furthermore, at 512 \times 512 resolution, FrequencyBooster attains an FID of \textbf1.69, significantly outperforming existing pixel-space and latent-space generative models.
[CV-129] Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction
链接: https://arxiv.org/abs/2605.17748
作者: Yu Li,Puchao Zhou,Yachun Mi,Yanfeng Wu,Xiaoming Wang,Shaohui Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.
[CV-130] MoASE: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation
链接: https://arxiv.org/abs/2605.17743
作者: Ronyu Zhang,Aosong Cheng,Gaole Dai,Yulin Luo,Jiaming Liu,Li Du,Huanrui Yang,Dan Wang,Leyuan Fang,Yuan Du,Shanghang Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes-ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.
[CV-131] Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis
链接: https://arxiv.org/abs/2605.17729
作者: Danu Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in Korea Software Congress (2025)
Abstract:Deep learning models achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.
[CV-132] GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
链接: https://arxiv.org/abs/2605.17727
作者: Zesheng Li,Chengchang Pan,Honggang Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose \textbfGraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a \textbfSemantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below 10^-6 . It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.
[CV-133] Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation
链接: https://arxiv.org/abs/2605.17719
作者: Diego Adame,Fabian Vazquez,Jose A. Nunez,Huimin Li,Jinghao Yang,Erik Enriquez,DongChul Kim,Haoteng Tang,Bin Fu,Pengfei Gu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:CNN- and Transformer-based architectures have achieved strong performance in medical image segmentation, but CNNs are limited in modeling long-range dependencies, while Transformers often suffer from quadratic computational and memory complexity. State space models, especially Mamba-based networks, offer an efficient alternative with linear sequence complexity. However, existing Mamba segmentation models still face two limitations: pixel-wise directional scanning can disrupt local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt well to diverse object sizes, shapes, and boundaries. To address these issues, we propose \textitPatch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism that preserves local spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs using four directional experts, a learnable concatenation expert, and residual directional aggregation. Experiments on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of Patch-MoE Mamba.
[CV-134] Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision
链接: https://arxiv.org/abs/2605.17686
作者: Mohamad Yazan Sadoun,Sarah Sharif,Yaser Mike Banad
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute demands. This work introduces three local spike-timing-dependent plasticity (STDP) modules, including sequence, candidate, and tube-reliability modules, that operate on a single CPU thread without GPU support. On the FRED drone benchmark, the proposed framework spans three label-efficient supervision tiers. A strict zero-label detector achieves 53.8% mAP@30, approximately 26 train-derived bits achieve 76.9% mAP@30, and an STDP candidate-reliability gate achieves 78.60 +/- 0.42% mAP@30. Under acquisition-order drift, the cohort gate outperforms streaming k-means by 2.03 +/- 0.58 percentage points across 20 of 20 positive trials, while a no-drift control falsifies the effect. STDP reduces single-model variance by 6.6 times, and one trained gate matches a 44-seed ensemble bound. The gate transfers to Intel Lava with 89% top-2 agreement. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd = 88%. Dense gradient-trained detectors cannot provide this combination of gradient training, dense matrix multiplication, and local plasticity-free operation by construction.
[CV-135] Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition
链接: https://arxiv.org/abs/2605.17685
作者: Arioua,Islameddine,Benzaoui,Amir,Zeroual,Abdelhafid,Houam,Lotfi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注:
Abstract:Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Signal Processing (eess.SP); Systems and Control (eess.SY) Cite as: arXiv:2605.17685 [cs.CV] (or arXiv:2605.17685v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.17685 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Digital Signal Processing 2026 Related DOI: https://doi.org/10.1016/j.dsp.2026.106252 Focus to learn more DOI(s) linking to related resources
[CV-136] GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning
链接: https://arxiv.org/abs/2605.17682
作者: Cheng Chen,Hao Huang,Saurabh Bagchi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive’s evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.
[CV-137] A simple approach for biometrics: Finger-knuckle prints recognition based on a Sobel filter and similarity measures
链接: https://arxiv.org/abs/2605.17673
作者: E. O. Rodrigues,T. M. Porcino,Aura Conci,Aristofanes C. Silva
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The objective of this work is to propose a novel methodology for the finger knuckle print recognition, which is essentially a digital photo of the finger-knuckle region. We have employed very simple concepts of visual computing such as a filter based on the Sobel operator for finding edges and a simple noise reduction algorithm. These operations are exceptionally fast and produce binary images, which are very efficient to process and to store. Furthermore, alongside this preprocessing, some similarity measures were also regarded and evaluated for the task. After preprocessing an input finger it is compared to all the images of fingers in the dataset, one by one. We have obtained up to 17.02% of successful recognitions (true positive rate) with a large dataset.
[CV-138] Deep learning-based compression of giga-resolution whole slide images
链接: https://arxiv.org/abs/2605.17668
作者: Maren Høibø,Etienne Gaucher,Ingerid Reinertsen,Marit Valla,Erik Smistad
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implementation of digital pathology leads to an increased number of whole slide images (WSIs). The large size of WSIs is challenging. Today, WSIs are compressed with codecs like JPEG resulting in several gigabytes per WSI, and large amounts of space are wasted storing glass. In this study, deep learning-based tissue segmentation for glass removal, and deep learning compression methods were explored and compared with JPEG, JPEG-2000 and JPEG-XL. Image pyramids (N=21) with intact glass, glass replaced by single-colored pixels, and glass replaced by zero-byte tiles were created and compressed with JPEG, JPEG-XL and a deep learning model. Additionally, several compression models were evaluated on a tissue patch dataset and compared with JPEG, JPEG-2000 and JPEG-XL. Removing glass reduced file sizes considerably for JPEG and JPEG-XL. Deep learning-based image compression reduced the WSI size by 43-72% compared to JPEG compression, whereas deep learning-based glass removal reduced the WSI size by 0.3-33%, and 6-62% using only single-colored pixels and removing all-glass tiles, respectively. Combining the two gave a small improvement to a 44-80% total size reduction which indicates that deep learning-based image compression is able to efficiently compress glass tiles, whereas JPEG is not. On the tissue patch dataset, the best deep learning-based compression models saved on average ~35-40% per patch compared to JPEG, while keeping an average SSIM above 0.95, whereas JPEG-XL and JPEG-2000 saved 17% and 14%, respectively while keeping an SSIM of 0.96. However, the deep learning models had higher decompression times than JPEG and JPEG-XL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.17668 [cs.CV] (or arXiv:2605.17668v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.17668 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-139] Mono-Hydra: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping
链接: https://arxiv.org/abs/2605.17661
作者: U. V. B. L. Udugama,George Vosselman,Francesco Nex
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. 50 pages, figures and tables included. Code: this https URL
Abstract:Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object level and room level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real time monocular RGB plus IMU pipeline for indoor metric semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3 based multi-task model for depth and semantics, with a deep feature visual inertial odometry front end, sparse predicted depth constraints in the VIO derived pose graph, semantic masking for dynamic regions, and pose aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real time metric semantic mapping and scene graph construction for resource constrained robotic platforms without relying on active depth sensors.
[CV-140] ouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts
链接: https://arxiv.org/abs/2605.17638
作者: Sophokles Ktistakis,Rui Wang,Bastian Grande,Hugo Sax
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.
[CV-141] SparseSAM: Structured Sparsification of Activations in Segment Anything Models
链接: https://arxiv.org/abs/2605.17633
作者: Hoai-Chau Tran,Chi H. Nguyen,Duy M. H. Nguyen,Mathias Niepert,Fan Lai,Khoa D. Doan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the token length to process, yet introduce non-trivial runtime overhead and encounter catastrophic quality drop under high compression. Other methods applying Sparse Attention focus on attention alone, leaving the MLP fully dense and capping achievable speedup. We propose SparseSAM, a (i) training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces a (iii) Residual-Consistency MLP that routes only informative tokens through the MLP while propagating remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a 0.4 density and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus token merging advances, while achieving 2x faster inference and 2.8x memory reduction.
[CV-142] SegRAG : Training-Free Retrieval-Augmented Semantic Segmentation
链接: https://arxiv.org/abs/2605.17630
作者: Abderrahmene Boudiaf,Irfan Hussain,Sajid Javed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Here’s a trimmed version under 1920 characters: Open-vocabulary segmentation models such as SAM3 achieve strong performance through concept-level text prompting, yet degrade when the target class is visually underrepresented in pretraining data or when its appearance departs from canonical depictions. Text prompts provide no spatial signal to resolve such ambiguity. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with spatially precise, class-specific point prompts derived from a curated DINOv3 feature bank. During an offline stage, patch-level descriptors are extracted from annotated reference images using a frozen DINOv3 ViT-L/16 backbone and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape between the query image and retrieved prototypes, identifies spatially coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. These point prompts are delivered to SAM3 alongside the class-name text in a single joint grounding pass, enabling the mask decoder to resolve semantic intent and spatial evidence together. SegRAG requires no task-specific training and no synthetic data. On four open-vocabulary benchmarks it achieves consistent gains over the SAM3 text-only baseline, with improvements of up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks representing a zero-shot domain transfer setting, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablation studies confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.17630 [cs.CV] (or arXiv:2605.17630v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.17630 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-143] Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning ATC
链接: https://arxiv.org/abs/2605.17624
作者: Miquel Martí i Rabadán,Alessandro Pieropan,Hossein Azizpour,Atsuto Maki
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL
Abstract:We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
[CV-144] SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing
链接: https://arxiv.org/abs/2605.17620
作者: Marten J. Finck,Niklas C. Koser,Sarker M. Mahfuz,Tameem Jahangir,Jon E. Wilhelm,Daniel Behme,Naomi Larsen,Wojtek Palubicki,Sylvia Saalfeld,Sören Pirk
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.
[CV-145] AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
链接: https://arxiv.org/abs/2605.17602
作者: Kuei-Chun Kao,Daixuan Huo,Yuanhao Ban,Cho-Jui Hsieh
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 27 pages
Abstract:Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ a \ell_1 -Regularized Logistic Regression Refiner, which selects the Top- N most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.
[CV-146] Error-Decomposed Class-Conditional Fusion for Statistically Guaranteed Hard-Category Robust Perception
链接: https://arxiv.org/abs/2605.17591
作者: Guowei Luo(1),Ziqi Shi(2),Zhao Xie(1) ((1) Hefei University of Technology, Hefei, China, (2) Lishui University, Lishui, China)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures. Preprint
Abstract:Aggregate object detection metrics inherently mask catastrophic and repeatable failures in operationally critical, long-tail minority classes. This paper formally defines this pervasive vulnerability as the Hard-Category Reliability Problem (HCRP): the fundamental architectural challenge of strictly rectifying vulnerable categories without compromising the performance boundaries of stable classes under stringent protocols. To systematically dismantle this limitation, we propose Error-Decomposed Class-Conditional Fusion (ED-CCF), an elegant decision-layer inference framework. Diverging from heuristic global post-processing, ED-CCF projects predictions into a sophisticated quad-state error taxonomy, dynamically activating calibration pathways exclusively upon rigorous empirical justification. On a highly constrained 600-image validation benchmark, isolating cz as the critical vulnerability (HCEC=0.86, BSR=0.14), our framework achieves a targeted breakthrough: it elevates cz mAP50 from 0.089343 to 0.109353 (a massive +22.4% relative surge) while flawlessly preserving the Pareto optimality of global stability (raising all mAP50 from 0.581925 to 0.584864). Backed by exhaustive validation across 50 paired subset trials demonstrating an overwhelming 96% win rate and strict Bonferroni-corrected Wilcoxon significance (p0.05), this work fundamentally redefines output-level fusion as an auditable, statistically guaranteed paradigm for safety-critical visual perception.
[CV-147] MSIQ: Moment-based Scale-Invariant Quality Measure for Single Image Super-Resolution
链接: https://arxiv.org/abs/2605.17588
作者: Leonid Bedratyuk
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages
Abstract:Assessing the quality of single image super-resolution (SISR) results remains an open methodological problem. Common full-reference metrics (PSNR, SSIM, LPIPS) do not explicitly evaluate the preservation of the geometric structure of images, which is critical for the correctness of scale-based reconstruction. In addition, they require the forced alignment of images to the same size (\textitforced resizing), which introduces an external interpolation error into the evaluation process. This paper proposes a diagnostic scale-invariant quality measure, MSIQ (\textitMoment-based Scale-Invariant Quality), based on the comparison of normalized central geometric moments of two images. MSIQ enables direct comparison of images with different spatial resolutions without resizing, is mathematically deterministic (\textitmodel-free), and has an analytical form. To provide a theoretical basis for the approach, we introduce a conceptual distinction between the ability of metrics to monotonically track degradation (\textittracking ability) and their geometric selectivity (\textitgeometric specificity). The experimental validation confirmed the stability of MSIQ under uniform scaling and, at the same time, revealed the high sensitivity of traditional metrics to the choice of interpolation method. The results show that MSIQ has pronounced geometric selectivity: the proposed measure effectively separates geometric deformations from non-geometric artifacts, in particular JPEG compression, unlike pixel-based and perceptual metrics. It is also shown that the response of MSIQ to structural perturbations remains stable across different classes of SR algorithms, including DNN models with different architectures. The proposed measure is a complementary diagnostic tool for domains where geometric fidelity has priority, in particular medical imaging and remote sensing. Comments: 23 pages Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.17588 [cs.CV] (or arXiv:2605.17588v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.17588 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-148] VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos CVPR
链接: https://arxiv.org/abs/2605.17584
作者: Zhijing Lu,Khurram Azeem Hashmi,Didier Stricker,Muhammad Zeshan Afzal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 figures, cvpr workshop
Abstract:Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering this http URL propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporarily stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. Meanwhile, VitCut uses a distillation decoder to achieve effective instance mask prediction. Then, based on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel-level video understanding.
[CV-149] AgentS teerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech
链接: https://arxiv.org/abs/2605.17583
作者: Bin Kang,Shaoguo Wen,Yang Fan,Shunlong Wu,Junjie Wang,Yulin Li,Junzhi Zhao,Junle Wang,Zhuotao Tian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements to the baselines, demonstrating the effectiveness of the proposed method.
[CV-150] AME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
链接: https://arxiv.org/abs/2605.17577
作者: Xin Wang,Yixu Wang,Jiaming Zhang,Ruofan Wang,Jiaqi Yu,Kai Chen,Jingjing Chen,Xingjun Ma,Yu-Gang Jiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT’s single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
[CV-151] Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks
链接: https://arxiv.org/abs/2605.17573
作者: Mohammadreza Rashidi,Raja Hashim Ali,Sami Ur Rahman
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 13 pages, 6 figures
Abstract:Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.
[CV-152] Stable Routing for Mixture-of-Experts in Class-Incremental Learning
链接: https://arxiv.org/abs/2605.17571
作者: Zirui Guo,Quan Cheng,Da-Wei Zhou,Lijun Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Class-incremental learning (CIL) requires models to learn new classes sequentially while preserving prior knowledge. Recently, approaches that combine pre-trained models with mixture-of-experts (MoE) have received increasing attention in CIL: they typically expand experts during learning and employ a router to assign weights across experts. However, existing MoE methods often overlook routing drift induced by expert expansion. Once new experts are introduced, the router may reassign samples from earlier classes to newly added experts, thereby perturbing previously established expert compositions and causing interference even when old experts remain frozen. We argue that expandable MoE in CIL requires two complementary properties: stable old-class routing for knowledge preservation and sufficient capacity utilization for new-class adaptation. To this end, we propose Stable Routing for MoE (StaR-MoE), a routing-level framework for expandable MoE in CIL. By incorporating sensitivity-aware routing alignment, StaR-MoE aligns current old-class routing behavior with historical routing distributions through sensitivity-guided constraints. Complementarily, StaR-MoE introduces asymmetric capacity regularization to encourage effective utilization of the expanded expert pool without compromising class-specific routing specialization. Extensive experiments across four standard CIL benchmarks demonstrate that StaR-MoE consistently improves both average and last accuracy over state-of-the-art methods, highlighting the importance of stable routing.
[CV-153] Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework
链接: https://arxiv.org/abs/2605.17566
作者: Yumeng Yao,Jingzhi Dong,Haowen Gu,Tao Chen,Zonghan Wu,Xiaoshui Huang,Yazhou Yao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures. Code will be released upon acceptance
Abstract:With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8%(+0.5%), 92.6%(+0.3%), and 89.3%(+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0%(+0.1%) in this http URL on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new 3D perspective on foundation-style predictive learning for 3D data.
[CV-154] A Conditional U-Net Pipeline with Pre- and Post-Processing for Aerial RGB-to-Thermal Image Translation NEURIPS2026
链接: https://arxiv.org/abs/2605.17564
作者: Tseten Sherpa,Sikandar Ali,Shubham Parab,Haoyun Feng,Matthew Dennis,Keenan Gibbons,Verrah Otiende,Geoffrey H. Siwo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, NeurIPS 2026
Abstract:Paired RGB-thermal data has shown significant utility across a range of applications, including image fusion, object tracking, and anomaly detection; however, its broader adoption is constrained by the limited availability of aligned RGB-thermal image pairs. RGB-to-thermal (and vice versa) image translation has emerged as a practical solution to this challenge. Prior approaches including conditional generative adversarial networks (cGANs) such as ThermalGAN and Scalable Interpolant Transformer (SiT)-based architectures such as ThermalGen have demonstrated strong potential for aerial-to-thermal image translation. In this work, we explore alternative architectures that prioritize simplicity while maintaining performance. Specifically, we propose a conditional U-Net that incorporates weather data at the bottleneck layer, complemented by targeted preprocessing and post-processing techniques applied within the Pix2Pix GAN architecture. We utilize a training set of 612 paired RGB and thermal images, and evaluate over 5-fold cross-validation, ultimately testing on a held-out test set. Our conditional U-Net model performed best, with a peak signal-to-noise ratio (PSNR) of 14.5485, structural similarity index measure (SSIM) of 0.8095, and learned perceptual image patch similarity (LPIPS) of 0.1666. These results outperformed the base ThermalGen model, which attained PSNR, SSIM, and LPIPS scores of 7.56, 0.2444, and 0.6317 respectively. We find that while saturation boost and contrast enhancement for preprocessing and Gaussian blur for post-processing provide observable improvements, the incorporation of conditioning data was most effective. Our findings cement the potential of integrating auxiliary metadata into thermal image generation, suggesting that such information can serve as a proxy for environmental conditions critical to accurate thermal reconstruction.
[CV-155] Real-Time Neural Hair Denoising
链接: https://arxiv.org/abs/2605.17557
作者: Chenghao Wu,Yuefan Shen,Tao Huang,Kai Yan,Zahra Montazeri,Kui Wu
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a lightweight real-time method for reconstructing strand-based hair G-Buffers from severely undersampled rasterized inputs. Our pipeline first applies neural spatial reconstruction and temporal accumulation to recover hair coverage, i.e., fractional hair visibility within a pixel, and tangent. It then uses a tangent-guided reconstruction step to complete the position, which is subsequently used for physically based deferred hair shading. We evaluate our method across a diverse set of hairstyles, including straight, wavy, afro, and ponytail styles, under both static and dynamic scenarios. Our method achieves higher hair reconstruction quality than existing hair-specific denoising techniques and general industrial neural reconstruction solutions such as DLSS and FSR.
[CV-156] PFlow-T: A Persistence-Driven Forward Process for Topology-Controlled Generation
链接: https://arxiv.org/abs/2605.17555
作者: Snigdha Chandan Khilar
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current topology aware diffusion models face an architectural mismatch by using Gaussian noise for corruption while recovering structural features through conditional side channels To fix this we introduce PFlow T a generative model that bases its forward process entirely on persistent homology In PFlow T time measures the destruction of H1 topological features like holes rather than Gaussian noise injection This forward process eliminates features based on their persistence The reverse network then directly inverts this structured corruption to predict the clean state in one step Tests on MNIST digits zero one and eight show PFlow T significantly outperforms a baseline model in generating requested Betti numbers and handling out of distribution tasks PFlow T is the first generative architecture using persistent homology for the forward process although we note it is currently limited to low resolution pixel space proxies
[CV-157] HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos
链接: https://arxiv.org/abs/2605.17543
作者: Jeongeun Park,Janghyeok Han,Geonung Kim,Hyun-Seung Lee,Kyuha Choi,Youngseok Han,Sunghyun Cho
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Supplementary material and video included
Abstract:Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.
[CV-158] textitDont Guess Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
链接: https://arxiv.org/abs/2605.17531
作者: Yuting Yang,Haichao Jiang,Tianming Liang,Quan Zhang,Jian-Fang Hu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose \textbfIC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce \textbfHi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish \textbfAmbi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at \urlthis https URL.
[CV-159] Designing streetscapes from street-view imagery using diffusion models
链接: https://arxiv.org/abs/2605.17527
作者: Yuzhou Chen,Yuebing Liang,Lingqian Hu,Kailai Sun,Qingqi Song,Chang Zhao,Shenhao Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Street-view imagery (SVI) is widely used to quantify key indicators of urban environment, such as green- ery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a gener- ative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency increases by 23.7% in Orlando and 46.4% in Chicago, as measured by the mIoU index, with class-wise gains exceeding even 100% improvement for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation us- ing SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.
[CV-160] Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration
链接: https://arxiv.org/abs/2605.17506
作者: Xinghua Huang,Zhixiong Yang,Chen Wu,Shengxi Li,Shuaifeng Zhi,Yue Zhang,Qibin Hou,Xin Deng,Jingyuan Xia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in degraded-to-clean mapping, rather than as an explicit object that can be measured and manipulated. This limitation becomes more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled and hard-to-describe degradation effects into a measurable degradation coordinate space. Moreover, DFC can be adaptively decomposed into band-wise spectral tokens, allowing local degradation responses to be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, leading to state-of-the-art performance and improved generalization under complex degradation profiles.
[CV-161] A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle
链接: https://arxiv.org/abs/2605.17504
作者: Guancheng Zhou,Yisi Luo,Zhengfu He,Zhenyu Jin,Xuyang Ge,Wentao Shu,Deyu Meng,Xipeng Qiu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top- K activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.
[CV-162] he Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation
链接: https://arxiv.org/abs/2605.17500
作者: Ninad Joshi,Ashutosh Ranjan,Vivek Srivastava,Shirish Karande
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative text-to-image models are typically trained on large-scale web-scraped datasets that include diverse visual content such as copyrighted and stylistically distinctive artworks, raising concerns about ownership, attribution, and the unintended reuse of protected visual expressions. A key issue is that models can learn stylistic patterns from this data and reproduce them in generated outputs without any explicit reference in the prompt. We refer to this phenomenon as The Silent Brush, where such learned styles reappear even when they are not requested. Existing evaluation methods mainly focus on near-duplicate retrieval or membership inference and do not account for this form of unintended stylistic resurfacing across prompts. To address these gaps, we first formulate guiding principles for evaluation of The Silent Brush. We then introduce Art Arena, an evaluation protocol that measures how strongly artworks are encoded, how they interact, and how frequently their stylistic traits reappear in generated outputs without explicit mention in prompts. We evaluate Art Arena on widely used text-to-image diffusion models, including Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and SANA-1.5, and design it to generalize across text-to-image generative systems. Our results show that The Silent Brush arises from differences in representational strength and interaction dynamics between artworks, leading to asymmetric blending in model generations. Code and evaluation resources are available at: this https URL.
[CV-163] Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE
链接: https://arxiv.org/abs/2605.17493
作者: Minjong Cheon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.
[CV-164] Employing Vision-Language Models for Face Image Quality Assessment
链接: https://arxiv.org/abs/2605.17489
作者: Erdi Sarıtaş,Eren Onaran,Vitomir Štruc,Hazım Kemal Ekenel
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as “black boxes.” They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyzed their interpretability, consistency, and robustness to prompt changes. Our results show biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs’ outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The codes are available at this https URL.
[CV-165] Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
链接: https://arxiv.org/abs/2605.17488
作者: Yuheng Chen,Qingdong He,Teng Hu,Yuji Wang,Yabiao Wang,Lizhuang Ma,Jiangning Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe “speech leakage” problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.
[CV-166] On Applicability of Synthetic Datasets for Facial Expression Recognition
链接: https://arxiv.org/abs/2605.17483
作者: Ali Azmoudeh,Erdi Sarıtaş,Ömer Yıldırım,Hazım Kemal Ekenel
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial Expression Recognition faces two core challenges. The first is class imbalance in public datasets, which skews the learning process and weakens generalization. The second is related to privacy and data collection constraints, which limit the sharing of facial images and restrict the creation of large, balanced datasets. To address these issues, we examine three complementary strategies for constructing privacy-preserving FER datasets in the standard seven discrete facial expression classes setting. Our strategies are: (i) pseudo-labeling large unlabeled face collections with a teacher model under a confidence-thresholding scheme, (ii) prompt-driven synthesis using diffusion models conditioned on demographic attributes, and (iii) task-aware GAN-based expression editing that modifies facial expression while preserving identity and realism. For training and evaluation, we employed widely adopted datasets, including AffectNet, RAF-DB, and FER2013. We utilized the synthetic datasets DigiFace, DCFace, and EmoNet-Face BIG as unlabeled sources for pseudo-labeling. Additionally, we utilized the FFHQ dataset as the source for generative synthesis. The main experiments are conducted using a classic CNN backbone, IR50, and we also explore a more complex architecture, POSTERv1, to assess its feasibility and robustness. Using cross-dataset evaluations, we analyze the trade-offs each strategy presents in curated datasets. The findings demonstrate how synthetic data can effectively substitute or be combined with real datasets to mitigate imbalance and privacy limitations. Code and generated datasets:this https URL
[CV-167] Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
链接: https://arxiv.org/abs/2605.17478
作者: Tianchen Deng,Zhenxiang Xiong,Nailin Wang,Fangjinhua Wang,Jiuming Liu,Jianfei Yang,Hesheng Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.
[CV-168] Weighted Reverse Convolution for Feature Upsampling
链接: https://arxiv.org/abs/2605.17472
作者: Wentong Li,Zhiyuan Qi,Zichen Zhao,Kai Zhang,Lei Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 7figures
Abstract:Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of \textbf\textitinverse problem and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
[CV-169] EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution
链接: https://arxiv.org/abs/2605.17470
作者: Hanli Zhao,Binhao Wang,Shihao Zhao,Tao Wang,Kaihao Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: Accepted by Information Fusion; 20 pages, 17 figures
Abstract:Image super-resolution (SR) aims to reconstruct high-quality, high-resolution (HR) images from low-resolution (LR) inputs and plays a critical role in various downstream applications. Despite recent advancements, balancing reconstruction fidelity and computational efficiency remains a fundamental challenge, particularly in resource-constrained scenarios. While existing lightweight methods attempt to expand receptive fields, many of them either incur substantial computational overhead, naively scale up kernel sizes, or lack mechanisms for coherent multi-scale integration, limiting their overall effectiveness and scalability. To address these limitations, we propose EchoSR, an efficient context-harnessing framework for lightweight image super-resolution, which unifies multi-scale receptive field modeling and hierarchical context fusion. EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism. Extensive experiments have shown that EchoSR consistently outperforms state-of-the-art lightweight super-resolution methods across multiple benchmarks, while also achieving a faster speed (\sim 2\times) . The source code is available at \urlthis https URL.
[CV-170] GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging
链接: https://arxiv.org/abs/2605.17456
作者: Xiangyu Li,Ran Su
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 17 figures, 24 table
Abstract:Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models ag gregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classi fication, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria- Sufficiency, Necessity, and Recov erability (S/N/R)- rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5 faster while retaining 0.989 full-bag utility.
[CV-171] DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking
链接: https://arxiv.org/abs/2605.17451
作者: Guyue Hu,Haoming Liu,Siyuan Song,Chenglong Li,Feng Chen,Jin Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.
[CV-172] Spatial Blindness in Whole-Slide Multiple Instance Learning
链接: https://arxiv.org/abs/2605.17449
作者: Xiangyu Li,Ran Su
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages, 8 figures, 16 tables
Abstract:Whole-slide MIL models are often called context-aware once graphs, Transform ers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI bench marks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger lo calization evidence on CAMELYON-16.
[CV-173] VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation MICCAI2026
链接: https://arxiv.org/abs/2605.17433
作者: Zhipeng Deng,Jiale Zhou,Wenhan Jiang,Haolin Wang,Xun Lin,Yafei Ou,Yefeng Zheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI2026 early accept
Abstract:Deploying multi-sequence magnetic resonance imaging (MRI) segmentation models to new clinical environments is challenging due to variations in scanners and acquisition protocols. Although existing TTA methods handle basic per-modality shifts, they often fail under a fundamental dual-shift problem, as their adaptation signals fail to capture modality-interaction shifts that disrupt inter-sequence consistency. To address this, we propose Variance-gated Inter-Sequence Test-time Adaptation (VISTA), a source-free framework that tackles modality-interaction shifts. First, we design an Inter-Sequence Intervention Generator (ISIG) that generates a set of consistency probes by swapping low-frequency spectra and entropy-localized patches across sequences, preserving anatomical semantics while challenging inter-sequence dependencies. Second, we introduce Cross-View Disagreement-Aware Pseudo Labeling (CDPL), which establishes a voxel-wise reliability metric using cross-view disagreement variance to dynamically gate self-training and enforce interventional consistency, encouraging the network to rely on robust anatomical semantics. Extensive experiments adapting from standard adult MRI (BraTS-GLI-Pre) to African low-field (BraTS-SSA) and pediatric (BraTS-PED) cohorts show improved performance over competing methods under clinical shifts, achieving absolute Dice improvements of +1.89% (SSA) and +2.82% (PED) over the source model. The code is available at this https URL.
[CV-174] Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning
链接: https://arxiv.org/abs/2605.17429
作者: Ningkang Peng,Jingyang Mao,Xiaoqian Peng,Weiguang Qu,Yanhui Gu
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Noisy-label methods often estimate sample reliability from forward-space signals such as loss, confidence, or entropy. These signals indicate whether a sample is difficult to predict, but they do not directly test whether its observed label induces a reliable parameter update. This gap matters because hard clean samples and mislabeled samples can have similar loss while inducing different updates. We recast reliability estimation as diagnosis of the observed-label update. The sample-wise empirical Fisher trace gives a backward-space measure of update energy: for the classifier layer, it factorizes into a prediction-residual term and a feature-sensitivity term, so it captures information beyond scalar loss. Trace, however, is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels. Across synthetic and real-world noisy-label benchmarks, RGC improves hard-clean preservation and accuracy under our evaluation protocol.
[CV-175] Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
链接: https://arxiv.org/abs/2605.17423
作者: Yiren Song,Huilin Zhong,Kevin Qinghong Lin,Haofan Wang,Mike Zheng Shou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
[CV-176] RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection
链接: https://arxiv.org/abs/2605.17368
作者: Shuchang Ye,Mingyuan Meng,Hao Wang,Usman Naseem,Jinman Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Anatomical structure labels for chest radiographs are essential for medical image segmentation and a broad range of downstream diagnostic tasks. However, annotating anatomy directly on 2D chest radiographs is labor-intensive and intrinsically ambiguous, as 3D anatomical structures are projected onto a single 2D plane where boundaries may overlap, be occluded, or appear only partially visible. Consequently, existing anatomy-labeled chest radiograph datasets remain limited in scale, anatomy coverage, and label reliability. To address these limitations, we introduce RadGenome-Anatomy, the largest anatomy-labeled chest radiograph dataset, containing over 10 million segmentation masks across 210 anatomical structures in 25,692 studies. It is constructed by projecting large-scale 3D anatomical masks from CT volumes into 2D radiographic space through canonical radiographic geometry. This shifts annotation from directly tracing uncertain 2D boundaries to defining anatomy in volumetric space, where structures that overlap or become partially invisible in radiographs remain spatially separable. As a result, each 2D mask represents the physically grounded projected footprint of a volumetrically defined structure. The scale and broad anatomical coverage of RadGenome-Anatomy, including structures that are overlapping, partially visible, or difficult to delineate directly, enable research on geometric measurements as explicit evidence for chest radiograph interpretation. We demonstrate this by training XAnatomy to predict structure-specific masks and derive clinically relevant measurements, achieving diagnostic accuracies of 96.4%, 95.6%, and 89.2% for cardiomegaly, kyphosis, and scoliosis, respectively.
[CV-177] Bridging Data Trials and Task Barriers: A Unified Framework for Sketch Biometric Identification
链接: https://arxiv.org/abs/2605.17367
作者: Decheng Liu,Bin Hu,Xinbo Gao,Dawei Zhou,Chunlei Peng,Nannan Wang,Ruimin Hu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The source code and models are publicly available at this https URL
Abstract:Different from existing cross-modality identification tasks (e.g., heterogeneous face recognition, sketch re-identification, etc.), we introduce a novel yet practical setting for these related identification tasks, named \textbfsketch biometric identification, which aims to continually train a unified model across different data domains, even diverse identification tasks. Sketch biometric identification faces challenges, including scarce real sketch data, high annotation costs, privacy risks, and insufficient generalization ability of cross-task models. Existing methods usually rely on limited real data or single-task optimization, making it difficult to effectively address the joint challenges of cross-modality and cross-task. This paper proposes a unified framework that integrates efficient synthetic sketch generation and task-sequential continual learning. First, we design an efficient pipeline to generate a large-scale and high-quality synthetic person and face sketch data, which significantly reduces costs and avoids privacy risks. Meanwhile, we enhance the model’s robustness by fusing real data. Second, we construct a universal unified framework for sketch biometric identification, which adopts a task-sequential training strategy: the model first completes sketch person re-identification learning on the person dataset; subsequently, it maintains the acquired person recognition capability through a trusted sample replay technique and seamlessly performs incremental training on the face dataset. This enables a single model to simultaneously handle the cross-task capabilities of multiple sketch biometric identification tasks. To support the study of the mentioned sketch biometric identification, we built a new large-scale benchmark, SketchUnified-BioID, with several practical evaluation protocols.
[CV-178] Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval
链接: https://arxiv.org/abs/2605.17365
作者: Xianke Chen,Daizong Liu,Yushuo Lou,Xin Tan,Xun Yang,Shuhui Wang,Xun Wang,Jianfeng Dong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Different from traditional text-to-image retrieval tasks, chat-based image retrieval allows the human-interactive system to iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user’s query intent across dialogue rounds. Although existing works have achieved great performance on this new task, they simply handle history query information either by directly concatenating all previous queries into a long textual sequence or by relying on large language models to reconstruct the current query from history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for the chat-based image retrieval task, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues, while a memory recall mechanism is further employed to prevent intent forgetting and enhance long-term semantic integrity. In addition, MAQIU also integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4% compared with the prior baseline ChatIR. Source code is available at this https URL.
[CV-179] Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction
链接: https://arxiv.org/abs/2605.17360
作者: Chaoqun He,Mingyang Xiang,Yingjing Xu,Bokai Xu,Junbo Cui,Jie Zhou,Yuan Yao,Lijie Wen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 6 figures
Abstract:Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.
[CV-180] UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings
链接: https://arxiv.org/abs/2605.17356
作者: Bo Zhao,Maosheng Pang,Chen Zhang,Huan Yang,Yixin Cao,Wei Ji
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.
[CV-181] GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction
链接: https://arxiv.org/abs/2605.17354
作者: Weiquan Lin,Yaoqing Hu,Liangchen Dai,Xu Tang,Xingyu Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular 3D hand reconstruction is intrinsically a geometric problem, yet RGB appearance features alone often struggle to resolve severe ambiguities caused by self-occlusions and hand-object interactions. While introducing depth can explicitly provide spatial cues, raw sensor-captured depth maps are extensively noisy and incomplete, limiting their usefulness for fine-grained hand reconstruction. To bridge this gap, we propose GeoHand, a novel framework that unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). Recognizing that these priors are oriented toward general scenes, we introduce a map-level GeoAdapter to recalibrate the spatial features, specifically adapting them for detailed hand reconstruction. Furthermore, to systematically integrate these adapted priors without overwhelming intrinsic RGB appearance cues, we employ a gated cross-modal token fusion strategy. Finally, to secure precise local articulation, we design a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction. By combining global geometric disambiguation with local refinement in a unified pipeline, GeoHand achieves state-of-the-art performance on FreiHAND, DexYCB, and HO3Dv3, especially under severe occlusions and hand-object interactions.
[CV-182] Position: Age Estimation Models Do Not Process Biometric Data ICML2026
链接: https://arxiv.org/abs/2605.17347
作者: Nikita Marshalkin
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, 3 tables. Accepted as a position paper at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:When a neural network estimates someone’s age from a photograph, does it process biometric data? The answer depends on whether identity-discriminative representations arise within the network during inference, a question that may seem trivial to ML researchers but triggers consent requirements under GDPR, statutory damages under BIPA, or high-risk AI classification under the EU AI Act. Yet no regulatory guidance addresses it. This position paper provides empirical evidence: 14 models evaluated across 3 face verification benchmarks show age estimators fall orders of magnitude short of identification thresholds. Age estimation models cannot identify individuals. We call on researchers to provide transparency about what systems store and can do, and on regulators to distinguish transient processing from template storage.
[CV-183] VoxShield: Protecting 3D Medical Datasets from Unauthorized Training via Frequency-Aware Inter-Slice Disruption MICCAI2026
链接: https://arxiv.org/abs/2605.17345
作者: Xinyao Liu,Zhipeng Deng,Wenhan Jiang,Haolin Wang,Xun Lin,Yafei Ou,Yefeng Zheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted version to MICCAI 2026 (Provisional Accept)
Abstract:The release of public 3D medical image segmentation (MIS) datasets accelerates clinical research but simultaneously heightens risks of unauthorized AI model training. While Unlearnable Examples (UE) offer protection by injecting imperceptible perturbations to prevent effective model learning, existing methods primarily target 2D scenarios. They neglect the volumetric spatial correlations and inter-slice anatomical consistency inherent in 3D medical volumes, which serve as critical learning priors for 3D segmentation networks. To bridge this gap, we propose VoxShield, a UE framework that explicitly targets the volumetric inductive biases of 3D networks. Our core insight is that by systematically dismantling the cross-slice continuity that 3D architectures rely on, we can fundamentally impair their spatial aggregation process. Specifically, we introduce an Inter-Slice Frequency Consistency Disruption mechanism that maximizes the spectral divergence between adjacent slices, injecting structural incoherence along the z -axis. Complementing this structural attack, a Semantic Prediction Disruption module is incorporated. By maximizing the \ell_1 divergence between clean and perturbed logits, it forces the injected noise to penetrate the entire network and corrupt the final semantic mapping. Experiments on BraTS19 and FLARE21 demonstrate that VoxShield successfully degrades 3D segmentation performance, reducing the DSC from 80.0% to near 0.0% and from 88.6% to 6.8%, respectively. All protections are achieved with minimal perturbation ( \epsilon=4/255 ) to preserve high visual fidelity. The code is available at this https URL.
[CV-184] GraphMAR: Geometry-Aware Graph Learning Framework for Spatially Adaptive CT Metal Artifact Reduction
链接: https://arxiv.org/abs/2605.17343
作者: Zilong Li,Chenglong Ma,Yiming Lei,Yuanlin Li,Jing Han,Jiannan Liu,Huidong Xie,Junping Zhang,Yi Zhang,Hongming Shan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computed tomography (CT) metal artifact reduction (MAR) aims to reduce the severe streaking artifacts induced by metallic implants and other high-density objects. Effective MAR generally requires both accurate artifact localization and artifact removal. Sinogram-domain methods can exploit explicit geometric cues, such as metal traces, to identify metal-corrupted measurements, while requiring raw projection data, which is often unavailable in clinical and practical scenarios. Image-domain methods are more flexible and widely applicable, yet they usually lack comparable geometric guidance, limiting their ability to localize artifacts and leading to suboptimal results. To address this limitation, we propose GraphMAR, a geometry-aware learning framework for explicit artifact identification and spatially adaptive MAR in the image domain. The key idea is to introduce graph-based geometric modeling as an image-domain analogue of sinogram metal traces. Specifically, we first construct a geometric graph from the metal mask and derive a geometric density graph that coarsely localizes artifact-prone regions according to inter-implant geometry. We then design GraphMoE, a graph-routed mixture-of-experts module that builds a polar-coordinate artifact graph in feature space and adaptively routes different experts to different spatial regions for MAR. By aligning the learned routing maps with the geometric density graph, GraphMAR provides explicit and interpretable artifact localization while enabling region-adaptive artifact reduction. Experiments on both simulated and real-world datasets demonstrate that GraphMAR achieves superior MAR performance compared with existing methods. To the best of our knowledge, this is the first work to introduce graph-based modeling for CT MAR and to enable explicit artifact identification in the image domain, improving both restoration quality and interpretability.
[CV-185] Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment
链接: https://arxiv.org/abs/2605.17341
作者: Jiaqing Li,Yajuan Lu,Xiaochuan Shi,Gang Wu,ZhongYuan Wang,Chao Liang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risk. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model’s training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box method relies on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box method depends on large-scale statistical distributions, which struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for strict black-box and single-sample setting that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
[CV-186] actile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision Language and Contact-Driven Paradigms
链接: https://arxiv.org/abs/2605.17336
作者: Zhixiang Cao,Di Tian,Runwei Guan,Yanzhou Mu,Xiaolou Sun,Shaofeng Liang,Daizong Liu,Tao Huang,Yutao Yue,Henghui Ding,Bin Fang,Alex Zhou,Qing-Long Han,Hui Xiong
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 20 pages, 8 figures
Abstract:Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, the existing researches remain fragmented across disparate datasets, sensing modalities, and tasks, lacking a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources ranging from Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, focusing on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, emphasizing feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.
[CV-187] Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
链接: https://arxiv.org/abs/2605.17327
作者: Yuantai Zhang,Jiaqi Yang,Huajian Zeng,Changhao Chen,Haoang Li,Liang Li,Dezhen Song,Xingxing Zuo
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at this https URL.
[CV-188] VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers
链接: https://arxiv.org/abs/2605.17312
作者: Yiren Song,Wangzi Yao,Haofan Wang,Mike Zheng Shou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and this http URL address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate SOTA performance in style fidelity, temporal consistency, and content preservation.
[CV-189] SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection
链接: https://arxiv.org/abs/2605.17311
作者: Zixi Wei,Huixuaun Zhang,Xiaojun Wan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.
[CV-190] Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models
链接: https://arxiv.org/abs/2605.17310
作者: Zhiqiang Wang,Dongrui Liu,Yan Li,Zonghao Ying,Wei Xue,Wenhan Luo,Yike Guo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by the observation, we propose \textbfAttention Hijacking, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.
[CV-191] StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting CVPR2026
链接: https://arxiv.org/abs/2605.17309
作者: Aleksandr Simonyan,Nipun Jindal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the SynData4CV Workshop, CVPR 2026. 8 pages + 1 page of references, 5 figures, 4 tables
Abstract:We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.
[CV-192] LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos
链接: https://arxiv.org/abs/2605.17303
作者: Chenyi Xu,Yihao Wu,Liqi Yan,Chao Yang,Jianhui Zhang,Fangli Guan,Pan Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recovering a dynamic 3D scene from a long monocular video is crucial for dense geometry, camera motion, and temporal correspondence to remain consistent in a shared coordinate system. Existing methods face two key challenges: (1) feed-forward reconstruction models provide accurate local predictions but are limited to short clips, and (2) long-range trackers preserve correspondences without producing dense sequence-level reconstruction. This paper presents LongDPM, a novel overlap-aware framework for scalable long-range monocular dynamic reconstruction. First, LongDPM processes long videos in overlapping chunks, keeping inference memory bounded by the chunk length. Second, it connects chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction. Third, it associates dynamic identities across chunk boundaries and fuses matched trajectories to recover coherent long-range 3D motion. Experimental results demonstrate that LongDPM achieves superior long-range reconstruction and tracking performance, reducing dense tracking EPE over V-DPM on PointOdyssey, Kubric-F, and Kubric-G, while obtaining the best TUM-dynamics ATE for camera pose estimation.
[CV-193] HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
链接: https://arxiv.org/abs/2605.17294
作者: Yuyao Zhang,Alexander Huang-Menders,Yu-Wing Tai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (\textbfLocal-Window MMDiT) that refines only edited regions within the original high-res image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (\textbfInference Acceleration) , ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Please check our \hrefthis https URL\textbfProject Page.
[CV-194] LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation
链接: https://arxiv.org/abs/2605.17287
作者: Jun Ma,Zhenye Yang,Ruichen Zhou,Pei Zhang,Huan Li,Jinpeng Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, 3 tables
Abstract:Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a \textbfLanguage-guided \textbfInterference-aware \textbfSpatial-Frequency \textbfAttention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder and orthogonal regularization, we explicitly separate gaze features from appearance interference. Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness against occlusions and lighting variations. The code repository is available at this https URL.
[CV-195] HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone
链接: https://arxiv.org/abs/2605.17286
作者: Guanyiman Fu,Jingtao Li,Zihang Cheng,Zhuanfeng Li,Diqi Chen,Yan Xu,Fengchao Xiong,Jianfeng Lu,Jun Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, to address the scarcity and inconsistency of labels, we introduce a multi-source pseudo-labeling method that fuses semantic representations from both spatial structures generated by SAM2 and fine-grained spectral material information extracted by HyperFree. Third, to compensate for limited dataset scale and enrich scene diversity, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our hyperspectral backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation \mathrmAcc_\mathrmM , a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available at this https URL .
[CV-196] CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving
链接: https://arxiv.org/abs/2605.17284
作者: Ruiyang Zhu,Yuehan He,Boyuan Zheng,Zesen Zhao,Ahmad Chalhoub,Qingzhao Zhang,Z. Morley Mao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 9 pages + appendix
Abstract:End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs’ latent space: (i) at the VLA’s hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.
[CV-197] Beyond Detection: A Structure-Aware Framework for Scene Text Tracking ICML2026
链接: https://arxiv.org/abs/2605.17270
作者: Chenmin Yu,Liu Yu,Daiqing Wu,Gengluo Li,Zeyu Chen,Yu Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026. Code is available at: [ this https URL ]
Abstract:Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets the new state-of-the-art on all three benchmarks, outperforming previous best trackers by up to 11.97% AUC on \textBOVText_\textSOT . Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.
[CV-198] Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation CVPR2026
链接: https://arxiv.org/abs/2605.17268
作者: Nicanor Mayumu,Xiaoheng Deng,Patrick Mukala
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track
Abstract:We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that output natural-language rationales with trajectories may be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 missed pedestrians in one-third of pedestrian-relevant scenes; (iii) 97.7% trajectory fragility under mild visual perturbations; and (iv) only 48.3% mean reasoning-action consistency, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.
[CV-199] EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning
链接: https://arxiv.org/abs/2605.17262
作者: Zeyu Wang,Chang Liu,Eduardus Tjitrahardja,Yuntao Wang,Borislav Pavlov,Fangfei Gou,Jose Manuel Davila,Dai Shi,Ran Xu,Yue Pan,Jiayi Tan,Shuting Chang,Qi Wang,Jinzhao Li,Jiacheng Hua,Yifei Huang,Jingwei Sun,Yu Zhang,Liuxin Zhang,Guocai Yao,Jia Jia,Yin Li,Qianying Wang,Yuanchun Shi,Miao Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite extensive efforts on egocentric video datasets and benchmarks, understanding users’ internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users’ interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users’ internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users’ subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: this https URL
[CV-200] LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLM s
链接: https://arxiv.org/abs/2605.17260
作者: Jihwan Kim,Nikhil Parthasarathy,Danfeng Qin,Junhwa Hur,Deqing Sun,Bohyung Han,Ming-Hsuan Yang,Boqing Gong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on “post-hoc” token reduction – reducing visual tokens after feature extraction to alleviate the LLM’s computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier – compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8 \times more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.
[CV-201] Monocular Depth Perception Enhancement Based on Joint Shading/Contrast Model and Motion Parallax (JSM)
链接: https://arxiv.org/abs/2605.17252
作者: Seungchul Ryu,Hyunjin Yoo,Tara Akhavan
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Stereoscopic 3D displays adopt a binocular depth cue to provide depth perception. However, users should be equipped with expensive special devices to appreciate depth perception based on the binocular depth cues. Also, visual fatigue induced by the stereoscopic display is still a challenging open problem. In order to overcome this limitation, this paper proposes a novel framework, JSM, to enhance monocular depth perception, significantly improving both depth volume perception and depth range perception. The proposed framework can not only provide an enhanced depth perception on any conventional 2D display devices, but also it can be applicable to the 3D display devices since it is complementary to binocular depth cues. The qualitative evaluation, ablation study, and subjective user evaluation proved the advantages and practicability of the proposed framework.
[CV-202] Image-to-Video Diffusion: From Foundations to Open Frontiers
链接: https://arxiv.org/abs/2605.17248
作者: Xianlong Wang,Wenbo Pan,Shijia Zhou,Ke Li,Yuqi Wang,Zeyu Ye,Hangtao Zhang,Leo Yu Zhang,Xiaohua Jia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based \textitimage-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.
[CV-203] Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization Statistical Validation and Clinical Interpretability
链接: https://arxiv.org/abs/2605.17236
作者: Nisreen Albzour,Sarah S. Lam
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which resulted in improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was utilized to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, in which random horizontal flipping and class weighting (0.7 x 1.3) were identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, which include nuclear regions, cell boundaries, and chromatin texture, which align with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, which fulfills both clinical performance and transparency requirements essential for medical AI deployment.
[CV-204] OPTNet: Ordering Point Transformer Network for Post-disaster 3D Semantic Segmentation ICPR
链接: https://arxiv.org/abs/2605.17197
作者: Nhut Le,Ehsan Karimi,Maryam Rahnemoonfar
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for International Conference on Pattern Recognition (ICPR) 2026
Abstract:Post-disaster damage assessment requires rapid and accurate semantic segmentation of 3D point clouds to identify critical infrastructure such as damaged buildings and roads. Early Point Transformers (e.g., PTv1, PTv2) relied on computationally expensive neighbor searching (k-NN) and Farthest Point Sampling (FPS). To improve efficiency, recent architectures like Point Transformer V3 (PTv3) adopted static serialization methods, such as Hilbert curves or Z-order, to organize unstructured points for window-based attention. However, these fixed orderings are not optimal for capturing the complex geometry of disaster scenes. In this paper, we propose OPTNet (Ordering Point Transformer Network), which introduces a learnable Point Sorter module. OPTNet utilizes a self-supervised ordering loss to dynamically predict an optimal permutation that maximizes the locality of the attention mechanism. We evaluate our method on the 3DAeroRelief dataset, significantly outperforming state-of-the-art baselines.
[CV-205] MiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning
链接: https://arxiv.org/abs/2605.17179
作者: Chengyan Wang,Haoyu Chen,Hui Wei,Yueyi Yang,Yunquan Chen,Guoying Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, there are no existing large-scale datasets supporting the pre-training of the MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, featuring key contributions with a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players’ public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild, video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture presentation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.
[CV-206] Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives
链接: https://arxiv.org/abs/2605.17165
作者: Santosh Premi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), SomethingSomething V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.
[CV-207] When Bits Break Recourse: Counterfactual-Faithful Quantization
链接: https://arxiv.org/abs/2605.17160
作者: Chaymae Yahyati,Ismail Lamaakal,Khalid El Makkaoui,Ibrahim Ouahbi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 57 pages, 32 tables, 26 figures
Abstract:Quantization can preserve predictive accuracy under low-bit deployment while silently breaking algorithmic recourse: an actionable change that flips a decision before quantization may fail after quantization, or become substantially more costly. We formalize counterfactual sensitivity under quantization through validity, cost, and direction stability, and introduce two metrics: Validity Drop (VD) and Counterfactual Recourse Gap (CRG) that reveal recourse failures invisible to accuracy. We propose Counterfactual-Faithful Quantization (CFQ), which trains quantizer parameters and mixed-precision bit allocation to preserve counterfactual behavior by enforcing the target outcome at teacher recourse points under a global bit budget. A margin-based analysis gives a sufficient condition for recourse transfer under bounded quantization perturbations. Experiments on Adult, German Credit, and COMPAS show that accuracy-matched baselines can significantly degrade recourse stability, while CFQ maintains accuracy and substantially improves VD and CRG across bit budgets.
[CV-208] Collaborative Learning for Semi-Supervised LiDAR Semantic Segmentation ICML2026
链接: https://arxiv.org/abs/2605.17135
作者: Bin Yang,Alexandru Paul Condurache
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper was accepted by ICML2026
Abstract:Annotating large-scale LiDAR point clouds for 3D semantic segmentation is costly and time-consuming, which motivates the use of semi-supervised learning (SemiSL). Standard LiDAR SemiSL methods typically adopt a two-step training paradigm, where pseudo-labels are separately generated from a single distillation source, either from the same or another LiDAR representation. Such supervision relies on a unique source of pseudo-labels, which can reinforce confirmation bias and propagate errors during training, ultimately limiting performance. To address this challenge, we introduce CoLLiS, a novel framework that leverages Collaborative Learning for LiDAR Semi-supervised segmentation. Unlike prior paradigms with decoupled pseudo-labeling and training phases, CoLLiS trains multiple representations collaboratively in a single step by treating them as coequal students. Each student is adaptively distilled from multiple representations, while inter-student disparities are monitored online to resolve contradictory supervision and effectively mitigate confirmation bias. Extensive experiments on three datasets demonstrate that CoLLiS consistently outperforms state-of-the-art LiDAR SemiSL methods, with particularly strong gains in low-label regimes.
[CV-209] CAM-VFD: Cross-Attention Multimodal Video Forgery Detection
链接: https://arxiv.org/abs/2605.17133
作者: Hoda Osama Elkhodary,Sherin Mostafa Youssef,Marwa Elshenawy,Dalia Sobhy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ( p0.001 , Cohen’s d=0.68 ). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at \urlthis https URL.
[CV-210] A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation
链接: https://arxiv.org/abs/2605.17131
作者: Minhas Kamal,Hiranya Garbha Kumar,Balakrishnan Prabhakaran
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 2 tables and 16 figures, github repo: this https URL
Abstract:Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation-invariant or self-attention-based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.
[CV-211] Principal Component Analysis for Lunar Crater Detection
链接: https://arxiv.org/abs/2605.17125
作者: Travis Driver,John A. Christian
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Optical navigation is a critical component for lunar orbiter and lander missions. Image-based crater identification has emerged as a promising technology for optical navigation due to the abundance of craters on the lunar surface and the availability of extensive crater catalogs. Moreover, due to the relative morphological homogeneity among lunar craters, template matching has been identified as a promising approach for identification. In this paper, we propose EigenCrater, an automated crater template generation method based on principal component analysis of crater digital elevation maps (DEMs). We demonstrate superior detection and position estimation performance relative to hand-picked templates on simulated lunar imagery.
[CV-212] Markerless Motion Capture for Biomechanical Whole-Body Kinematic Estimation in Infants
链接: https://arxiv.org/abs/2605.17120
作者: Divya Joshi,J.D. Peiffer,Colleen Peyton,R. James Cotton
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to EMBC 2026
Abstract:arly identification of motor impairment in infancy relies on expert visual assessment of spontaneous movement, motivating the development of automated, objective alternatives. One promising approach is using computer vision, which benefits from high quality pose estimation from video. In this study, we systematically evaluated three state-of-the-art pose estimation frameworks (MeTRAbs-ACAE, SAM 3D Body, and Sapiens) on 100 videos over 13 sessions of 8 infants recorded with a multi-view markerless motion capture system. We quantified keypoint detection accuracy using reprojection error, geometric consistency, and Procrustes-aligned 3D position error, and demonstrated proof-of-concept for fitting an inverse kinematic framework to infant data. While Sapiens achieved the lowest reprojection error and highest geometric consistency of the methods evaluated (22.8 pixels and 0.82, respectively), SAM 3D Body provided the most comprehensive 3D information for kinematic reconstruction with Procrustes-aligned position errors of 19 to 28 mm. We demonstrate in a case comparison example that biomechanical models fit to SAM 3D estimates distinguish representative movement patterns in infants related to motor development, as identified by a clinical expert. Together, these findings highlight both the promise and current limitations of 3D pose estimation for infant biomechanics and establish preliminary groundwork for scalable, video-based assessment of early motor development.
[CV-213] VoxScene: Anchor-Conditioned Voxel Diffusion for Indoor Scene Arrangement
链接: https://arxiv.org/abs/2605.17102
作者: Haotian Mao,Yuhan Huang,Jiatao Lin,Yang Zhao,Hui Wang,Yiheng Zhang,Yuwang Wang,Chenliang Zhou,Yan Zhang,Fangcheng Zhong,Xubo Yang
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present VoxScene, a novel anchor-conditioned voxel diffusion framework tailored for 3D scene synthesis. Current data-driven layout generation techniques typically rely on bounding proxies or implicit representations, which overlook volumetric structures. This geometric blindness inevitably leads to severe physical collisions and structural entanglement, particularly in densely populated environments. To overcome these limitations, we shift the paradigm to an explicit, object-centric voxel representation. Our pipeline sequentially synthesizes discrete volumetric occupancies conditioned on prior anchors and local context. By exploiting the mutually exclusive nature of discrete voxels, our approach eliminates spatial ambiguities and guarantees collision-free arrangements, even in highly complex environments. Furthermore, the synthesized high-fidelity voxel grids serve as discriminative geometric queries for downstream asset retrieval. Extensive experiments demonstrate the universality of our method, achieving state-of-the-art physical plausibility and unlocking shape diversity compared to existing layout planners.
[CV-214] Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC
链接: https://arxiv.org/abs/2605.17095
作者: Angela Srbinovska,Christopher Homan,Adrian Martin,Ernest Fokoué
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 10 figures, 9 tables
Abstract:Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this remains operationally opaque. That is, analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach to process BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled using a privacy-conscious protocol. Each window is labeled with two dimensions of information: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows for which insufficient evidence exists due to darkness, blur or occlusion. We train models to classify windows based on these two axes using frames sampled from each window encoded using CLIP model and aggregated into a window-level representation. We extract dense optical flow statistics for each window to capture motion intensity. On test windows the best context model achieves 78.75% accuracy, and the best-accuracy activity model achieves 88.33%. We also included integrity audits to show the results and how the visual timeline representations support faster incident review and make the officer training workflow more practical.
[CV-215] he Learnability Gap in Medical Latent Diffusion
链接: https://arxiv.org/abs/2605.17087
作者: Mischa Dombrowski,Felix Nützel,Bernhard Kainz
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative data augmentation with latent diffusion models is a promising strategy for addressing class imbalance in medical imaging, yet current approaches focus on perceptual fidelity and domain-specific autoencoder fine-tuning while neglecting a more fundamental bottleneck. We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. Across five autoencoder families and four medical benchmarks spanning chest radiography, dermatoscopy, computed tomography, and echocardiography, we show that this gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and that medical-domain fine-tuning of the autoencoder does not close it. To probe and partially narrow the gap, we develop noise-conditioned latent classifiers with FiLM layers and image-space distillation that offer 64x throughput and 120x memory gains over image-space models while serving as diagnostic tools for latent space quality. Our analysis provides a new framework for evaluating autoencoder latent spaces and identifies their structure, rather than their fidelity or domain specificity, as the primary obstacle to closing the performance gap between real and synthetic medical training data.
[CV-216] EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models
链接: https://arxiv.org/abs/2605.17070
作者: Haozhe Shan,Xiancong Ren,Han Dong,Haoyuan Shi,Yingji Zhang,Jiayu Hu,Yi Zhang,Yong Dai,Bin Shen,Lizhen Qu,Zenglin Xu,Xiaozhu Ju
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex visual-text alignment for physical interactions. Specifically, models exhibit critical bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection. EPIC-Bench provides a robust foundation and actionable insights for advancing the next generation of vision-driven embodied models.
[CV-217] hermal-Only Crowd Counting with Deployment-Time Privacy Protection
链接: https://arxiv.org/abs/2605.17042
作者: Yifei Qian,Zhongliang Guo,Chun Tong Lei,Bowen Deng,Chun Pong Lau,Xiaopeng Hong,Michael P. Pound
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.
[CV-218] StreamingEffect: Real-Time Human-Centric Video Effect Generation
链接: https://arxiv.org/abs/2605.17019
作者: Yiren Song,Cheng Liu,Yuxin Jiang,Mike Zheng Shou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present \textbfStreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct \textbfVideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.
[CV-219] RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos CVPR2026
链接: https://arxiv.org/abs/2605.17014
作者: Lixin Xue,Chengwei Zheng,Georgios Paschalidis,Chen Guo,Manuel Kaufmann,Juan Zarate,Dimitrios Tzionas
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL
Abstract:Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. Code and data are available at this https URL.
[CV-220] opo-GS: Continuous Volumetric Embedding of High-Dimensional Data via Topological Gaussian Splatting
链接: https://arxiv.org/abs/2605.17011
作者: João Paulo Gois,Luis Gustavo Nonato
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 2 figures
Abstract:Dimensionality reduction algorithms map high-dimensional data into visualizable 2D or 3D spaces, but traditionally rely on a discrete point-cloud paradigm. This discrete abstraction is susceptible to visual occlusion and artificial discontinuities, often failing to represent the continuous density of the underlying manifold. To address these limitations, we introduce Topo-GS, a framework that repurposes 3D Gaussian Splatting (3DGS) to cast multidimensional projection as a meshless volumetric reconstruction process. Instead of standard photometric losses, Topo-GS is driven by local geometric constraints. By solving orthogonal Procrustes targets, the optimization enforces an As-Rigid-As-Possible prior while explicitly aligning the spatial covariance of each Gaussian to the local tangent space. Recognizing that unrolling data of varying intrinsic dimensionalities requires distinct spatial treatments, we utilize a topology-aware strategy that tailors the loss formulation to preserve either continuous 1D trajectories or cohesive 2D surfaces. Quantitative and visual evaluations demonstrate that Topo-GS successfully transforms discrete scatter plots into continuous volumetric representations, where inherent projection distortions explicitly manifest as observable geometric variations, while preserving local topological fidelity comparable to discrete baselines.
[CV-221] DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing
链接: https://arxiv.org/abs/2605.16990
作者: Jinxin Ai,Matthias Nießner,Ziya Erkoç
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 5 figures
Abstract:While 2D diffusion models have achieved remarkable success in identity-preserving personalization, extending this capability to 3D assets remains a significant challenge due to the complexities of multi-view consistency and spatial control. Inspired by these 2D advancements, we present a novel personalization method for text-guided 3D editing that enables compositional, object-level control through natural language. Given a 3D input, we render orthogonal views and extract object-level segmentation masks to isolate semantic components. We then learn distinct token embeddings for each component through a tailored two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning of multi-view diffusion model. During inference, these disentangled tokens seamlessly compose with editing prompts to generate multi-view consistent images, which are subsequently lifted into high-fidelity textured 3D meshes. Extensive evaluations across diverse editing scenarios demonstrate that our method successfully transfers the flexibility of 2D personalization to 3D, achieving state-of-the-art edit faithfulness and identity preservation compared to existing baselines.
[CV-222] Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction
链接: https://arxiv.org/abs/2605.16981
作者: Kejun Ren,Lei Jin,Tianxin Huang,Lianming Xu,Li Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures
Abstract:Streaming 3D reconstruction under a strict constant-memory budget hinges on how the recurrent state is updated as the stream evolves. We profile TTT3R-style per-token gates across five benchmarks and discover a structural bottleneck: the gate is intrinsically bounded in magnitude (median 0.31 ; never exceeding 0.6 ) and nearly frame-invariant, yielding an effective memory horizon of only \sim 3 frames per state token, which serves as the structural origin of long-sequence drift. We trace this to a missing axis: existing inference-time methods modulate updates only at the per-token, intra-frame level, while the orthogonal frame-level question of \emphhow strongly each frame should contribute to the state has been treated as content-independent. We close this gap with a scalar frame-level gate \alpha_t \in (0, 1] derived in closed form from frame-to-frame changes of internal features – a continuous relaxation of classical Simultaneous Localization and Mapping (SLAM) keyframe selection that requires no parameters, no training, and no extra forward pass. Across six benchmarks spanning camera pose, video depth, and 3D reconstruction at sequence lengths up to 4,541 frames, our gate cuts ATE by 51% on long TUM-RGBD pose sequences, reduces AbsRel by 12.8% on Bonn video depth, and on KITTI long-sequence pose estimation surpasses both LongStream and Keyframe-VO, while retaining strictly constant memory at zero training cost.
[CV-223] Statistical Hand Shape Modeling from Clinical CT Scans Using Deep Learning and Implicit Skinning
链接: https://arxiv.org/abs/2605.16980
作者: Gokce Guven,Hasan Fehmi Ates,Deniz Karasahin,Kaan Erdogan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate segmentation and statistical shape modeling of hand anatomy have significant implications for medical diagnostics, ergonomics, and biomechanics. This study proposes an AI-assisted reconstruction pipeline for segmenting and analyzing hand anatomy from 1,271 elbow-to-hand (e2h-CT) computed tomography scans. A Pix2Pix-based conditional generative adversarial network is first employed to remove plaster cast and background artifacts from CT volumes. The cleaned scans are then processed in 3D Slicer to extract skin and bone masks, which are converted into closed-surface mesh models. Segmented bone meshes are used to construct skeletal representations, enabling implicit skinning to align all hand models into a standardized anatomical configuration. Subsequently, non-rigid registration is performed on the hand skin surfaces using the Geodesic Based Coherent Point Drift++ (GBCPD++) algorithm to establish point-wise correspondence across subjects. Principal Component Analysis (PCA) is then applied to the registered models to quantify anatomical shape variability. The Pix2Pix preprocessing stage achieved a Dice coefficient of 0.9856 and an IoU of 0.9720 on the held-out test set. Statistical modeling was performed on a subset of 90 scans in which the fingers were fully visible and anatomically separated. The resulting statistical shape distributions demonstrate strong agreement with the U.S. Army Anthropometric Survey (ANSUR II), supporting the anatomical validity of the reconstructed models. The proposed methodology demonstrates significant potential for advancing biomechanical modeling, ergonomic optimization, prosthetic design, and precision medical diagnostics.
[CV-224] SHED: Style-Homogenized Embedding Alignment for Domain Generalization
链接: https://arxiv.org/abs/2605.16973
作者: Kai Gan,Tong Wei
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from encoders in CLIP. During training, SHED removes domain-specific style centroids from both image embeddings computed per source domains and text embeddings which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).
[CV-225] Expandable Compressible Mineable: Open-World Thermal Image Restoration ICML2026
链接: https://arxiv.org/abs/2605.16967
作者: Pu Li,Huafeng Li,Yafei Zhang,Wen Wang,Neng Dong,Jie Wen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2026
Abstract:In open-world settings, thermal infrared (TIR) image degradations continuously emerge and evolve, while most existing all-in-one restoration methods are built on a closed-set assumption and struggle to continually adapt to novel degradations. To address this, we propose ECMRNet, an Expandable, Compressible, and Mineable Restoration Network for open-world TIR restoration from a continual learning perspective. Conceptually, ECMRNet unifies continual degradation learning as an “expand-compress-mine” closed-loop process, enabling sustained adaptation to new degradations with controllable evolution. Structurally, ECMRNet decomposes intermediate representations into group-isolated subspaces, and achieves strict parameter isolation and fast adaptation to new degradations by freezing historical groups and isomorphically expanding new ones. To curb model growth as tasks accumulate, we present Structural Entropy Pruning, which identifies and removes redundant channel groups via two-dimensional structural entropy minimization, achieving information contribution-driven adaptive compression. Moreover, we design a Sub-degradation Knowledge Mining Module that dynamically retrieves and recombines transferable components from historical representations to improve restoration under compound degradations. Experimental results demonstrate that ECMRNet achieves superior overall performance across diverse single and compound degradations while using fewer parameters and lower computational cost. The source code is available at this https URL.
[CV-226] OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics
链接: https://arxiv.org/abs/2605.16962
作者: Jinjie Shen,Zheng Huang,Yuchen Zhang,Yujiao Wu,Yaxiong Wang,Lechao Cheng,Shengeng Tang,Tianrui Hui,Nan Pu,Zhun Zhong
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages
Abstract:Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics – particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose \textbfOmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce \textbfTree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose \textbfChecker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at \urlthis https URL.
[CV-227] Latent Action Control for Reasoning -Guided Unified Image Generation
链接: https://arxiv.org/abs/2605.16961
作者: Fuxiang Zhai,Sixiang Chen,Yingjin Li,Shuaibo Li,Jianyu Lai,Tengjun Huang,Lei Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.
[CV-228] Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing
链接: https://arxiv.org/abs/2605.16951
作者: Shaodong Xu,Zexian Li,Zhendong Wang,Litong Gong,Tiezheng Ge,Wengang Zhou,Bo Zheng,Houqiang Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A fundamental challenge in image editing lies in preserving spatial locality: edits should improve targeted content without inadvertently altering surrounding regions. However, most optimization-based editing approaches treat images as holistic entities, causing global policy updates that undermine locality and introduce undesired context changes. We observe that this issue stems from a mismatch between localized editing intent and globally applied optimization signals. Motivated by this insight, we propose Edit-GRPO, preserving Locality while optimizing image editing, a locality-preserving policy optimization framework that explicitly decouples editing and preservation objectives. By assigning region-specific optimization signals to edit and non-edit areas, Edit-GRPO aligns policy updates with the spatial structure of editing tasks, enabling localized improvements while maintaining global visual coherence. This design effectively suppresses common artifacts such as context distortion and boundary inconsistency. Extensive experiments across diverse image editing scenarios demonstrate that Edit-GRPO significantly improves locality preservation while maintaining strong editing performance compared to existing optimization-based methods, validating the generality and effectiveness of the proposed framework.
[CV-229] Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers
链接: https://arxiv.org/abs/2605.16949
作者: Shaodong Xu,Zhendong Wang,Litong Gong,Zexian Li,Wengang Zhou,Tiezheng Ge,Houqiang Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features-as pioneered by Representation Alignment (REPA)-can substantially accelerate training and improve generation fidelity. Subsequent analysis(e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, mostly existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework to enforce consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.
[CV-230] DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
链接: https://arxiv.org/abs/2605.16937
作者: Yi Zuo,Huimin Wu,Lingling Li,Fang Liu,Licheng Jiao,Qing Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Trajectory-controlled video generation has become essential for controllable video generation. While current methods perform well under small-view camera motions, they degrade significantly with large-view motions. Existing solutions for extreme-view synthesis typically require dedicated video pairs, demanding substantial annotation effort. To address these limitations, we propose Dynamic Extreme VIew Synthesis-GRPO (DEVIS-GRPO), a GRPO-based framework for trajectory-controlled video generation, the first online policy gradient method for extreme view video generation. Central to our approach is a novel sampling strategy: Accumulative Dynamic Extreme VIew Synthesis (ADEVIS), which achieves large-view camera motions by progressively accumulating small-view increments. This method delivers two key advantages: 1) enhanced training efficiency, as it eliminates the need to warm-start the policy model by collecting expensive paired large-view videos, and 2) increased sampling diversity, achieved by flexibly varying trajectory configurations. Finally, we designed a multi-level consistency-quality reward function to select high-quality samples for model optimization. Experiments on the Kubric-4D, iPhone, and DL3DV datasets demonstrate our method’s superiority. On Kubric-4D, we achieve relative improvements of 21.57% in PSNR and 7.31% in SSIM over the second-best method in non-occlusion areas. On iPhone, LPIPS is reduced by 18.56%.
[CV-231] P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction CVPR2026
链接: https://arxiv.org/abs/2605.16925
作者: Kota Shimomura,Hidehisa Arai,Tsubasa Takahashi,Takayoshi Yamashita,Hironobu Fujiyoshi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted CVPR2026 main
Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.
[CV-232] Neuroscience-inspired Staged Representation Learning with Disentangled Coarse- and Fine-Grained Semantics for EEG Visual Decoding
链接: https://arxiv.org/abs/2605.16923
作者: Xiang Gao,Hui Tian,Yanming Zhu,Xuefei Yin,Alan Wee-Chung Liew
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures
Abstract:Decoding visual information from electroencephalography (EEG) signals remains a fundamental challenge in brain-computer interfaces and medical rehabilitation. Existing EEG visual decoding methods mainly focus on learning a single global EEG embedding for cross-modal alignment, but they largely overlook the staged and hierarchical characteristics of human visual processing. To address this limitation, we propose a neuroscience-inspired staged representation learning framework that reformulates EEG visual decoding as a stage-specific representation decomposition problem. The proposed framework organizes EEG representation learning into three complementary phases: low-level visual representation learning, high-level semantic representation learning, and integrative information fusion. To strengthen semantic modeling, we further introduce a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information. In addition, semantic latent channels are introduced as computational representation channels generated from observed visual EEG signals, expanding the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation. Additional analyses, including layer-wise retrieval, temporal accumulation, expanded multi-image retrieval, and ablation studies, further support the effectiveness of staged decomposition and structured semantic modeling. These results suggest that explicitly modeling staged perceptual, semantic, and integrative representations provides an effective neuroscience-inspired framework for EEG-based visual decoding.
[CV-233] Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation
链接: https://arxiv.org/abs/2605.16922
作者: Youngdong Jang,Gyeongrok Oh,Jong Wook Kim,Hyunju Ryu,Hyung-gun Chi,SeungHyeon Kim,Seungryong Kim,Jonghyun Choi,Sangpil Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.
[CV-234] HighSync: High-Quality Lip Synchronization via Latent Diffusion Models
链接: https://arxiv.org/abs/2605.16918
作者: Saeed Firouzi Daghigh,Majid Iranpour Mobarekeh,Mostafa Alavi,Mehdi Bagheri
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures, 5 tables
Abstract:We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512*512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: this https URL
[CV-235] VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction
链接: https://arxiv.org/abs/2605.16911
作者: Xun Chen,Tianchen Deng,Rui Wang,Fangjinhua Wang,Junyi Ma,Hongming Shen,Hesheng Wang,Danwei Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into higher resolutions, allocating computation by information density while substantially reducing decoder cost. Extensive evaluations demonstrate the effectiveness and accuracy of our approach. On SurroundOcc-nuScenes, VGGT-Occ achieves 33.00% IoU and 21.08% mIoU ( T=1 ), and 33.64% IoU and 21.43% mIoU with T=2 inference, outperforming existing methods, with only \sim41 M trainable parameters in the occupancy head. Code will be released publicly.
[CV-236] BIDO: A Biometric Identity Online Authentication Framework
链接: https://arxiv.org/abs/2605.16908
作者: Aditya Mithra,Sibi Chakkaravarthy S,Srinivas Kankanala
类目: Emerging Technologies (cs.ET); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Security systems demand continuous, cryptograph- ically robust identity verification without requiring subjects to carry physical tokens, smart cards, or dedicated hardware authenticators. This paper presents BIDO (Biometric Identity Online), a device-free authentication standard that achieves Au- thenticator Assurance Level 2 (AAL2) per NIST SP 800-63B with- out storing long-lived biometric templates, facial images, or any other form of Personally Identifiable Information (PII). BIDO derives Elliptic Curve Digital Signature Algorithm (ECDSA) key material deterministically from a live biometric measurement salted with a user-defined memorized secret at every authen- tication event, eliminating persistent private-key storage while enabling verification from any commodity sensor terminal. The generated credentials are non-discoverable (non-resident) Web Authentication (WebAuthn) credentials, fully compatible with all FIDO2-enabled websites and services without modification on the server side. A multi-stage pipeline, comprising capture of 200 valid biometric samples, feature extraction using the Dlib 68- point facial landmark predictor, affine face alignment, frontality gating, Euclidean distance computation from the inter-eye mid- point, floor-division quantization with divisor q = 8, inter-session drift stabilization, and majority-voting SHA-256 hash binding, produces a Verification Seed (Vseed) from which the WebAuthn credential is transiently derived and immediately zeroized after signing. Evaluated against three prominent face benchmarks (VGGFace2, LFW, and MegaFace), achieving 99.51% verification accuracy on LFW and 92.14% Rank-1 identification accuracy on MegaFace Challenge 1 at 10^6 distractors, with a cryptographic False Accept Rate (FAR) of 0.03%, a False Reject Rate (FRR) of 0.90%.
[CV-237] AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps
链接: https://arxiv.org/abs/2605.16905
作者: Chia-Ying Hsieh,Hsin-Yuan Fang,Chun-Shu Wei
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Post-hoc saliency methods are widely used to interpret deep neural networks, but their faithfulness is difficult to evaluate reliably. Existing evaluations mask features according to saliency-induced feature ordering and measure performance degradation, but this degradation can be confounded by the masking operator: zero masking may create out-of-distribution artifacts, while interpolation-based masking may preserve residual predictive information. We propose Adversarial Information Masking (AIM), a saliency-guided adversarial feature replacement framework for evaluating both saliency-map faithfulness and masking-operator reliability. AIM replaces selected features with values from an adversarial counterpart of the input and compares degradation under complementary masking orders. We assess reliability using random-attribution bias and stability of explanation-method faithfulness rankings. Experiments on image, audio, and EEG tasks suggest that AIM reduces masking-induced bias compared with zero and interpolation-based masking, while revealing modality-dependent differences between signed and unsigned attributions.
[CV-238] WOW-Seg: A Word-free Open World Segmentation Model ICLR2026
链接: https://arxiv.org/abs/2605.16903
作者: Danyang Li,Tianhao Wu,Bin Li,Zhenyuan Chen,Yang Zhang,Yuxuan Li,Ming-Ming Cheng,Xiang Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026. Code and benchmark dataset are available at this https URL
Abstract:Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at this https URL.
[CV-239] CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
链接: https://arxiv.org/abs/2605.16901
作者: Houji Wen,Jiangyong Yu,Jun Li,Dawei Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder. This degradation primarily stems from the unique challenges posed by SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. Firstly, to mitigate attention dissipation, we introduce MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to preceding linear weights. Secondly, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L respectively.
[CV-240] LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
链接: https://arxiv.org/abs/2605.16899
作者: Jinzhou Tang,Sidi Liu,Waikit Xiu,Weixing Chen,Keze Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent’s experiences. Experiments demonstrate that our method achieves 2%-3.5% gains in both zero-shot generalization on standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
[CV-241] Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities IJCAI2026
链接: https://arxiv.org/abs/2605.16889
作者: Chenglizhao Chen,Yuchen Cao,Xinyu Liu,Mengke Song,Guisheng Zhang,Xiaomin Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2026
Abstract:Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from modality missing and quality imbalance. Existing methods generate features for modality missing from available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness under modality missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with ACC of 86.28% and 85.88%, and F1 of 86.24% and 85.86%.
[CV-242] Mind the Gap: Learning Modality-Agnostic Representations with a Cross-Modality UNet
链接: https://arxiv.org/abs/2605.16887
作者: Xin Niu,Enyi Li,Jinchao Liu,Yan Wang,Margarita Osadchy,Yongchun Fang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in IEEE Transactions on Image Processing. See full abstract in the PDF file
Abstract:Cross-modality recognition has many important applications in science, law enforcement and entertainment. Popular methods to bridge the modality gap include reducing the distributional differences of representations of different modalities, learning indistinguishable representations or explicit modality transfer. The first two approaches suffer from the loss of discriminant information while removing the modality-specific variations. The third one heavily relies on the successful modality transfer, could face catastrophic performance drop when explicit modality transfers are not possible or difficult. To tackle this problem, we proposed a compact encoder-decoder neural module (cmUNet) to learn modality-agnostic representations while retaining identity-related information. This is achieved through cross-modality transformation and in-modality reconstruction, enhanced by an adversarial/perceptual loss which encourages indistinguishability of representations in the original sample space. For cross-modality matching, we propose MarrNet where cmUNet is connected to a standard feature extraction network which takes as inputs the modality-agnostic representations and outputs similarity scores for matching. We validated our method on five challenging tasks, namely Raman-infrared spectrum matching, cross-modality person re-identification and heterogeneous (photo-sketch, visible-near infrared and visible-thermal) face recognition, where MarrNet showed superior performance compared to state-of-the-art methods. Furthermore, it is observed that a cross-modality matching method could be biased to extract discriminant information from partial or even wrong regions, due to incompetence of dealing with modality gaps, which subsequently leads to poor generalization. We show that robustness to occlusions can be an indicator of whether a method can well bridge the modality gap.
[CV-243] owards Generalized Image Manipulation Localization via Score-based Model ICMR2026
链接: https://arxiv.org/abs/2605.16879
作者: Yunfei Wang,Bo Du,Zhe Yang,Xin Liu,Zhiyu Lin,Tianxin Xu,Ji-Zhe Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICMR 2026. 9 pages, 4 figures
Abstract:With the rapid evolution of synthetic media, Image Manipulation Localization (IML) has emerged as a critical component in multimedia forensics for ensuring the integrity of digital content. However, generalization remains a core challenge, as existing discriminative methods typically learn a fixed decision boundary that tends to overfit to specific training artifacts and fails to adapt to unseen manipulation types. To address this, we propose DiffIML, a novel framework that introduces score-based generative modeling to IML. Diverging from the direct estimation of hard boundaries, DiffIML approximates the score function, the gradient of the log-likelihood, to capture the intrinsic geometric topology of mask distributions. This paradigm leverages structural priors to iteratively recover coherent masks from noise, thereby circumventing the brittleness associated with discriminative models. Under this formulation, diffusion models serve as an effective numerical solver for the learned score this http URL ensure practicality, we respectively resolve the efficiency and stability bottlenecks of standard diffusion by: (1) utilizing a Lightweight Mask-Specific VAE for fast latent-space process and a decoupled architecture with a lightweight denoising UNet, (2) edge supervision and error prior to mitigate error accumulation during sampling. Extensive experiments of two distinct protocols on eight non-generative and three generative benchmarks demonstrate that DiffIML consistently outperforms state-of-the-art methods, yielding remarkable generalization improvements on diverse unseen datasets. The code will be publicly available.
[CV-244] Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions
链接: https://arxiv.org/abs/2605.16877
作者: Toshinori Yamauchi,Hiroshi Kera,Kazuhiko Kawamoto
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11+8 pages, 8 figures, 6 tables
Abstract:Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited \textitfaithfulness to the evidence underlying the model’s decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier’s feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping fill the gap in faithfulness evaluation for textual explanations. Experiments show that FaithTrace yields more faithful explanations than baselines, facilitating a more accurate understanding of the model. The code will be publicly released.
[CV-245] HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction CVPR2026
链接: https://arxiv.org/abs/2605.16873
作者: Xi Liu,Weiwei Sun,Zhou Ren,Chris Broaddus,Siyu Huang,Laurent Guigues
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content – artifacts inconsistent with the input views – into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis. Our project are publicly available at \hrefthis https URLproject website.
[CV-246] Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks CVPR2026
链接: https://arxiv.org/abs/2605.16864
作者: Yachan Guo,JoseLuis Gomez Zurita,Danna Xue,Yi Xiao,AntonioManuel Lopez Pena
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to the CVPR 2026 Findings Track
Abstract:Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at this https URL.
[CV-247] Prefix-Adaptive Block Diffusion for Efficient Document Recognition
链接: https://arxiv.org/abs/2605.16861
作者: Mingxu Chai,Ziyu Shen,Chenyu Liu,Kaidi Zhang,Jiazheng Zhang,Dingwei Zhu,Zhiheng Xi,Ruoyu Chen,Jun Long,Jihua Kang,Tao Gui,Qi Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17pages,6 figures
Abstract:Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6% over the 2.5B MinerU-Diffusion.
[CV-248] VGGT-CD: Training-Free Robust Registration for 3D Change Detection
链接: https://arxiv.org/abs/2605.16859
作者: Wei Zhang(1),Songhua Li(1),Yihang Wu(1),Qiang Li(1),Qi Wang(1) ((1) Northwestern Polytechnical University, Xi’an, China)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures. Code is available at: this https URL
Abstract:3D change detection from multi-view images is essential for urban monitoring, disaster assessment, and autonomous driving. However, existing methods predominantly operate in the 2D domain, where viewpoint variations are mistaken for physical changes and depth is unavailable. While visual geometry foundation models like VGGT rapidly produce dense point clouds from unposed images, independent per-epoch reconstruction encounters fundamental obstacles: unpredictable inter-epoch scale ambiguity, registration-change paradox where scene changes corrupt alignment, and pervasive edge-flying noise. To address these challenges, we present VGGT-CD, a training-free pipeline decoupling cross-temporal registration from dynamic-change interference. In the Coarse Stage, sparse keyframe joint inference establishes a unified metric space and yields an initial Sim(3) prior. In the Fine Stage, dense reconstructions are purified by isolating static-background correspondences. A closed-form centroid alignment refines the translation while locking scale and rotation, using a residual self-check to mathematically guarantee non-degradation. Evaluated on an 11-scene benchmark from the World Across Time dataset, VGGT-CD reduces Absolute Trajectory Error by 44% outdoors and 59% indoors. It completes registration over 6 times faster, producing high-purity 3D change maps without task-specific training.
[CV-249] Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data
链接: https://arxiv.org/abs/2605.16834
作者: Shiwon Kim,Yu Rang Park
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
[CV-250] Coarse Semantic Injection for LLM -Conditioned Structured Indoor Prediction
链接: https://arxiv.org/abs/2605.16832
作者: Shuliang Zhu,Tomiwa Adey,Jinjia Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large language models (LLMs) have recently been used as structured decoders for indoor understanding from 3D point-token inputs. However, point cloud encoders often under-represent thin structural elements such as doors and windows after voxelization and sparse pooling, and may miss individual furniture instances in cluttered scenes. We propose an interface-preserving semantic augmentation for LLM-conditioned structured decoding. The key idea is to associate semantic evidence with the point-cloud representation, reduce it to a coarse four-group code (furniture, walls, openings, and others), and encode it as an RGBB point interface: red for furniture, green for walls, blue for openings, and black for others, where RGBB denotes four semantic color states represented in three RGB channels rather than an additional fourth channel. This semantic color code is appended to the original raw point attributes before tokenization, so geometry and semantics share the same sparse tokenization path while the downstream language model decoder and output serialization remain unchanged. We further introduce a lightweight routed semantic shift module, with an auxiliary head used only for training-time ratio/budget regularization and analysis, to strengthen semantic cues after sparse pooling. The overall pipeline can use RGB-derived semantic evidence. Under these controlled semantic-source settings, the reported metrics improve across Structured3D, the SpatialLM dataset, and ARKitScenes, especially for opening localization and per-instance furniture detection in cluttered scenes. Ablations clarify the roles of semantic source, color coding, token fusion, and shift injection, while also showing that color/entropy effects remain nontrivial.
[CV-251] Observation-Aligned Mask Priors for Learning Physical Dynamics from Authentic Occlusions
链接: https://arxiv.org/abs/2605.16818
作者: Chiyuan Ma,Zihan Zhou,Tianshu Yu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning physical dynamics directly from incomplete observations is challenging because authentic occlusions are structured, sample-dependent, and often missing not at random, whereas existing methods typically rely on heuristic masking rules or predefined mask distributions. We propose Observation-Aligned Mask Priors, a framework that learns the distribution of authentic observation masks and uses it to construct context-query partitions for training from incomplete data. Specifically, we pretrain a Bayesian Flow Network (BFN) on binary observation masks to capture real occlusion topologies, then guide BFN sampling with a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation. The intersection between the guided mask and the observed mask defines the context, and the remaining observed entries become query targets for a diffusion-based reconstruction model. We show that this intersection-based partitioning gives every valid observed dimension a strictly positive probability of being queried, preventing zero-query dead zones and local generative collapse. Experiments on three real-world oceanographic datasets with authentic satellite occlusions, across resolutions up to 256 \times 256, show consistent improvements over strong diffusion baselines in MSE and PSNR. These results demonstrate that learning mask priors from authentic occlusions is an effective alternative to heuristic masking for learning from incomplete physical observations without access to fully observed fields.
[CV-252] QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning
链接: https://arxiv.org/abs/2605.16813
作者: Yiheng Zhang,Zhe Zhu,Tingrui Shen,Zhuojiang Cai,Tianxiao Li,Zixing Zhao,Qiujie Dong,Zhiyang Dou,Jiepeng Wang,Le Wan,Yuwang Wang,Wenping Wang,Yuan Liu,Cheng Lin
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The generation of production-ready quad-dominant meshes is a cornerstone of modern 3D content creation. Generating anisotropic quad-dominant meshes from point clouds is challenging, as existing methods are typically limited to producing either pure triangular meshes or pure quadrilateral meshes with isotropic densities. In this paper, we present QuadLink, a unified framework consisting of three stages for quad-dominant mesh generation by linking points into structured faces. QuadLink formulates polygonal mesh generation as a hybrid centroid-conditioned vertex linking model: it first predicts a unified set of anchors (vertices and face centroids), then learns centroid-conditioned links that associate vertices with face centroids, and finally assembles polygonal faces with a quad-first strategy guided by robust geometric verification strategies. This link-based formulation enables efficient generation of sparse and anisotropic quad-dominant meshes with coherent edge flow and meanwhile supporting hybrid polygonal topology. To construct training data for this model, we further introduce a Tri-to-Quad Operator that converts artistic triangle meshes into quad-dominant training data via global merge selection. Extensive experiments show that QuadLink produces production-ready quad-dominant meshes from point clouds and achieves improved geometric fidelity and topological quality compared to prior baselines. Our method natively supports hybrid polygonal topology, generalizing to arbitrary n-gon meshes without architectural changes.
[CV-253] raining-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending
链接: https://arxiv.org/abs/2605.16810
作者: Jingqi Hou,Hongtian Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures, 3 tables
Abstract:We present a training-free framework for occluded text rendering with a pretrained FLUX.1-dev backbone. The task requires a model to render recognizable typography and place an occluding object over the intended text region. This setting remains difficult for existing text-to-image generators: the occluder often drifts away from the text, while the text may be distorted or appear to float on top of the occluding object. To address this problem, we propose a restarted dual-stream inference framework that decouples text-layout preservation from occluder insertion. A Base Stream provides a clean typographic reference and same-step key/value (K/V) features, while the Edit Stream is conditioned on the occlusion prompt. We further adopt the spectral glyph-prior idea from FreeText and adapt it to stabilize the target text structure during early-to-mid denoising. In the reasoning pass, our method localizes the target text, estimates a text-band region from token-conditioned attention and glyph support, and derives an anchor-aware hard fusion mask for the occluder. In the final edit pass, generation restarts from the same initial noise and applies hard mask-guided image-token K/V replacement at selected attention sites, preserving the Base layout outside the mask while injecting the occluder appearance from the Edit Stream inside the mask. Experiments on representative occluded text scenarios demonstrate substantially improved text readability and competitive occlusion alignment, yielding more stable object-on-text compositions without any model fine-tuning.
[CV-254] DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion
链接: https://arxiv.org/abs/2605.16807
作者: Yuhan Ping,Yuan Liu,Xiaoxiao Long,Peng Wang,Junhui Hou,Jianyi Zheng,Jia Pan,Xin Li,Cheng Lin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we introduce \textitDecoRec, a novel system designed to elevate single-view 2D images to a decomposed 3D scene mesh. Current methods for single-view scene reconstruction typically rely on object retrieval or the regression of coarse 3D voxels or surfaces, leading to inaccuracies in capturing the appearance and geometry of the input image. The lack of high-quality large-scale scene-level datasets further complicates direct 3D scene generation from single-view images. To achieve high-quality 3D scene generation from a single-view image, DecoRec takes advantage of recent diffusion-based single-view object reconstruction methods to reconstruct individual objects separately. Subsequently, a refinement pipeline is proposed to effectively merge these reconstructed objects, enhancing appearance and geometry through a differentiable rendering technique and diffusion-guided refinement. Our results demonstrate that DecoRec facilitates high-quality single-view scene reconstruction in both geometry and novel synthesis, offering significant benefits for downstream applications like room interior design.
[CV-255] Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning CVPR2026
链接: https://arxiv.org/abs/2605.16806
作者: Wen-Hsin Tsai,Chia-Ming Lee,Yuk-Ying Tung
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accetped by CVPR 2026 CVxEdu Workshop
Abstract:Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.
[CV-256] NeuroLiDAR: Adaptive Frame Rate Depth Sensing via Neuromorphic Event-LiDAR Fusion ICRA2026
链接: https://arxiv.org/abs/2605.16805
作者: Darshana Rathnayake,Dulanga Weerakoon,Meera Radhakrishnan,Archan Misra
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA2026 accepted
Abstract:LiDARs are widely used for 3D depth reconstruction, but their performance is often limited by inherent hardware constraints that impose trade-offs between range, spatial resolution, and frame rate. Many LiDAR systems typically operate at low frame rates (e.g., 5-10 Hz), prioritizing long-range sensing over responsiveness to rapid scene changes. We present NeuroLiDAR, an adaptive depth sensing framework that achieves effective frame rates of up to \approx 66 Hz by fusing temporally sparse LiDAR data with temporally dense inputs from neuromorphic event cameras. NeuroLiDAR integrates two components: event-based keyframe detection and event-guided depth extrapolation, to dynamically adjust the sensing rate in response to scene dynamics. To evaluate our approach, we introduce ELiDAR, a dataset spanning outdoor and indoor scenarios, and show that NeuroLiDAR reduces depth reconstruction error by \approx 29% in RMSE while achieving adaptive frame rates between 27.8-47.3 Hz. Our code and dataset are available at this https URL.
[CV-257] EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices
链接: https://arxiv.org/abs/2605.16797
作者: Liuchuan Yu,Erdem Murat,Beichen Wang,Yan Zeng,Tingting Luo,Huizhen Zhou,Shanghao Li,Huining Feng,Zhigen Zhao,Ning Yang,Ke Jing,Yunhao Liu,Ruoya Sheng
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at \urlthis https URL.
[CV-258] Watermarks Attack Watermarks: Re-Watermarking as a Generic Removal Strategy
链接: https://arxiv.org/abs/2605.16796
作者: Maria Bulychev,Neil G. Marchant,Benjamin I. P. Rubinstein
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures
Abstract:Watermarking combines an imperceptible change to an input image that will trigger a detector, to assert provenance and protect intellectual property. The literature has shown great interest in attacks on watermarking schemes: attackers are clearly motivated to steal copyrighted material or circumvent legislated deepfake protections. In this work, we make a simple-yet-powerful observation: that such attacks on watermarking-like watermarks themselves-seek an imperceptible change to an input image (now already watermarked) that will trigger a detector. This analogy comparing watermark attacks to watermarking itself is highly suggestive: that watermarks could be used to attack watermarks. Our first contribution validates this hypothesis. In rigorous experiments spanning 96 combinations of dataset, victim, and attack watermarks, we show that simply re-watermarking an already watermarked image reliably suppresses the original signal, without requiring gradients, surrogate models, or detection keys. Our second contribution is a simple classifier for detecting the presence and identity of an existing watermark in a given image. Surprisingly, experimental findings demonstrate outstanding overall accuracies 0.878-0.953. This result is of independent interest as a security vulnerability: research shows that method-specific attacks achieve substantially stronger removal than black-box attacks. Taken together, watermark identification combined with re-watermarking successfully reduces bit accuracy by at least 25% and up to 48%. Our work constitutes a cheap, generic, and highly effective attack pipeline, calling into question the reliability of current watermarking schemes to such a simple attack, as well as the value of existing sophisticated attacks.
[CV-259] 3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation
链接: https://arxiv.org/abs/2605.16795
作者: Hwidong Kim,Yunho Kim,Tae-Kyun Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page: this https URL
Abstract:Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image-to-3D physics through mesh reconstruction and Physically-Based Rendering, but challenges remain in modeling fluid dynamics, multi-object interactions and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high-quality videos. Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency to the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation-guided video generation. In the diverse experiments including multi-objects, and fluid interaction scenes, our method successfully bridges the gap from single-images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, VideoPhy benchmark and human evaluation.
[CV-260] Accelerating Rectified Flow Models via Trajectory-Aware Caching
链接: https://arxiv.org/abs/2605.16789
作者: Xiao Liu,Kai Liu,Naiyang Guan,Hongliang Lu,Zhixin Wang,Zhikai Chen,Renjing Pei,Yulun Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages,14 figures
Abstract:Diffusion and rectified flow (RF) models generate high-fidelity images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, yet their coarse approximations introduce accumulated errors over long skip intervals and degrade quality under aggressive acceleration. We propose TACache (Trajectory-Aware Cache), a training-free acceleration framework following a skip-then-compensate paradigm. TACache performs an orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. The framework operates in two stages: offline, cumulative variation thresholds on the magnitude and direction indicators yield the skip schedule and bound how far each skip interval may extend; online, at each skipped step the offline statistics are combined with the sample’s historical orthogonal direction to reconstruct the skipped velocity without additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B show that TACache achieves up to 4.14 speedup on text-to-image generation and 2.11 speedup on text-to-video generation, with consistent improvements over prior cache-based methods on all reference-based fidelity metrics. Code will be released soon.
[CV-261] Encoding Robust Topological Signatures for Hyperdimensional Computing
链接: https://arxiv.org/abs/2605.16785
作者: Arpan Kusari
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives-most notably holes-from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that Topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at this https URL.
[CV-262] A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis
链接: https://arxiv.org/abs/2605.16779
作者: Mingyang Zhao,Sipu Ruan,Xiaohong Jia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, Code: this https URL
Abstract:This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between pairwise computations of clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iteration optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical certificate of convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at this https URL.
[CV-263] VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment
链接: https://arxiv.org/abs/2605.16775
作者: Amy Makawana,Abhijeet Parida,Marius George Linguraru,Julia Ive,Syed Muhammad Anwar
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at EMBC 2026
Abstract:Self-supervised learning (SSL) has advanced medical image analysis be enabling learning form large unlabelled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation of classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present Volta-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. Volta-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenges existing SSL approaches. We evaluate Volta-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer’s disease versus healthy controls. Across all tasks, representations learned by Volta-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall VolTA-3D supports effective multi-task downstream performance with task-specific pertaining, a step towards generalizable and clinically viable 3D models.
[CV-264] CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris
链接: https://arxiv.org/abs/2605.16774
作者: Zaid Aljundi,Zahra F. Rahmatullah,Mostafa Elemam,Abdullah Moosa
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in the 2025 8th International Conference on Signal Processing and Information Security (ICSPIS). Published and available to view on IEEE Xplore
Abstract:Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents, an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations were benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset’s value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy under, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile, single-can pickup with approach and grab, YOLOv11 + SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.
[CV-265] GLT-PEFT: Gated Lie-Tucker Parameter-Efficient Fine-Tuning for Alzheimers Disease Diagnosis with Hippocampal Segmentation Pretraining
链接: https://arxiv.org/abs/2605.16769
作者: Guanghua He,Hancan Zhu,Gaohang Yu,An Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a promising paradigm for adapting pretrained models under limited data conditions. However, most existing PEFT methods are designed for matrix-structured parameters and are not well suited for high-dimensional convolutional kernels in medical imaging models. Moreover, they typically rely on additive updates and lack mechanisms to preserve the geometric structure of pretrained parameters, while multiplicative (geometry-aware) updates are difficult to integrate within a unified framework. To address this issue, this paper proposes GLT-PEFT, a gated Lie-Tucker parameter-efficient fine-tuning framework for Alzheimer’s disease (AD) diagnosis. The proposed approach transfers a hippocampal segmentation pretrained model to a downstream classification task. Tucker decomposition enables tensor-aware low-rank adaptation of 3D convolutional kernels, while Lie group-based transformations provide structure-preserving multiplicative updates. A gating mechanism further reconciles additive and multiplicative update forms, resulting in a unified and more stable fine-tuning strategy. Extensive experiments demonstrate that GLT-PEFT achieves effective cross-task transfer while significantly reducing trainable parameters, highlighting its effectiveness for efficient and robust adaptation in medical imaging models.
[CV-266] Axial-Relation Guided Fusion State Space Model for Optical-Elevation Sensing Image Segmentation
链接: https://arxiv.org/abs/2605.16768
作者: Feng Gao,Zhilin Jin,Yanhai Gan,Junyu Dong,Qian Du
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by IEEE GRSL 2026
Abstract:Semantic segmentation of multi-source remote sensing images is a fundamental task for Earth observation applications. Existing methods often struggle with insufficient multi-scale context modeling and suboptimal cross-modal feature fusion, limiting their performance in complex high-resolution scenes. To this end, we propose Axial-Relation Guided Fusion Mamba (ARG-Mamba), a state space model-based framework for optical-elevation remote sensing image segmentation. Specifically, we introduce a Multi-Scale State Space Module to capture both fine-grained local details and global contextual dependencies with linear computational complexity. Moreover, an Axial-Relation Guided Fusion Module is designed to explicitly model global cross-modal correlations along horizontal and vertical axes, enabling efficient feature fusion between optical and elevation modalities. Extensive experiments conducted on the ISPRS Vaihingen and Potsdam datasets demonstrate that our ARG-Mamba consistently outperforms state-of-the-art methods while maintaining favorable computational efficiency. The code will be made publicly available at \urlthis https URL.
[CV-267] Synthetic Aperture Radar Image Change Detection Based on Global Dynamic Context-Aware Network
链接: https://arxiv.org/abs/2605.16764
作者: Baogui Huan,Chuanzheng Gong,Dezhong Chen,Feng Gao,Junyu Dong,Qian Du
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by IEEE JSTARS 2026
Abstract:Convolutional neural networks (CNNs) have been extensively and successfully applied to the task of synthetic aperture radar (SAR) image change detection. However, conventional convolutional layers are inherently limited by their local receptive fields, which mainly capture spatially localized patterns while neglecting the global context that is often crucial for accurately distinguishing subtle or large-scale changes in SAR imagery. To address these limitations, we propose a novel Global Dynamic Context-Aware Network (GDNet) specifically tailored for SAR image change detection. At the core of our approach lies a novel global dynamic convolution module, which adaptively modulates convolution kernel weights according to the global semantic information extracted from the input features. By dynamically incorporating long-range dependencies, this mechanism enables the network to integrate both local detail and global context, thus improving its ability to detect diverse change patterns. In addition, we introduce a carefully designed two-stage Mixup strategy for model training. Unlike conventional single-stage Mixup, our two-stage design generates more diverse and informative training samples, effectively regularizing the model and yielding more stable and reliable classification results even under limited data scenarios. Extensive experiments on three SAR datasets demonstrate the superiority of the proposed GDNet compared to other state-of-the-art methods. These findings highlight the potential of global dynamic modeling and advanced data augmentation strategies for advancing SAR image interpretation. Source codes are available at \urlthis https URL.
[CV-268] EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
链接: https://arxiv.org/abs/2605.16745
作者: Zongyuan Yang,Mingjing Yi,Wanli Ma,Chenzhuo Fan,Bocheng Li,Baolin Liu,Yuke Lou,Yingde Song,Yongping Xiong,Zhengdong Guo,Shimu Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 10 figures, 6 tables. Technical report
Abstract:This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ( E_\mathrmund ) and a structurally mirrored Generation Expert ( E_\mathrmgen ), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: this https URL
[CV-269] Diffeomorphic Cortical Alignment via Direct Warping of Streamline Endpoints
链接: https://arxiv.org/abs/2605.16742
作者: Yang Xiang,Martin Cole,Zhengwu Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
备注:
Abstract:Cortical surface registration is often driven by local geometric descriptors (e.g., sulcal depth and curvature). While this approach achieves geometric correspondence, it neglects the long-range wiring constraints imposed by white-matter anatomy. Diffusion MRI tractography offers these crucial constraints; however, prior connectivity-informed pipelines typically align precomputed connectivity matrices, making the optimization highly sensitive to connectivity estimation and its resolution. In this paper, we introduce a novel connectivity-based surface registration method that aligns cortical surfaces by operating directly on white-matter fiber-tract endpoints. We model tract endpoints as a point cloud on the product manifold \Omega \times \Omega , where \Omega represents the spherical domain of the inflated cortical hemispheres. Our alignment method iteratively (i) computes a small diffeomorphic warp for \Omega by minimizing connectivity mismatch, and (ii) updates the endpoints based on this warp. The method relies on a geometric framework that ensures output warps are diffeomorphisms and has a final goal that optimizes the matching of well-known fiber bundles. Experiments on Human Connectome Project (HCP) data demonstrate improved tract-level correspondence, achieving higher connectivity-level overlap coefficients on major fiber bundles and stronger robustness across grid resolutions for \Omega compared to state-of-the-art methods such as ENCORE and MSMAll.
[CV-270] RACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation ACL2026
链接: https://arxiv.org/abs/2605.16740
作者: Pengyu Yan,Akhil Gorugantu,Mahesh Bhosale,Abdul Wasi,Vishvesh Trivedi,David Doermann
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ACL 2026 Workshop
Abstract:Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are subsequently used to steer LVLM-based claim generation and cross-video citation consolidation. Experiments on MAGMaR 2026 and WikiVideo demonstrate that structured grounding markedly boosts factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 compared to an unguided Qwen3-VL-30B baseline, with especially strong improvements in citation recall from 0.440 to 0.628. The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard.
[CV-271] DriveSafer: End-to-End Autonomous Driving with Safety Guidance
链接: https://arxiv.org/abs/2605.16737
作者: Shounak Sural,Raj Rajkumar
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.
[CV-272] CAB: Accelerating Flow and Diffusion Sampling via Rectification and Corrected Adams-Bashforth
链接: https://arxiv.org/abs/2605.16736
作者: Anuska Roy,Pravin Nair
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Flow and diffusion models achieve high-fidelity, high-resolution image synthesis, but often require many function evaluations (NFEs) at sampling time. Existing acceleration methods either require additional training through distillation or rely on training-free high-order solvers, and both can degrade sample quality at low NFE budgets. We propose CAB (Corrected Adams-Bashforth), a training-free sampler that accelerates both flow and diffusion models. CAB first transforms the sampling dynamics to a common rectified coordinate system, and then applies a multistep Adams-Bashforth predictor augmented with a simple correction term based on past velocity evaluations and therefore incurs no additional NFEs. The resulting method is simple, has the same algorithmic form across model classes, and has at least third-order local truncation error and second-order global error. Experiments on pretrained flow and diffusion models, including class-conditional and large-scale text-to-image benchmarks, show that CAB improves quality-NFE trade-offs in the low-step regime of 6-20 NFEs. It also remains competitive with strong training-free samplers at higher step counts across most tested models. The official implementation is available at this https URL.
[CV-273] DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers
链接: https://arxiv.org/abs/2605.16732
作者: Sayeh Sharify,Mahsa Salmani,Hesham Mostafa
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-\Sigma over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
[CV-274] Compositional Adversarial Training for Robust Visual Watermarking
链接: https://arxiv.org/abs/2605.16720
作者: Anirudh Satheesh,Michael-Andrei Panaitescu-Liess,Andrew Xu,Georgios Milis,Heng Huang,Zikui Cai,Furong Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Robust watermarking is typically trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and rarely encounters the rare compositions that actually break detection. This leads to unstable training and poor sample efficiency. We instead formulate watermark robustness as a min-max problem over a structured space of compositional transformations. We propose Compositional Adversarial Training (CAT), a plug-in framework that learns a sequential differentiable adversary that observes the current watermarked image and selects an attack family at each step to maximally disrupt message recovery. CAT combines a straight-through Gumbel-Softmax attack selection with entropy regularization, allowing the backward pass to be end-to-end differentiable and aggregate gradient information across attack families, yielding faster, smoother convergence without collapsing to a single attack mode. We evaluate CAT on post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and in-generation WMAR under both single-step and two-step attack suites, on in-distribution and multiple out-of-distribution image and video benchmarks. CAT consistently outperforms random-augmentation baselines trained with the same augmentation budget, with the largest gains on hard composed attacks and OOD evaluations; improving overall watermark capacity by up to 63.5% in the single-step attack setting and 13.0% in the compositional setting. In the autoregressive setting, CAT improves the TPR@FPR =1% by 12% on average on difficult geometric transformations. These results show that robust visual watermarking benefits from training against adaptive compositional adversaries rather than independent random corruptions.
[CV-275] MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation
链接: https://arxiv.org/abs/2605.16716
作者: Shuowei Li,Yuming Zhao,Parth Bhalerao,Oana Ignat
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: [14] pages, [6] figures, [11] tables, appendix included. Preprint
Abstract:Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and videoquality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available athttps://github.com/AIM-SCU/CRAFT
[CV-276] GeoWorld-VLM: Geometry from World Models for Vision-Language Models
链接: https://arxiv.org/abs/2605.16713
作者: Renjie Gu,Kaichen Zhou,Yan Luo,Mengyu Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model’s linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What’sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.
[CV-277] Face inpainting with Identity Preserving Latent Diffusion Models
链接: https://arxiv.org/abs/2605.16696
作者: João Santos,Carlos Santiago,Manuel Marques
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face inpainting techniques recover missing or occluded facial regions in a visually realistic manner, but preserving the identity in the final output remains a fundamental challenge. Identity consistency is crucial for downstream applications such as face recognition, digital forensics, and human-computer interaction, where even subtle identity distortions can significantly degrade performance or trust. Although diffusion-based generative models have recently achieved remarkable progress in image inpainting, they often struggle to faithfully retain individual-specific facial characteristics. On the other hand, existing identity-aware methods typically rely on costly fine-tuning, auxiliary supervision, or exhibit limited robustness to diverse occlusions, poses, and facial variations. To address these limitations, we propose ID-ControlNet, an identity-preserving face inpainting framework built upon latent diffusion models. Based on ControlNet architecture, our approach conditions the diffusion process on facial identity embeddings extracted from a pretrained face recognition network. This design enables reconstruction of occluded facial regions while maintaining global facial coherence and identity fidelity. Furthermore, we introduce an identity consistency and triplet loss training strategy that explicitly enforces alignment between the generated face and the target identity representation. Extensive experiments on CelebA-HQ, FFHQ, and on a new E-Mask dataset demonstrate that ID-ControlNet significantly improves identity preservation over standard diffusion-based inpainting methods, achieving performance comparable to SOTA identity-aware approaches.
[CV-278] Multi-Object Tracking Consistently Improves Wildlife Inference
链接: https://arxiv.org/abs/2605.16672
作者: Mufhumudzi Muthivhi,Jiahao Huo,Fredrik Gustafsson,Terence L. van Zyl
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in IEEE 2026 29th International Conference on Information Fusion
Abstract:Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences. The predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves over a standalone classifier over all datasets and for each metric. Specifically, the best-performing MOT models gain a weighted F1-Score of 5.1%, 3.1% and 2.0% over the classifier across three MOT datasets.
[CV-279] Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents
链接: https://arxiv.org/abs/2605.16671
作者: Jiaxing Li,Hao Fang,Chi Xu,Miao Zhang,Jiangchuan Liu,William I. Atlas,Katrina M. Connors,Mark A. Spoljaric
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 10 pages
Abstract:Rapid biodiversity loss underscore the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resource, which requires continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We uses an explicit knowledge base to replace implicitly encoding expert knowledge into model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.
[CV-280] Right Predictions Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations
链接: https://arxiv.org/abs/2605.16651
作者: Narges Babadi,Hadis Karimipour
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model’s original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.
[CV-281] AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling
链接: https://arxiv.org/abs/2605.16649
作者: Ziyang Mai,Yuyao Zhang,Yu-Wing Tai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation where a continuous scene must preserve global temporal coherence, and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlaVid, a decoupled global-local framework for efficient UHR long video generation. AtlaVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality and asymmetric global-local attention injects aligned semantic guidance and preserves the model’s pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (10s) video synthesis. Experiments show that AtlaVid substantially improves the efficiency of ultra-high-resolution long video generation, achieving high-quality UHR long video generation with 60.9x speed up and significantly less training cost and even better performance than native 4K video generators.
[CV-282] SCARED-C: Corrected Camera Poses for Endoscopic Depth Estimation
链接: https://arxiv.org/abs/2605.16628
作者: John J. Han,Adam Schmidt,Max Allan,Jie Ying Wu,Omid Mohareri
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The SCARED dataset is a widely used benchmark for endoscopic depth estimation, offering ground-truth 3D reconstructions captured with a structured light sensor. However, the depth maps for non-keyframe images rely on robot kinematics that introduce substantial pose errors, limiting the reliably labeled portion of the dataset to 35 keyframes. We present SCARED-C, a corrected version of the SCARED dataset that expands the number of reliable RGB-D pairs from 35 to 17,135. Our pipeline applies COLMAP, a Structure-from-Motion system, to re-estimate camera poses for all frames, followed by a scale recovery step that aligns the resulting reconstructions to metric space using the ground-truth keyframe depth maps. We validate the corrected poses through (1) stereo disparity evaluation and (2) monocular depth estimation experiments. The corrected dataset and code are publicly released to the community.
[CV-283] Controlla: Learning Controllability via Graph-Constrained Latent Geometry
链接: https://arxiv.org/abs/2605.16603
作者: Jamuna S. Murthy,Amin Karimi Monsefi,Rajiv Ramnath
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.
[CV-284] ArtMesh: Part-Aware Articulated Mesh Fields with Motion-Consistent Dynamics
链接: https://arxiv.org/abs/2605.16582
作者: Sylvia Yuan,Dan Wang,Ravi Ramamoorthi,Xinrui Cui
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present ArtMesh, a mesh-native method for reconstructing articulated objects explicitly as connected triangle meshes with per-part rigid motion from multi-view images in start and end states. Existing 3D Gaussian Splatting pipelines for articulated reconstruction inherit the unstructured point-based geometry of their splatting base, which provides no surface topology for reasoning about part boundaries or enforcing motion consistency along the object’s connectivity. ArtMesh instead builds on a mesh-based differentiable rendering backbone, enabling part-aware dynamics to act directly on the structured topology. To make the topology compatible with articulation, we introduce part-aware restricted Delaunay remeshing, producing connected submeshes whose triangles do not cross semantic part boundaries. The dynamic mesh field then optimizes articulation using bidirectional Vertex-wise Motion Consistency on transported mesh vertices and Pixel-wise Motion Consistency on rendered RGB-D observations. We introduce Articulate-100, a new benchmark of 100 articulated objects spanning 16 PartNet-Mobility categories. On this benchmark, ArtMesh outperforms prior 3DGS-based pipelines in joint parameter estimation and part-level geometric reconstruction, with the largest gains on objects with many movable parts.
[CV-285] Attend Locally Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
链接: https://arxiv.org/abs/2605.16579
作者: Kunyang Li,Mubarak Shah,Yuzhang Shang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to 2.26 wall-clock speedup and 54% memory reduction, while maintaining comparable quality with improving temporal consistency.
[CV-286] riALS: Triphasic-Aided Liver Lesion Segmentation Benchmark in Non-Contrast CT MICCAI2024
链接: https://arxiv.org/abs/2605.16572
作者: Marawan Elbatel,Mohamed Ghonim,Jiaji Mao,Zhuosheng Lin,Katharina Eckstein,Andrés Martínez Mora,Jonathan Deissler,Maximilian Rokuss,Constantin Ulrich,Zdravko Marinov,Wenhui Deng,Baoxun Li,Huijun Hu,Jun Shen,Mohanad Ghonim,Khadiga Omar Nassar,Mariam Elbakry,Menna Dyab,Amr Muhammad Abdo Salem,Nouran Elghitany,Noha Elghitany,Yi Qin,Xuanqi Huang,Haonan Wang,Shao-Woo Yen,Ahmed Elghamry Saba,Salma Ahmad,Xinyan Fang,Jiahao Zhang,Xiaodi Wang,Xinghua Ma,Gongning Luo,Jessica C. Delmoral,João Manuel R.S. Tavares,Ankan Deria,Adinath Dukre,Yutong Xie,Imran Razzak,Dongwook Kim,Matthew Choi,Hanxiao Zhang,Minghui Zhang,Xin You,Abdul Qayyum,Steven A. Niederer,Moona Mazher,Rachika E. Hamadache,Ricardo Montoya-del-Angel,Robert Martí,Xavier Lladó,Toufiq Musah,Livingstone Eli Ayivor,Enrique Almar-Munoz,Agnes Mayr,Kaouther Mouheb,Esther E. Bron,Stefan Klein,Ahmed Abouelhoda,Amira Adel,Susan Adil Ali,Rainer Stiefelhagen,Klaus H. Maier-Hein,Fabian Isensee,Aya Yassin,Xiaomeng Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TriALS challenge paper across MICCAI 2024 and 2025; data and code at this https URL
Abstract:Automated segmentation of liver lesions on non-contrast computed tomography (NCCT) is clinically important but fundamentally challenging, particularly in low-resource settings across Africa and Asia where contrast agents are frequently unavailable. Progress has been limited by the absence of annotated NCCT benchmarks. Here we describe the TriALS challenge for automated liver lesion segmentation under contrast-limited conditions, supported by a multi-centre dataset of 150 cases with four-phase CT acquisitions (600 volumes) from Egyptian and Chinese institutions. Algorithms were evaluated on 70 cases from three institutions, including an independent external cohort. The top-performing method achieved a mean venous-phase Dice of 0.754, consistent with human-level performance, yet dropped to 0.57 on NCCT. On external validation, the leading method outperformed off-the-shelf models by up to 28% in Dice on NCCT. Algorithm performance was most strongly predicted by training data scale and pre-training strategy. A cross-year comparison exposed a persistent perceptual barrier on NCCT that scaling pre-training alone cannot overcome. Data, annotations, and code are available at this https URL.
[CV-287] Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition ICIP2026
链接: https://arxiv.org/abs/2605.16550
作者: Luiz G F Carreira,Breno A Mariano,Victor H C de Melo,David Menotti,William Robson Schwartz
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICIP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. DOI to be added upon publication
Abstract:Video periocular recognition is the task of recognizing an individual’s identity based on the region around an individual’s eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, consistently outperforming naive aggregation schemes. In the best scenario, the approach achieves 99.8% of TPR@ 1e^-1 and 96.6% of Rank-5.
[CV-288] SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation
链接: https://arxiv.org/abs/2605.16530
作者: Ssharvien Kumar Sivakumar,Akwele Johnson,Anirudh Dhingra,Yannik Frisch,Ghazal Ghazaei,Anirban Mukhopadhyay
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: this https URL
[CV-289] DepthPolyp: Pseudo-Depth Guided Lightweight Segmentation for Real-Time Colonoscopy ICPR2026
链接: https://arxiv.org/abs/2605.16519
作者: Zhuoyu Wu,Wenhui Ou,Lexi Zhang,Pei-Sze Tan,Dongjun Wu,Junhe Zhao,Wenqi Fang,Raphaël C.-W. Phan
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: This paper has been accepted to the International Conference on Pattern Recognition (ICPR 2026)
Abstract:Accurate polyp segmentation in colonoscopy is essential for early colorectal cancer detection, yet real-world clinical environments pose persistent challenges such as motion blur, specular reflections, and illumination instability. Most existing methods are optimized on clean benchmark images and suffer noticeable performance degradation when deployed in authentic surgical scenarios. We propose DepthPolyp, a lightweight and robust segmentation framework based on pseudo-depth-guided multi-task learning and efficient feature modulation. The architecture combines hierarchical Ghost factorization for compact feature generation, Interleaved Shuffle Fusion for low-cost cross-scale interaction, and Dynamic Group Gating for adaptive group-wise feature weighting. Extensive experiments demonstrate that DepthPolyp achieves strong cross-dataset generalization when trained on degraded data and evaluated on both clean and noisy target domains, consistently outperforming lightweight baselines and remaining competitive with substantially larger models. In real surgical video evaluation on PolypGen, DepthPolyp achieves better segmentation performance than models up to 20\times larger while preserving real-time inference speed. With only 3.57M parameters and 0.86 GMACs, the proposed method runs at over 180 FPS on mobile devices, making it well suited for real-time deployment in resource-constrained clinical environments. Code and pretrained weights are available at: this https URL
[CV-290] SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability
链接: https://arxiv.org/abs/2605.16515
作者: Amin Karimi Monsefi,Abolfazl Meyarian,Mridul Khurana,Shuheng Wang,Pouyan Navard,Cheng Zhang,Anuj Karpatne,Wei-Lun Chao,Rajiv Ramnath
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Animals are described as effectively camouflaged when they blend seamlessly with their surrounding, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well-camouflaged animal is one that remains difficult to detect even when its category is known. We introduce SeamCam (Seamless Camouflage), a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category-conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground-truth mask. The SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage (i.e., lower detectability). In a human two-alternative forced-choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming state-of-the-art by about 25%. We then demonstrate SeamCam’s utility as a preference signal for Direct Preference Optimization (DPO) to fine-tune a diffusion-based inpainting model for camouflage generation. This offers an affordable training approach with an objective explicitly suited for camouflage generation, unlike typical diffusion models. To support rigorous benchmarking, we further introduce CamFG-1.5k, a curated dataset of 1,521 high-resolution images in which animals are fully visible prior to camouflage generation, enabling unbiased evaluation by controlling for occlusion artifacts present in existing datasets. this https URL
[CV-291] Visual Agent ic Memory: Enabling Online Long Video Understanding via Online Indexing Hierarchical Memory and Agent ic Retrieval
链接: https://arxiv.org/abs/2605.16481
作者: Aiden Yiliu Li,Nels Numan,Anthony Steed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%). These results suggest that long-horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at this https URL.
[CV-292] Seeking the Unfamiliar but Memorable: Conceptual Creativity as Meta-Learning
链接: https://arxiv.org/abs/2605.16477
作者: Mengye Ren
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages
Abstract:What does it mean to create a new concept, rather than retrieve a familiar one? Repeatedly sampling a generative model at the same prompt produces variations with similar styles and typical content. We propose that creativity is the production of stimuli that are unfamiliar to an adaptive observer at first sight, but quickly learnable from a few exposures. We formalize this as a Creator-Appraiser pair: a Creator generates a candidate, an Appraiser adapts to it for a few inner-loop learning steps, and the Appraiser’s improvement becomes the reward the Creator optimizes through. We instantiate the framework with diffusion as the Creator, an autoencoder Appraiser on MNIST, and a CLIP Appraiser with a low-rank adapter for natural images. The diffusion model remains frozen with no additional language conditioning; the meta-learning gradient is enough to produce both stylistic variations and concept compositions that the base model does not generate on its own.
[CV-293] MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation
链接: https://arxiv.org/abs/2605.16464
作者: Hanjun Tao,Hua Wang,Fan Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 4 tables
Abstract:Brain tumors exhibit high heterogeneity in morphology and multimodal contrast, making manual slice-by-slice de lineation time-consuming and experience-dependent, thus necessitating efficient and stable automated segmentation methods. To address the limitations of CNNs in modeling long-range dependencies, and the heavy computational and memory overhead and inter-block contextual in coherence of Transformers in 3D MRI, this paper proposes Multi-Head Mamba (MHMamba). This method combines a U-shaped architecture with a multi-head state-space model (Mamba), splitting the channel dimension into parallel SSM heads and aggregating them with residuals. This enhances long-range representation and improves the stability of multimodal training while maintaining linear complexity. To further align statistics and enhance lesion response, we designed a channel-space calibration module for multi-head outputs and introduced an adaptive fusion mechanism at skip connections to dynamically connect global semantics with local details, thereby improving boundary consistency and the detection of small-volume lesions. We conducted experiments and ablations on BraTS2021 and BraTS2023. The results showed that MHMamba achieved stable and significant improvements in overall accuracy, boundary smoothness, and sensitivity to tumor core and small-volume enhancement areas, while preserving the linear-complexity advantage of Mamba-based modeling, thus verifying the effectiveness and versatility of the method.
[CV-294] REC-RL: Referring expression counting via Gaussian and range-based reward optimization
链接: https://arxiv.org/abs/2605.16460
作者: Hui Liu,Yunlai Teng,Kunlong Bai,Pengfei Qi,Haotian Yan,Liang Li,Junlan Feng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages
Abstract:Referring expression counting (REC) is an intention-driven task that requires context-aware visual reasoning. While recent vision-language models incorporate language for visual understanding, most existing REC methods rely on rulebased reinforcement learning with rewards focused primarily on final accuracy, overlooking the quality of intermediate reasoning. We propose REC-RL, a reinforcement learning framework that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. RECRL employs Group Relative Policy Optimization and two lightweight rewards: an accuracy reward that combines range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs. By modeling intermediate focus prediction as internal decision-making, REC-RL avoids additional annotations and better aligns with human perception. Extensive experiments demonstrate consistent improvements over strong baselines and robust generalization across benchmarks.
[CV-295] Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery
链接: https://arxiv.org/abs/2605.16458
作者: Weijun Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint manuscript, 16 pages, 4 figures, 3 tables. This manuscript presents a residual-bounded 2.5D CT/CTA restoration framework for conservative medical image enhancement and evaluates it using image-recovery, baseline comparison, Monte Carlo stability, anatomical localization, and external low-dose CT testing
Abstract:Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.
[CV-296] Identifiable Token Correspondence for World Models
链接: https://arxiv.org/abs/2605.16457
作者: Youngin Kim(1),Ray Sun(2),Inho Kim(2),Bumsoo Park(3),Hyun Oh Song(1 and 2) ((1) Interdisciplinary Program in Artificial Intelligence, Seoul National University, (2) Department of Computer Science and Engineering, Seoul National University, (3) KRAFTON)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Transformer-based world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without explicitly modeling correspondence between tokens across time. We formulate next-frame prediction as a structured probabilistic inference problem with latent token correspondence variables, deriving a model in which each next-frame token is explained either by copying a token from the previous frame or by generating a new token. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on this https URL.
[CV-297] Multi-hop Relational Contrastive Learning: Extending Spatial Contrastive Pre-training Beyond Pairwise Relations
链接: https://arxiv.org/abs/2605.16456
作者: Sheikh Tanvir Ahmed,Md. Tanvir Raihan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding how objects relate to each other in space is fundamental to scene understanding, yet most contrastive pre-training approaches only model pairwise relationships, leaving richer compositional and multi-hop interactions largely unexplored. We introduce Multi-Hop Relational Contrastive Learning (MRCL), a framework that extends spatial contrastive learning to graph-structured scene representations. By tracing k-hop paths through scene graphs built from detected objects, MRCL captures implicit spatial dependencies that go well beyond what direct object pairs can express. We define a multi-level contrastive objective spanning nodes, edges, and multi-hop paths, encouraging embeddings that remain stable across object semantics while staying responsive to spatial layout. On a GQA subset, MRCL produces spatially-aware representations that improve content-based graph retrieval (NDCG@5 = 0.748) and consistently benefit downstream tasks, including spatial relationship recognition and graph-based question answering. Together, these results suggest that multi-hop relational supervision offers substantially richer structural guidance than pairwise-only methods, leading to visual representations that are more robust, compositional, and geometry-aware.
[CV-298] Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images
链接: https://arxiv.org/abs/2605.16444
作者: Liangrui Pan,Jiadi Luo,Yuxuan Xiao,Chenchen Nie,Xiaoshuai Wu,Songqing Fan,Ling Chu,Manqiu Li,Rongfang He,Zhenyu Zhao,Ruixing Wang,Shulin Liu,Yiyi Liang,Xiang Wang,Qingchun Liang,Shaoliang Peng
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by Nature Communications
Abstract:Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.
[CV-299] Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification
链接: https://arxiv.org/abs/2605.16440
作者: Daniel Brignac,Fengwei Tian,Banafsheh Latibari,Abhijit Mahalanobis,Ravi Tandon
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep neural networks are vulnerable to adversarial perturbations, limiting deployment in safety-critical applications such as synthetic aperture radar (SAR) automatic target recognition (ATR). Randomized smoothing improves robustness by averaging predictions over noisy inputs, but isotropic noise often fails to preserve the semantic structure of SAR imagery. We propose semantic smoothing, a defense that replaces noised-based perturbations with structured randomized transformations generated by a novel view synthesis model. For SAR, we condition on acquisition geometry to synthesize multiple plausible radar views. Predictions across generated randomized views are aggregated to form a robust classifier. Experiments show that semantic smoothing improves robustness against standard attacks, such as FGSM and PGD, and SAR-specific attacks, such as OTSA and SMGAA, while also increasing clean classification accuracy. These results demonstrate that randomized smoothing via semantically preserving geometric transformations is a promising alternative to isotropic noise for adversarial defense in structured sensing domains.
[CV-300] KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy
链接: https://arxiv.org/abs/2605.16439
作者: Yingbing Huang,Tharun Adithya Srikrishnan,Steven K. Reinhardt,Deming Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16439 [cs.CV] (or arXiv:2605.16439v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.16439 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-301] CT-DegradBench: A Physics-Informed Benchmark for CT Degradation Detection and Severity Estimation CVPR2026
链接: https://arxiv.org/abs/2605.16431
作者: Yousra Nabila Taifour,Marouane Tliba,Zuheng Ming,Marie Luong,Nour Aburaed,Aladine Chetouani,Gorkem Durak,Alessandro Bruno,Faouzi Alaya Cheikh,Habib Zaidi,Ulas Bagci,Azeddine Beghdadi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026 VISION Workshop (DEXTER track)
Abstract:Computed tomography (CT) images are frequently degraded by acquisition artifacts, including noise, blur, streaking, aliasing, and metal artifacts. Yet CT enhancement is still largely evaluated using image quality metrics with limited perceptual and clinical validity, while existing datasets remain focused on isolated restoration tasks, hindering unified benchmarking across diverse degradation types. We present CT-DegradBench, a dataset and benchmark for CT degradation detection and severity estimation under controlled single- and mixed-artifact settings. CT-DegradBench enables systematic evaluation across multiple degradation families and severity levels within a common experimental framework. We further propose SeSpeCT (Semantic-Spectral CT degradation estimation), a framework that combines semantic priors from medical vision-language models with complementary frequency-domain cues for artifact analysis. SeSpeCT constructs a training-free semantic quality axis in the multimodal embedding space using radiology-informed text prompts, without task-specific fine-tuning, and combines it with spectral features that capture degradation-specific frequency patterns. The resulting representation enables joint prediction of artifact type and severity. Experimental results show that SeSpeCT consistently outperforms the evaluated baselines under both single- and mixed-degradation settings. The framework is available at this https URL.
[CV-302] EAGT: Echocardiography Augmentation for Generalisability and Transferability
链接: https://arxiv.org/abs/2605.16427
作者: Soroush Elyasi,Sara Adibzadeh,Nasim Dadashi Serej,Julie Wall,Massoud Zolgharni
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning models for echocardiography segmentation often struggle to generalise across institutions, scanners, and patient populations, where collecting large, consistently annotated datasets is infeasible. Data augmentation is widely used to improve the robustness of deep learning models; however, its role in enhancing cross-dataset generalisability in echocardiography remains insufficiently understood. This study presents a large-scale multi-dataset evaluation of 29 data augmentation techniques and their pairwise combinations for 2D left ventricular segmentation using a U-Net trained on Unity, CAMUS, and EchoNet Dynamic datasets. Each augmentation was explored under several hyperparameter settings and assessed through repeated runs using Dice and IoU in both in-domain and cross-dataset scenarios, with statistical significance quantified via independent t-tests. Results show that anatomically plausible geometric transformations, particularly affine, shift-scale-rotate, perspective, and random horizontal flip, substantially improve cross-dataset performance, whereas aggressive intensity- or artefact-based augmentations often degrade generalisability. Pairwise augmentation combinations outperform individual augmentations and show that moderate flip-centric combinations, especially random horizontal flip with affine, yield consistent gains across most transfer scenarios. These findings provide empirically grounded guidance for designing augmentation policies that enhance the robustness and transferability of echocardiography segmentation models.
[CV-303] Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization
链接: https://arxiv.org/abs/2605.16423
作者: Peilin Sun,Jianxin Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Network quantization has emerged as one of the most practical model compression techniques, which significantly reduces a model’s memory and compute consumption by mapping floating-point numbers to low-bit representations. However, existing quantization methods typically suffer from the speed-accuracy tradeoff and limited generalization. To address these issues, recent compensation-based methods offer an efficient yet general solution by introducing additional lightweight linear layers into the quantized network. However, the accuracy of these methods suffers from their limited compensation capability and high sensitivity to outliers. In this paper, we propose Nonlinear Bipolar Compensation (NBC), a post-training quantization approach that introduces nonlinear compensation to reduce the effect of outliers. We further design Bipolar Logarithmic Transformation (BLT), which compresses outliers by mapping both the quantized input and the quantization error into a transformed space. A simple linear layer is then applied for compensation in the transformed space, preserving the efficiency of our method. Extensive experiments across various tasks, models, and quantization methods confirm the effectiveness, efficiency, robustness, and generality of our NBC approach.
[CV-304] Video Reconstruction using Diffusion-based Image-to-Video Generation with Trajectory Guidance MDM2026
链接: https://arxiv.org/abs/2605.16420
作者: Stelio Bompai,Ioannis Kontopoulos,Giannis Spiliopoulos,Dimitris Zissis,Konstantinos Tserpes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the 1st Workshop on Multi-Sensor Trajectory Knowledge Discovery and Extraction (MuseKDE 2026), co-located with the 27th IEEE International Conference on Mobile Data Management (IEEE MDM 2026)
Abstract:This paper addresses the problem of reconstructing missing or dropped frames in top-down drone video of autonomous surface vehicles performing structured maritime manoeuvres. We propose a pipeline that converts raw GPS telemetry and a single reference frame into a trajectory-guided video sequence using a pre-trained image-to-video diffusion model, requiring no domain-specific fine-tuning. GPS coordinates from onboard telemetry logs are projected into image space via an equirectangular mapping, producing per-vessel motion cues that condition the SG-I2V diffusion model. The generated frames are evaluated against ground-truth video using perceptual, temporal and trajectory-based metrics, and benchmarked against optical flow extrapolation and RIFE interpolation baselines. SG-I2V produces the most naturally appearing frames among all methods (BRISQUE 25.52, closest to ground-truth 23.64), the most realistic motion magnitude (temporal smoothness 1.14 vs. ground truth 1.42), and the strongest GPS trajectory adherence (9.31px vs. 28.70px for ground-truth, the latter reflecting approximate temporal alignment between footage and GPS logs rather than generation error), demonstrating that trajectory-guided diffusion synthesis is a viable approach to maritime video reconstruction under challenging low-texture, small-object conditions.
[CV-305] Agent ic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments
链接: https://arxiv.org/abs/2605.16419
作者: Juncheng Yu,Lusi A,Haoxuan Xie,Weiming Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted by EMBC 2026. 7 pages, 3 figures
Abstract:Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, owing to the reliance on calibration and the difficulty of achieving multi-view synchronization, their deployment in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. The Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, where an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. Such 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against Vicon system demonstrated the strong performance, achieving an MAE of 5.97^\circ \pm 2.36^\circ and a Pearson correlation coefficient of 0.962 \pm 0.014 . The proposed method is expected to provide a practical, patient self-deployable system to perform daily kinematic monitoring in uncalibrated home environments.
[CV-306] Neural Visual Decoding via Cognitive guided Adaptive Blurring and Information Constrained Alignment
链接: https://arxiv.org/abs/2605.16418
作者: Fan Yin,Chuhang Zheng,Peiliang Gong,Donghai Guan,Qi Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:EEG-based visual decoding aims to establish a mapping between neural signals and visual semantics. However, it remains constrained by the dual challenges of severe information granularity mismatch and the low signal-to-noise ratio (SNR) of EEG signals. Existing approaches typically treat static visual features, ignoring the dynamic selectivity of human vision and the frequency specificity of neural oscillations. To bridge this gap, we propose CAIA, a Cognitive-guided Adaptive blurring with Information-Constrained Alignment framework for Neural-Visual decoding. On the visual side, it simulates selective attention to adaptively reduce redundancy. Meanwhile, on the EEG side, it leverages neural oscillation priors and the information bottleneck mechanism to enhance SNR. Specifically, we devise a cognitive-dynamics-based adaptive blurring mechanism that dynamically integrates center-biased and saliency-guided visual cues via cross-modal attention. Furthermore, we introduce a distribution-aware boundary calibration loss to robustly rectify alignment bias caused by outlier samples. Moreover, a cognitively-guided information-screening method is proposed to select task-relevant EEG oscillations. Extensive experiments demonstrate that CAIA improves both subject-dependent and subject-independent average Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval, significantly outperforming prior methods. Our work validates that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.
[CV-307] CAVE: A Structured Credit Assignment Approach for Frag mented Visual Evidence Reasoning
链接: https://arxiv.org/abs/2605.16416
作者: Tengda Guo,Jie Leng,Hanlei Li,Yaoyuan Liang,Qingyue Zhang,Dian Yang,Mingyu Zhang,Yuhua Fu,Shao-Lun Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 6 figures. Preprint
Abstract:Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.
[CV-308] Diffusion Models Denoiser Architecture and Creativity
链接: https://arxiv.org/abs/2605.16415
作者: Itamar Levine,Yair Weiss
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture leads to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.
[CV-309] NERVE: A Neuromorphic Vision and Radar Ensemble for Multi-Sensor Fusion Research
链接: https://arxiv.org/abs/2605.16414
作者: Omar Mansour,Pietro Martinello,Ethan Milon,YingFu Xu,Manolis Sifalakis,Guangzhi Tang,Amirreza Yousefzadeh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in ICJNN 2026 Maastricht
Abstract:We present NERVE (Neuromorphic Vision and Radar Ensemble), a multi-sensor dataset comprising 257 minutes of synchronized recordings from five sensors: two Dynamic Vision Sensors (DVS), an RGB-D camera, and two Radar units (24GHz and 77GHz). Captured across 12 measurement days in office environments, NERVE contains around 600GB of uncompressed temporally aligned data with around 914,000 frames and around 9.6 million RGB COCO-formatted annotations covering 16 relevant object categories. To evaluate multi-modal fusion, we construct a DVS+Radar subset for human detection and distance estimation. Baseline experiments using feed-forward and recurrent detectors show that combining DVS with 77GHz Radar consistently improves detection, with recurrent models achieving up to 47.5% mAP and mean absolute Radar distance errors below 1.8m against LiDAR ground truth.
[CV-310] SCAR: Self-Supervised Continuous Action Representation Learning
链接: https://arxiv.org/abs/2605.16412
作者: Hongjia Liu,Fan Feng,Minghao Fu,Xinyue Wang,Haofei Lu,Biwei Huang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples the controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin dataset show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
[CV-311] st-Time Hinting for Black-Box Vision-Language Models
链接: https://arxiv.org/abs/2605.16410
作者: Kaihua Hou,Abhijith Varma Mudunuri,Jiaxing Qiu,Roxana Daneshjou,Thomas Hartvigsen,Ahmed Alaa
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Test-time scaling (TTS) methods have proven highly effective for LLMs, yet their application to vision-language models (VLMs) remains relatively underexplored. Existing VLM TTS methods largely require open-weight model access or expensive repeated sampling, and are evaluated primarily on multimodal mathematical and scientific reasoning benchmarks rather than general visual understanding tasks. In this paper, we propose Test-Time Hinting, a method that improves VLM performance via a single VLM call and requiring only black-box API access, which makes it broadly applicable to frontier closed-weight models. Our method is motivated by the observation that VLM errors tend to cluster around recurring failure patterns. We therefore train a lightweight hint generator model to predict, for a given test input, which “hint” should be prepended to the prompt, providing targeted contextual or procedural guidance that steers the VLM away from its characteristic failure modes. We show that Test-Time Hinting improves the accuracy of multiple closed-weight VLMs on natural-image VQA benchmarks and that these gains generalize to unseen benchmarks and VLMs without retraining the hint generator.
[CV-312] Visual Search Patterns in 3D Pancreatic Imaging: An Eye Tracking Study
链接: https://arxiv.org/abs/2605.16408
作者: Anna Anikina,Leila Khaertdinova,Trine Balschmidt,Michael B Andersen,Christoph F Müller,Erik GS Brandt,Henrik S Thomsen,Claudia Mello-Thoms,Bulat Ibragimov
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at SPIE - Medical Imaging Conference 2026
Abstract:Eye tracking has emerged as a powerful tool for examining visual perception and search strategies in various domains, including medicine. While it is relatively straightforward to apply in 2D settings, its use in 3D medical imaging remains challenging and not yet well explored. This gap is particularly relevant for radiology, where volumetric images such as computed tomography (CT) scans are routinely read by medical experts. Radiologists typically interpret these images by navigating through hundreds of 2D slices, most often viewed in the axial projection. A taxonomy of eye movement data during navigation through a CT volume could be valuable to understand how radiologists approach diagnostic tasks. As an example of the derived taxonomy, we asked two radiologists to search abdominal CTs of the pancreas. We collect eye tracking data and align eye gaze movements with slice navigation to visualize the representation of the pancreas through volume and analyze clinicians’ gaze behavior in both space and time.
[CV-313] Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection
链接: https://arxiv.org/abs/2605.16406
作者: Franky George,Muhammad Khalid,Adil Khan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Night-time pedestrian detection remains challenging because labelled night-time data are limited and large illumination differences make daytime-only trained detectors unreliable. Latent diffusion models (LDMs) provide a powerful basis for image-to-image translation and cross-domain augmentation, but their effectiveness in safety-critical perception depends on whether detector-relevant objects and local semantic structure are preserved when translating between source and target domains. In this work, we present Contrastive-SDXL, a day-to-night augmentation framework for night-time pedestrian detection built on SDXL-Turbo and fine-tuned using Low-Rank Adaptation (LoRA). To preserve semantic correspondence between daytime inputs and translated night-time images, we introduce a patch-wise semantic contrastive loss guided by a pretrained DINOv2 encoder rather than generator encoder features. Multi-level DINOv2 self-attention maps enforce both local and global semantic consistency, while an object consistency loss explicitly encourages pedestrian preservation. Contrastive-SDXL produces realistic night-time images, achieving a Frechet Inception Distance (FID) of 22.5. Detectors trained with our synthetic images obtain a 6-7% reduction in miss rate compared with a daytime-only baseline, approaching the performance of detectors trained on real night-time data. These results demonstrate that consistency-driven diffusion augmentation can effectively support safety-critical night-time pedestrian this http URL
[CV-314] Concepts Worth Having: Refining VLM-Guided Concept Bottleneck Models with Minimal Annotations
链接: https://arxiv.org/abs/2605.16405
作者: Nicola Debole,Andrea Passerini,Stefano Teso,Andrea Pugnana,Emanuele Marconato
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Concept-bottleneck models (CBMs) are neural classifiers that compute predictions from high-level concepts extracted from the input. CBMs ensure stakeholders can understand the concepts – and the predictions they entail – by learning these from concept-level annotations, which are however seldom available. Recent CBM architectures work around this issue by obtaining annotations from Vision-Language Models (VLMs). While greatly broadening applicability, doing so can yield lower quality concepts and therefore less interpretable models. We strike for a middle ground by introducing Vision-plus-Human-guided CBM (VH-CBM), a hybrid approach that exploits both VLMs and a small amount of dense annotations. VH-CBM employs a Gaussian Process in the VLM’s embedding space, which captures useful global information about the target domain, to propagate the expert’s supervision to any target data point. Our empirical evaluation shows how VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while sporting better concept calibration and supporting active learning.
[CV-315] Hybrid Quantum-MambaVision: A Quantum-enhanced State Space Model for Calibrated Mixed-type Wafer Defect Detection
链接: https://arxiv.org/abs/2605.16404
作者: Satwik Sai Prakash Sahoo,Jyoti Prakash Sahoo,Ting Wang,Subrota Kumar Mondal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Extracting actionable knowledge from industrial visual data is fundamentally bottlenecked by extreme class imbalance and the prohibitive computational complexity of modern foundation models. In semi-conductor manufacturing, identifying multi-label wafer defects is a complex spatial data mining task where overlapping patterns obscure critical root-cause signals. While Vision Transformers (ViTs) excel at global dependency extraction, their quadratic scaling renders them inefficient for high-throughput, real-time anomaly detection. To overcome these computational barriers, this paper introduces Hybrid Quantum-MambaVision, a highly efficient architecture tailored for spatial knowledge discovery. We integrate a linear-complexity State-Space Model (SSM) backbone with a Parameterized Quantum Context Adapter (QCA) and Low-Rank Adaptation (LoRA). The Mamba backbone efficiently captures long-range spatial dependencies, while the quantum adapter maps compressed latent features into a high-dimensional Hilbert space to disentangle complex, overlapping signatures. On the highly imbalanced MixedWM38 dataset, Hybrid Quantum-MambaVision achieves exceptional multi-label classification performance, significantly reducing the error rate on complex multi-defect topologies compared to classical baselines. The quantum regularizer acts as a profound uncertainty calibrator, substantially reducing Maximum Calibration Error (MCE) and minimizing expected false-positive costs. This work establishes a scalable Quantum-Classical hybrid paradigm for efficient representation learning in industrial data mining.
[CV-316] When Vision Speaks for Sound
链接: https://arxiv.org/abs/2605.16403
作者: Xiaofei Wen,Wenjie Jacky Mo,Xingyu Fu,Rui Cai,Tinghui Zhu,Wendi Li,Yanan Xie,Muhao Chen,Peng Qi
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: 24 pages, 10 figures
Abstract:Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
[CV-317] WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments
链接: https://arxiv.org/abs/2605.16402
作者: Haoren Zhao,Tianyi Chen,Zhen Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments. The code is available at this https URL.
[CV-318] CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification ICIP2026
链接: https://arxiv.org/abs/2605.16401
作者: Turkoglu Mikael,Bary Tim,Thielens Vincent,Dausort Manon,Macq Benoît
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 1 table, Accepted at ICIP 2026
Abstract:While high-capacity AI models have advanced state-of-the-art performance, their practical deployment is often hindered by high inference costs, environmental impact, and a “one-size-fits-all” approach that ignores varying sample complexity. In clinical settings for instance, the waste of computational resources on routine cases is a significant barrier to sustainable AI. In this paper, we introduce the Conformal Adaptive Decision System (CADS), a sequential multi-model algorithm designed to optimize resource allocation by efficiently sampling models based on the estimated data complexity. CADS leverages conformal prediction to quantify image uncertainty at runtime. CADS provides a mathematically grounded framework for balancing the cost-accuracy dilemma that dynamically routes samples through a model cascade, ranging from lightweight “Scout” models to high-capacity “Oracle” architectures. Validated on two datasets, CADS demonstrated superior efficiency and accuracy at a computational cost that can be up to 12 times lower than heavy-model inference. By accurately routing samples based on real-time complexity, CADS ensures high diagnostic reliability while drastically reducing the economic and environmental footprint of AI.
[CV-319] Stable and Near-Reversible Diffusion ODE Solvers for Image Editing
链接: https://arxiv.org/abs/2605.16399
作者: Barbora Barancikova,Daniil Shmelev,Cristopher Salvi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The inversion of diffusion models plays a central role in image editing. Algebraically reversible ODE solvers provide an appealing approach to diffusion inversion for text-guided image editing, by eliminating the inversion error inherent in DDIM-based editing pipelines. However, empirical results indicate that reversibility alone is insufficient. As edits require larger semantic or visual changes, reversible diffusion solvers often exhibit instabilities and suffer sharp drops in output quality. In this paper, we show that the trade-off between exact reversibility and numerical stability manifests empirically as a trade-off between background preservation and prompt alignment in image editing. We then investigate the use of near-reversible Runge-Kutta methods as a more stable alternative to exactly reversible diffusion schemes. When combined with a vector-field smoothing strategy, the resulting approach improves edit fidelity, remains stable under large edits, and largely retains the background-preservation benefits of reversible solvers.
[CV-320] rajectory-Aware Adaptive Inference in Object Detection Models MDM2026
链接: https://arxiv.org/abs/2605.16397
作者: Grigorios Papanikolaou,Ioannis Kontopoulos,Giannis Spiliopoulos,Dimitris Zissis,Konstantinos Tserpes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to the MuseKDE workshop of the IEEE MDM 2026 conference
Abstract:The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby vessels are tightly coupled, particularly in dynamic environments such as maritime navigation. However, the efficiency of object detection models during inference remains an often-overlooked aspect. To this end, we build upon an existing object detection framework by incorporating GPS trajectory data into the inference process to enable input-adaptive computation. Specifically, we introduce an early-exit mechanism in a YOLOv8-based detector that incorporates motion cues - such as inter-vessel distances. Frames of vessels that are separated by short distances, converging with high speed, are processed using the full model, while only a subset of the network’s architecture is activated otherwise. The difficulty degree (or scene complexity) of a frame or set of frames per second is evaluated by leveraging inter-object distance and the rate at which the distance between them decreases. Experimental results demonstrate that this strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, thus enabling a flexible trade-off between accuracy and efficiency compared to full-model inference.
[CV-321] Beyond MMSE: Enhancing PnP Restoration with ProxiMAP
链接: https://arxiv.org/abs/2605.16396
作者: Kenta Vert,Giacomo Meanti,Scott Pesme,Michael Arbel,Julien Mairal
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Plug-and-Play (PnP) methods have become standard tools for solving imaging inverse problems by replacing the intractable maximum a posteriori (MAP) denoiser with the MMSE one. While this mismatch has been widely treated as unavoidable, recent works have sought to close this gap by targeting the MAP with diffusion-model scores. We show this is problematic in practice: learned scores do not match the true ones, so MAP-targeting iterations converge to cartoon-like images rather than realistic ones, and better results are obtained by stopping short of convergence. We turn this observation into a design principle and introduce ProxiMAP, an iterative MAP approximation whose noise schedule keeps the iterate’s residual noise matched to the denoiser’s training noise. This keeps the denoiser in-distribution where its score is reliable, and yields implicit early stopping that avoids the failure mode above. ProxiMAP is a modular drop-in replacement for MMSE denoisers in standard PnP algorithms and consistently sharpens reconstructions across deblurring, inpainting, super-resolution, and phase retrieval. Building on the same principle, we propose a hybrid variant that applies ProxiMAP only in the late iterations of PnP, where the denoiser is most reliable – matching or exceeding the full-replacement variant at a fraction of the cost.
[CV-322] Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation
链接: https://arxiv.org/abs/2605.16393
作者: Joel Valdivia Ortega,Tingying Peng,Marion Jasnin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, who may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.
[CV-323] Inducing Spatial Locality in Vision Transformers through the Training Protocol
链接: https://arxiv.org/abs/2605.16390
作者: Eduardo Santiago Toledo,Asael Fabian Martínez
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:We investigate whether the training protocol can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Keeping the architecture and optimization procedure fixed, we compare a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, characterizing each attention head via Mean Attention Distance (MAD) and normalized entropy. Across all three datasets, the Modern protocol produces more local and more concentrated attention in early layers; on CIFAR-100, the minimum MAD drops from 0.316 (Baseline) to 0.008 (Modern). To identify the source of this effect, we conduct an ablation study on CIFAR-100 by adding or removing each component individually. The results identify CutMix as the determining component within our experiments: all conditions with CutMix exhibit MAD 0.024, while all conditions without CutMix remain at MAD 0.210. AutoAugment and Label Smoothing show no independent effect on locality. Taken together, these findings suggest that the pressure to classify from partial image regions, induced by CutMix, can promote the emergence of local attention in Vision Transformers.
[CV-324] ChronoSC: Task-Oriented Semantic Communication via Temporal-to-Color Encoding
链接: https://arxiv.org/abs/2605.16388
作者: Phuc H. Nguyen,Trung T. Nguyen,Quy N. Duong,Van-Dinh Nguyen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, IEEE ICCE 2026
Abstract:Semantic communication (SC) aims to reduce transmission overhead by conveying task-relevant information rather than raw data. However, existing SC approaches for video largely focus on pixel-level reconstruction or rely on complex spatiotemporal pipelines, leading to excessive bandwidth usage and latency that are unsuitable for low-resource deployments. In this paper, we propose ChronoSC, a task-oriented semantic communication framework for Video Question Answering (VideoQA). ChronoSC introduces Chrono-Color Stacking, a lightweight and lossless projection scheme that encodes temporal video dynamics into a single static image, enabling extreme temporal compression before transmission. This compact semantic representation is transmitted using a lightweight Deep Joint Source-Channel Coding (DeepJSCC) transceiver and explicitly reconstructed at the receiver. Unlike latent-space methods, explicit visual reconstruction enables the direct reuse of pre-trained vision-language models; specifically, a pre-trained BLIP model is employed to infer answers from noisy, reconstructed chrono-images. Experiments on the CLEVRER dataset show that ChronoSC achieves up to 192 times bandwidth reduction compared to raw video transmission while maintaining high VideoQA accuracy.
[CV-325] Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition MICCAI2026
链接: https://arxiv.org/abs/2605.16387
作者: Yang Liu,Ning Zhu,Jingjing Peng,Xiwu Chen,Alejandro Granados,Guotai Wang,Sebastien Ourselin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Early accepted by MICCAI 2026
Abstract:Online Surgical Phase Recognition (SPR) models can reach high frame-wise accuracy, yet their predictions often lack temporal stability, fragmenting workflow understanding and reducing the reliability of downstream assistance. We show that this instability is not random noise but arises from two mechanisms: early misclassifications corrupt temporal feature states and propagate forward to form error cascades, and phase transitions follow evidence-accumulation dynamics whereas most online SPR systems rely on memoryless frame-wise decisions, making them sensitive to transient confidence fluctuations. We propose a unified Train-Inference-Evaluation framework that explicitly stabilizes temporal inference dynamics using model-agnostic, plug-and-play components. For training, the Temporal Error-Cascade (TEC) loss suppresses error onset and mitigates forward error propagation by stabilizing temporal feature evolution. For inference, the Evidence-Gated Transition Predictor (EGTP) enforces evidence-driven state transitions, allowing phase changes only when accumulated evidence exceeds a confidence boundary. For evaluation, we introduce the Temporal Fragmentation Index (TFI), a reliability-aware metric that quantifies instability-induced temporal disagreement beyond conventional frame-wise and token-based measures. Experiments on Cholec80 and AutoLaparo across three representative backbones show that the proposed framework substantially improves temporal stability and reduces prediction fragmentation, while maintaining or modestly improving frame-wise performance.
[CV-326] Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
链接: https://arxiv.org/abs/2605.16386
作者: Jiaqing Zhang,Sandeep Elluri,Bhanu Cherukuvada,Yonah Joffe,Jessica Sena,Miguel Contreras,Scott Siegel,Subhash Nerella,Catherine Price,Parisa Rashidi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
[CV-327] Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice
链接: https://arxiv.org/abs/2605.16384
作者: Xiusheng Huang,Xin Jiang,Jun Zhao,Kang Liu,Yequan Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures
Abstract:Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok’s state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.
[CV-328] A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification
链接: https://arxiv.org/abs/2605.16383
作者: Ezel Kilicdere,Shireen Kudukkil Manchingal,Fabio Cuzzolin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 36 pages
Abstract:Deep neural networks achieve high accuracy on image classification tasks. Yet, they often produce overconfident predictions as which fail to express epistemic uncertainty, and frequently violate logical or structural constraints present in the data. These limitations are particularly pronounced in hierarchical classification, where predictions across fine and coarse levels must remain coherent. We propose, for the first time, a unified neurosymbolic and epistemic modelling framework that augments Swin Transformers with focal set reasoning and differentiable fuzzy logic. Rather than treating labels as isolated categories, our method induces data-driven focal sets within the learnt embedding space, which helps capture epistemic uncertainty over multiple plausible fine-grained classes. These focal sets form the basis of a belief-theoretic layer that uses fuzzy membership functions and t-norm conjunctions to encourage consistency between fine- and coarse-grained predictions. A learnable loss further balances calibration, mass regularisation, and logical consistency, allowing the model to adaptively trade off symbolic structure with data-driven evidence. In experiments on hierarchical image classification, our framework maintains accuracy on par with transformer baselines while providing more calibrated and interpretable predictions, reducing overconfidence and enforcing high logical consistency across hierarchical outputs. Our experimental results show that combining focal set reasoning with fuzzy logic provides a practical step toward deep learning models that are both accurate and epistemically aware.
[CV-329] StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video
链接: https://arxiv.org/abs/2605.16381
作者: Ao Li,Zihan Xiao,Zihao Yue,Boshen Xu,Linli Yao,Jiaze Li,Pei Fu,Jianzhong Ju,Jian Luan,Qin Jin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a “see-then-answer” paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model’s ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model’s ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.
[CV-330] Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT
链接: https://arxiv.org/abs/2605.16373
作者: Zonglin Yang,Xiaolei Diao,Jishizhan Chen,Xiaozhuang Man,Wei Kong,Gen Wen,Pengfei Cheng,Daqian Shi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal this http URL mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.
[CV-331] SwordBench: Evaluating Orthogonality of Steering Image Representations
链接: https://arxiv.org/abs/2605.16372
作者: Vladimir Zaigrajew,Dawid Pludowski,Hubert Baniecki,Przemyslaw Biecek
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
[CV-332] GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning
链接: https://arxiv.org/abs/2605.16371
作者: Jinhao Jing,Zheng Ma,Jinwei Liang,Qiannian Zhao,Shawn Chen,Jing Yang,Por Lip Yee,Prayag Tiwari,Jingjing Bai,Benyou Wang,Lewei Lu,Zhan Su
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at this https URL and this https URL.
[CV-333] Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLM s
链接: https://arxiv.org/abs/2605.16366
作者: Yigui Feng(1),Qinglin Wang(1),Yang Liu(2),Jie Liu(1) ((1) The College of Computer Science, National University of Defense Technology, Changsha, Hunan, China, (2) The Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, Guangdong, China)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures
Abstract:Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose \textbfFre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy–efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
[CV-334] How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F3A
链接: https://arxiv.org/abs/2605.16359
作者: YiJie Huang,Yiqun Zhang,Zhuoyue Jia,Xiaocui Yang,Junzhao Huang,Zihan Wang,Shi Feng,Daling Wang,Yifei Zhang,Yongkang Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training, no extra LLM forward pass and preserves the original multimodal prompting and decoding pipeline.
[CV-335] Generative 3D Gaussians with Learned Density Control SIGGRAPH
链接: https://arxiv.org/abs/2605.16355
作者: Runjie Yan,Yan-Pei Cao,Peng Wang,Ding Liang,Yuan-Chen Guo
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 16 figures, SIGGRAPH Conference Papers '26
Abstract:We present Density-Sampled Gaussians (DeG), a novel 3D representation designed to bridge the gap between adaptive rendering primitives and scalable generative modeling. Unlike existing approaches that constrain 3D Gaussians to fixed voxel grids or arrays, DeG models Gaussian centers as samples from a learnable probability density function defined over an octree. This formulation provides a rigorous mathematical framework for adaptive density control: by jointly optimizing the spatial density and Gaussian attributes under rendering supervision, our model naturally concentrates primitives in regions of high geometric complexity. We achieve this via a new render loss contribution gradient that serves as a fully differentiable analogue to the discrete densification and pruning heuristics used in standard Gaussian Splatting. The resulting representation is highly flexible, supporting variable-resolution decoding from a single latent code by simply adjusting the sampling budget. To enable generative synthesis, we train a latent diffusion model on DeG. We identify a critical challenge in applying diffusion to unordered set-structured latents, which can significantly slow convergence, and propose VecSeq, a canonical re-indexing mechanism that anchors latent tokens to a deterministic 3D Sobol sequence. This transforms the ambiguous set-generation problem into a robust sequence modeling task. Extensive experiments demonstrate that our pipeline achieves state-of-the-art quality in single-image-to-3D generation, combining the structural adaptivity of unstructured primitives with the training stability of grid-based methods.
[CV-336] StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLM s
链接: https://arxiv.org/abs/2605.16353
作者: Chang Che,Ziqi Wang,Hui Ma,Cheems Wang,Zenglin Shi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing model’s abilities from continuously evolving data streams.
[CV-337] Noise2Params: Unification and Parameter Determination from Noise via a Probabilistic Event Camera Model
链接: https://arxiv.org/abs/2605.16317
作者: Owen Root,Julinda Mujo,Min Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: Main paper: 29 pages, 18 figures; Supplemental Information: 22 pages, 15 figures
Abstract:Accurate, unified models for event cameras (ECs) remain elusive, hampering calibration and algorithm design. We develop a foundational probabilistic model for EC event detection, grounded in photon statistics, that unifies the description of static scene noise events and step response curves (S-curves) within a single analytical framework. Three formulations of the probability distributions are derived, spanning all intensity regimes: exact Poisson, saddle-point, and Gaussian. The model reveals the underlying connection between these otherwise disparate EC behaviors and clarifies the interpretation of S-curves, which we show is more nuanced than selecting a fixed probability threshold. Based on this model, we propose Noise2Params, a method for determining camera-specific values of the log-contrast threshold B , the lux-to-photon conversion factor \alpha , and the leakage term \theta (found to be intensity dependent), via error minimization against observed noise-event distributions. Noise2Params requires only recordings of static, uniform scenes, offering an experimentally accessible alternative to approaches that demand specialized dynamic light sources. We further support the validity the model by training convolutional neural networks (CNNs) on synthetic noise images generated from our distributions and evaluating their ability to reconstruct static scenes from experimental data. We further demonstrate the utility of our model by showing that CNNs incorporating synthetic data outperform those trained solely on experimental data. Our framework provides a quantitative foundation for EC calibration, noise-aware algorithm design, and applications in photon-limited regimes.
[CV-338] Patchwork: A compact representation for 3D polygonal shapes
链接: https://arxiv.org/abs/2605.16266
作者: Ruichen Zheng,Biao Zhang,Michael Birsak,Mikhail Skopenkov,Peter Wonka
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We introduce Patchwork, a new general-purpose shape representation capable of modeling 2D and 3D geometry with a small number of parameters. Patchwork is grounded in a rigorous mathematical framework, providing provable complexity bounds and the ability to approximate arbitrary shapes with arbitrary precision in any dimension. We propose an efficient gradient-based optimization scheme to fit Patchwork representations to 2D and 3D data, along with a novel regularization loss that progressively prunes redundant elements, yielding high compactness after convergence. Our approach offers fast fitting performance, a fraction of the required parameters compared to existing alternatives, and native support for inside-outside classification, making it a versatile and compact representation for geometric learning and reconstruction tasks, with future potential for 3D generation. Our implementation is available at: this https URL.
[CV-339] CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery
链接: https://arxiv.org/abs/2605.18054
作者: Tung-I Chen,Lingdong Wang,Subhransu Maji,Ramesh K. Sitaraman
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Volumetric media promises next-generation content delivery applications, but its bandwidth demand remains a key bottleneck. Implicit and hybrid volumetric representations reduce model sizes, yet still require careful coding to reach 2D video-like bitrates. We present CATRF, a standard-codec-in-the-loop compression framework for plane-factorized radiance fields. During training, we quantize and pack 2D feature planes into codec-friendly canvases, run a standard codec roundtrip (JPEG/VP9/HEVC/AV1), then unpack and dequantize the decoded features before volume rendering. We use a straight-through estimator (STE) to insert the non-differentiable, standard codec pipeline into the training loop, allowing radiance-field features to adapt directly to the real, client-side codec distortions without introducing any learned codec parameters. On both static and dynamic benchmarks, CATRF consistently achieves a better rate-distortion trade-off over codec-agnostic and learned-codec-in-the-loop baselines, and also outperforms recent compressed 3DGS methods in both compression efficiency and decoding speed. These results highlight a practical path toward low-bitrate, compression-resilient volumetric representations for free-viewpoint video streaming.
[CV-340] Inter-LPCM: Learning-based Inter-Frame Predictive Coding for LiDAR Point Cloud Compression
链接: https://arxiv.org/abs/2605.18006
作者: Chang Sun,Hui Yuan,Shiqi Jiang,Chongzhen Tian,Guanghui Zhang,Raouf Hamzaoui
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 14 pages, 12 figures
Abstract:Because LiDAR sensors acquire point clouds with a fixed angular resolution, the resulting data can be systematically parameterized and efficiently compressed in the spherical coordinate system. Traditional spherical coordinate-based point cloud compression methods have demonstrated strong rate-distortion (RD) performance, with the predictive geometry coding (PredGeom) method in the geometry-based point cloud compression (G-PCC) standard being a prominent example. Although PredGeom includes an inter-frame prediction mode, it relies on a simple linear model, which limits its ability to capture complex motion patterns and structural dependencies. Meanwhile, existing learning-based compression methods in the spherical domain do not exploit inter-frame correlations to reduce geometry redundancy. To address these limitations, we propose a learning-based inter-frame predictive coding method, termed Inter-LPCM. For azimuth prediction, we employ a delta coding strategy based on the predefined angular resolution. To improve radius compression, we introduce an inter-frame radius predictive (Inter-RP) model that estimates the current point’s radius using neighboring points from both the current frame and the registered reference frame. In addition, we design a lightweight attention-based prediction (LAEP) model to predict elevation angles by capturing long-range geometric correlations across different coordinates. For quantization, we propose an RD-optimized method to select quantization steps in the spherical coordinate system. For entropy coding, we design distinct models for each spherical coordinate component. These models are adapted to the statistical priors of each coordinate, enabling more accurate probability estimation. Our source code is publicly available at this https URL
[CV-341] See Silhouettes in Motion with Neuromorphic Vision
链接: https://arxiv.org/abs/2605.17984
作者: Pei Zhang,Shijie Lin,Zhou Ge,Jinpeng Chen,Wei Pu
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 12 pages, 12 figures, and 3 tables. This work is under review. Project page: this https URL
Abstract:Quasi-bimodal objects, such as text, road signs, and barcodes, play a basic yet vital role in daily visual communication. By boiling these down to clear silhouettes, binarization uses a minimal language to convey essential vision cues for maximum downstream efficiency. The catch is that frame-based imaging often struggles on mobile platforms like drones, self-driving cars, and underwater vehicles. In these dynamic scenes, rapid motion and harsh lighting can make it blind, causing severe motion blur and erasing crucial details. To overcome the limits, neuromorphic vision via event cameras, featuring microsecond-level temporal resolution and high dynamic range, steps in as a natural solution. Building upon this event-driven sensing paradigm, we introduce a simple yet effective dual-modal approach that harnesses the synergy between frames and events to achieve real-time, high-frame-rate binarization on CPU-only devices. Extensive evaluations present that it earns competitive performance against leading techniques in reducing motion blur, while delivering impressive improvements under challenging illumination. Besides, our asynchronous workflow bypasses event scarcity that breaks traditional time-binning reconstruction, maintaining clear target shapes even at extreme kilohertz frame rates. Its binary results further serve as reliable representations that facilitate a range of downstream tasks. This work paves the way towards lightweight perception and interaction in embodied intelligence on resource-constrained edge platforms.
[CV-342] Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures ICML2026
链接: https://arxiv.org/abs/2605.17850
作者: Chenyang Wang,Weizhong Wang,Yinuo Ren,Jose Blanchet,Yiping Lu
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
备注: accepted by ICML 2026
Abstract:iffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce \textttURGE, Unbiased Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights in previous work, \textttURGE attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, no Hessian, and no PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, \textttURGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality, while being significantly simpler to implement and fully gradient-free.
[CV-343] MIRAG E: Robust multi-modal architectures translate fMRI-to-image models from vision to mental imagery
链接: https://arxiv.org/abs/2605.17198
作者: Reese Kneeland,Cesar Kadir Torrico Villanueva,Jordyn Ojeda,Shuhb Khanna,Jonathan Xu,Paul S. Scotti,Thomas Naselaris
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To be useful for downstream applications, vision decoding models that are trained to reconstruct seen images from human brain activity must be able to generalize to internally generated visual representations, i.e., mental images. In an analysis of the recently released NSD-Imagery dataset, we demonstrated that while some modern vision decoders can perform quite well on mental image reconstruction, some fail, and that state-of-the-art (SOTA) performance on seen image reconstruction is no guarantee of SOTA performance on mental image reconstruction. Motivated by these findings, we developed MIRAGE, a method explicitly designed to train on vision datasets and cross-decode mental images from brain activity. MIRAGE employs a linear backbone and multi-modal text and image features as input to a diffusion model. Feature metrics and human raters establish MIRAGE as SOTA for mental image reconstruction on the NSD-Imagery benchmark. With ablation analysis we show that mental image reconstruction works best when decoders use image features with relatively few dimensions and include guidance from text-based and both high- and low-level image-based features. Our work indicates that–given the right architecture–existing large-scale datasets using external stimuli are viable training data for decoding mental images, and warrant optimism about the future success and utility of mental image reconstruction.
[CV-344] Adaptive Fused Prior Transfer for Controllable Generative Image Compression
链接: https://arxiv.org/abs/2605.16817
作者: Yifei Pei,Ying Liu,Nam Ling
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 10 figures. This work has been submitted to IEEE Access for possible publication. Code is available at this https URL
Abstract:Learned image compression has achieved competitive rate-distortion performance, but very-low-bitrate reconstruction remains difficult because the transmitted representation often cannot preserve fine textures and local structures. Perceptual and generative codecs address this problem by using learned reconstruction priors, and controllable codecs allow one model to cover different bitrate and reconstruction preferences. However, controllability alone does not resolve the decoder-side reconstruction-prior problem: under severe bit constraints, the decoder must infer missing details from limited transmitted information, while existing codebook-based controllable designs generally rely on single-codebook token-based priors. This paper proposes Adaptive Fused Prior Transfer for Controllable Generative Image Compression (AFP-GIC), a controllable codec that transfers an adaptive fused prior from a frozen pretrained AdaCode model. Encoder-side fused-prior features guide latent formation, while the decoder predicts a compatible fused prior from the compressed representation and selected control variables, enabling prior-guided reconstruction without transmitting the fused prior itself. A motivating analysis relates decoder-side fused-prior alignment to a reconstruction-error upper bound and shows that the fused-prior family contains single-codebook choices as special cases. Under the unified benchmark, AFP-GIC reduces decoder latency by 18.1% and the overall parameter count by 31.10 million (20.5%) relative to DC-VIC. Experiments on Kodak, CLIC2020, and DIV2K show competitive PSNR, with the clearest perceptual gains in NIQE scores and very-low-bitrate visual comparisons.
[CV-345] Deep Learning for MRI Slice Interpolation: The Critical Role of Problem Formulation
链接: https://arxiv.org/abs/2605.16476
作者: Shamit Savant
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages main text, 21 pages total with supplementary, 8 figures, supplementary material included
Abstract:Through-plane resolution in clinical MRI is typically much coarser than in-plane resolution, limiting diagnostic utility. This work investigates deep learning approaches to interpolate intermediate MRI slices in prostate imaging, effectively doubling through-plane resolution. I evaluated five architectures (CNN, U-Net, two GAN variants, and DDPM) and discovered that problem formulation has dramatically more impact than architectural complexity. By reformulating the interpolation task to use adjacent slices (i-1, i+1) rather than distant slices (i-2, i+2), I achieved a 58% improvement in SSIM performance across all deterministic architectures. The U-Net model achieved the best results with PSNR of 30.08 dB and SSIM of 0.898, representing a 10.1% improvement over linear interpolation baseline. A DDPM was also evaluated but showed poor reconstruction quality due to fundamental mismatch between stochastic generation and deterministic reconstruction requirements. These findings demonstrate that problem formulation can have 290x more impact than architectural sophistication in medical imaging tasks.
[CV-346] Flow Matching with Optimized Subclass Priors for Medical Image Augmentation
链接: https://arxiv.org/abs/2605.16469
作者: Felix Nützel,Mischa Dombrowski,Bernhard Kainz
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures, 7 tables
Abstract:Rare diseases dominate the diagnostic challenge in medical imaging yet are severely underrepresented in clinical datasets, causing classifiers to fail on exactly the conditions where reliable detection matters most. Generative augmentation can supply the missing tail-class coverage, but coarse disease labels aggregate diverse subtypes and acquisition settings into multi-modal conditionals that bias generators toward dominant submodes, while a shared Gaussian source forces rare subpopulations through disproportionately long transport paths. We propose an offline strategy that introduces informative priors at two levels: first, we partition each coarse label into coherent submodes via Gaussian mixture modeling in the generative model’s latent space; second, we learn subclass-conditioned source distributions that re-center and re-scale the starting distribution per submode, shortening trajectories and reducing within-subclass dispersion. To prevent degenerate solutions we impose explicit geometric control, moderately concentrating normalized displacement directions around learnable prototypes while capping path-length outliers. On long-tailed chest X-ray (MIMIC-LT, NIH-LT) and CT slice (CT-RATE) benchmarks the proposed method consistently improves tail-class generation fidelity and diversity (FID, IRS) and is a promising augmentation strategy that reliably improves downstream balanced accuracy and macro-F1 over a non-augmented baseline across modalities.
[CV-347] Bridging the Modality Bottleneck in Pathology MIL through Virtual Molecular Staining
链接: https://arxiv.org/abs/2605.16392
作者: Yucheng Xing,Pei Liu,Jingying Ma,Ruping Hong,Jiangdong Qiu,Tianyu Liu,Kai He,Ling Huang,Mengling Feng
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Multiple instance learning (MIL) is the dominant framework for whole-slide image analysis in computational pathology, typically combining a frozen patch encoder, a projection layer, and a slide-level aggregator. While encoders and aggregators have been extensively studied, the projection layer remains a largely morphology-only bottleneck. This limits endpoints such as biomarker status and survival, which are governed by a molecular state that is not fully captured by HE morphology. We introduce Molecularly Informed Staining Transform (MIST), a plug-in replacement for the MIL projection layer that uses paired spatial transcriptomics only during training to construct virtual molecular stains. MIST clusters gene expression profiles into cross-modal prototypes, anchors them in the frozen foundation model feature space, and uses them to reorganize HE patch features along molecularly guided axes. It requires no transcriptomics at inference and can be inserted before standard MIL aggregators. We evaluate MIST across 23 downstream tasks and 8 MIL aggregators. MIST improves 240 of 256 configurations over the standard projection layer, with an average gain of +3.5%, observed consistently across endpoint types: +5.2% on survival prediction, +3.3% on tissue subtyping, and +2.6% on biomarker prediction. Ablations confirm that gene-derived prototypes are the primary source of the gains, while spatial, biological, and pathological analyses show that cross-modal prototype affinities capture spatially coherent molecular programs from HE alone.
[CV-348] Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG
链接: https://arxiv.org/abs/2605.16376
作者: Marco Graziano
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Kelvin is a lightweight learned pre-encoder that sits in front of an unmodified libx264 encoder. It applies content-adaptive pixel adjustments, bounded at +/-1/255 per channel, so that the encoder allocates bits where they matter most perceptually, while emitting a standard H.264 bitstream compatible with every existing decoder, player, and CDN. On the seven-sequence 1080p UVG benchmark, Kelvin v1.0 achieves a mean BD-VMAF of -27.62% (7 of 7 wins) and BD-VMAF-NEG of -5.18% (6 of 7 wins) relative to baseline libx264 at preset medium. On the 30-sequence MCL-JCV public set (28 unseen by training), the same checkpoint wins on 28 of 30 clips by BD-VMAF; with the two diagnosable failures removed the mean is -27.70% BD-VMAF and -5.37% BD-VMAF-NEG, consistent with UVG to within one percentage point. A central engineering challenge is the non-differentiability of H.264: we describe a hybrid codec proxy that combines a calibrated differentiable rate estimator (Spearman rho = 0.986 vs. real libx264 bits-per-pixel) with a U-Net distortion proxy trained on real encoder outputs. We publish full per-sequence rate-distortion data, a named failure-mode taxonomy on MCL-JCV (rate-floor violation, distribution shift, metric saturation), a five-baseline sanity panel (hqdn3d, unsharp, -tune psnr, -tune ssim, x265 medium), and honest positioning: x265 medium beats Kelvin on every metric on the same corpus. Kelvin is therefore designed for workloads where remaining on H.264 is a constraint rather than a choice.
[CV-349] Learning Displacement-Aware WiFi Representations for Weakly Supervised Relative Localization
链接: https://arxiv.org/abs/2605.16357
作者: Tzu-Ti Wei,Po-Cheng Chen,Yu-Chee Tseng,Jen-Jee Chen
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:WiFi fingerprint-based indoor localization has been widely studied, but most existing approaches focus on absolute positioning and rely on dense coordinate annotations, which are costly to obtain at scale. In this paper, we study a fundamentally different problem: relative localization, where the goal is to directly estimate the displacement between two WiFi fingerprint traces without predicting their absolute positions. To reduce annotation overhead, we adopt weak supervision in the form of stepwise motion vectors obtained from inertial sensing. We propose Intersection Pathway (IP), a cross-modal learning framework that aligns fingerprint traces (f-traces) and displacement traces (d-traces) in a shared latent space. The key idea is to enforce an additive structure in the latent space, such that latent addition and subtraction correspond to physical motion composition, enabling direct relative-displacement inference. Experiments on a synthesized dataset derived from real measurements demonstrate that the proposed method learns displacement-aware WiFi representations and achieves accurate relative localization across varying displacement ranges. Furthermore, the learned model can be extended to few-shot absolute localization with sparse anchors.
人工智能
[AI-0] Actionable World Representation
链接: https://arxiv.org/abs/2605.18743
作者: Kunqi Xu,Jitao Li,Jianglong Ye,Tianshu Tang,Isabella Liu,Sifei Liu,Xueyan Zou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with a emphasis on modeling the physical world. Within the scope of physical world model, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Sweetly, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.
[AI-1] What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models
链接: https://arxiv.org/abs/2605.18738
作者: Payal Chandak,Victoria Alkin,David Wu,Maya Dagan,Taposh Dutta Roy,Maria Clara Saad Menezes,Ayush Noori,Nirali Somia,John S. Brownstein,Ran Balicer,Rebecca W. Brendel,Noa Dagan,Isaac S. Kohane,Gabriel A. Brat
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code and data available upon request via this https URL
Abstract:Medicine is inherently pluralistic. Principles such as autonomy, beneficence, nonmaleficence, and justice routinely conflict, and such ethical dilemmas often sharply divide reasonable physicians. Good clinical practice navigates these tensions in concert with each patient’s values rather than imposing a single ethical stance. The ethical values that large language models bring to medical advice, however, have not been systematically examined. We present a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities directly from decisions. The ecosystem of frontier models spans physician-level value heterogeneity, and models discuss competing values in their reasoning (Overton pluralism) before committing to a decision. However, individual model decisions are near-deterministic across repeated sampling and semantic variations, failing to reproduce the distributional pluralism of the physician panel. Across benchmark cases, these consistent decisions reflect committed, systematic value preferences. While most model priorities fall within the natural range of inter-physician variation, some significantly underweight patient autonomy. A single LLM deployed without regard for its value priorities could amplify those priorities at scale to every patient it serves. Without explicit efforts to balance ethical perspectives with one or multiple models, these tools risk replacing clinical pluralism with a deployment monoculture.
[AI-2] DexHoldem: Playing Texas Holdem with Dexterous Embodied System
链接: https://arxiv.org/abs/2605.18727
作者: Feng Chen,Tianzhe Chu,Li Sun,Pei Zhou,Zhuxiu Xu,Shenghua Gao,Yuexiang Zhai,Yanchao Yang,Yi Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 30 Pages
Abstract:Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold’em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold’em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, \pi_0.5 obtains the highest task completion rate ( 61.2% ), while \pi_0.5 and \pi_0 tie on scene-preserving success rate ( 47.5% ). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ( 34.3% ), while GPT 5.5 obtains the best average field-wise accuracy ( 66.8% ), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: this https URL.
[AI-3] Distilling Tabular Foundation Models for Structured Health Data
链接: https://arxiv.org/abs/2605.18702
作者: Aditya Tanna,Nassim Bouarour,Mohamed Bouadi,Vinay Kumar Sankarapu,Pratinav Seth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across 19 healthcare datasets, 6 TFM teachers, 4 student families, and several multi-teacher ensembles, we find that distilled students retain at least 90% of teacher AUC, outperforming teachers in some cases, while running at least 26\times faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.
[AI-4] PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
链接: https://arxiv.org/abs/2605.18697
作者: Stephen Mell,David Mell,Konstantinos Kallas,Steve Zdancewic,Osbert Bastani
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their end-to-end latency a critical bottleneck. In contrast to traditional applications, execution time is dominated by the external components, which cannot be handled by traditional language optimization systems, like optimizing compilers. To address this problem, we develop PopPy, a system that can uncover parallelization opportunities in Python applications that invoke these heavy external components, including those used in compound AI applications. PopPy supports a very expressive fragment of Python and requires minimal developer input to uncover parallelism. It combines an ahead-of-time compiler with a runtime, addressing three key challenges in extracting parallelism from Python applications: language complexity, dynamic dispatch, and variable mutation. On a set of real-world compound AI applications, PopPy achieves up to 6.4\times speedups in end-to-end execution time compared to standard Python execution while preserving the sequential program semantics. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Programming Languages (cs.PL) Cite as: arXiv:2605.18697 [cs.DC] (or arXiv:2605.18697v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.18697 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-5] Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap
链接: https://arxiv.org/abs/2605.18696
作者: Aditya Tanna,Yash Desai,Pratinav Seth,Mohamed Bouadi,Nassim Bouarour,Vinay Kumar Sankarapu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961 , close enough to 1 that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys +0.18% accuracy over the strongest single TFM at 253\times the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly \emphworse than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
[AI-6] SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
链接: https://arxiv.org/abs/2605.18693
作者: Yifan Zhou,Zhentao Zhang,Ziming Cheng,Shuo Zhang,Qizhen Lan,Zhangquan Chen,Zhi Yang,QianyuXu,Ronghao Chen,Huacan Wang,Sen Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.
[AI-7] Democratizing Large-Scale Re-Optimization with LLM -Guided Model Patches
链接: https://arxiv.org/abs/2605.18692
作者: Tinghan Ye,Arnaud Deza,Ved Mohan,El Mehdi Er Raqabi,Pascal Van Hentenryck
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. However, real-world environments are dynamic, with evolving business rules, previously overlooked constraints, and unforeseen perturbations. In such contexts, end users must rapidly re-optimize models to recover feasible and implementable solutions. This paper introduces an agentic re-optimization framework in which a large language model (LLM) acts as an OR expert, dynamically supporting end users through natural-language interaction. The LLM translates user prompts into structured updates of the underlying optimization model, selects suitable re-optimization techniques from an optimization toolbox, and solves the resulting instance to return implementable solutions. The toolbox leverages primal information, including historical solutions, valid inequalities, solver configurations, and metaheuristics, to accelerate re-optimization while preserving solution quality. The proposed framework enables interactive and continuous adaptation of deployed optimization models, reducing dependence on OR experts and improving the sustainability of decision-support systems. Extensive experiments on two complementary large-scale real-world case studies demonstrate the effectiveness and scalability of the proposed framework. The first considers online supply chain re-optimization, where solutions must be generated rapidly while remaining close to the deployed plan, whereas the second focuses on offline university exam scheduling, where solution quality is prioritized over runtime. Results show that the toolbox-driven architecture significantly improves computational efficiency through primal-based and solver-aware re-optimization techniques, while the structured patch-based updates improve interpretability and traceability of model modifications.
[AI-8] Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
链接: https://arxiv.org/abs/2605.18684
作者: Sanderson Oliveira de Macedo,Ronaldo Martins da Costa
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Preprint. Includes a generative AI use statement
Abstract:Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices. At the same time, language-model-based coding agents depend on reliable context, correctness criteria, and behavioral contracts to modify real systems with lower risk. This paper presents Reversa, a reverse documentation engineering framework for converting legacy software into traceable operational specifications for AI agents. Reversa organizes this process as a multi-agent pipeline: specialized agents map the project surface, analyze modules, extract implicit rules, synthesize architecture, write unit-level specifications, and review generated claims. The proposal emphasizes three mechanisms: traceability between code and specification, explicit confidence marking, and preservation of gaps for human validation. The framework is distributed as a this http URL CLI, installs skills across multiple agent engines, and uses a SHA-256 manifest to preserve modified files during update or uninstall operations. In addition to the architectural description, we report an exploratory case study on migrating an ATM from COBOL to Go, in which the pipeline produced 517 claims classified by an internal confidence index, 10 registered gaps, 53 Gherkin parity scenarios, and a reconstruction plan with 9 of 11 tasks completed at inventory time. Final parity validation and cutover were not completed in this study. We do not claim broad empirical superiority; we position the contribution with respect to the literature on reverse engineering, LLM-based documentation, and software agents, and propose an evaluation protocol with metrics for coverage, traceability, confidence, utility, and cost. Comments: Preprint. Includes a generative AI use statement Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.18684 [cs.SE] (or arXiv:2605.18684v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.18684 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sanderson Macedo Doc [view email] [v1] Mon, 18 May 2026 17:23:13 UTC (1,130 KB) Full-text links: Access Paper: View a PDF of the paper titled Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents, by Sanderson Oliveira de Macedo and Ronaldo Martins da CostaView PDFHTML (experimental)TeX Source view license Current browse context: cs.SE prev | next new | recent | 2026-05 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-9] Learning Quantifiable Visual Explanations Without Ground-Truth
链接: https://arxiv.org/abs/2605.18681
作者: Amritpal Singh,Andrey Barsky,Mohamed Ali Souibgui,Ernest Valveny,Dimosthenis Karatzas
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Explainable AI (XAI) techniques are increasingly important for the validation and responsible use of modern deep learning models, but are difficult to evaluate due to the lack of good ground-truth to compare against. We propose a framework that serves as a quantifiable metric for the quality of XAI methods, based on continuous input perturbation. Our metric formally considers the sufficiency and necessity of the attributed information to the model’s decision-making, and we illustrate a range of cases where it aligns better with human intuitions of explanation quality than do existing metrics. To exploit the properties of this metric, we also propose a novel XAI method, considering the case where we fine-tune a model using a differentiable approximation of the metric as a supervision signal. The result is an adapter module that can be trained on top of any black-box model to output causal explanations of the model’s decision process, without degrading model performance. We show that the explanations generated by this method outperform those of competing XAI techniques according to a number of quantifiable metrics. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.18681 [cs.AI] (or arXiv:2605.18681v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.18681 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-10] COOPO: Cyclic Offline-Online Policy Optimization Algorithm
链接: https://arxiv.org/abs/2605.18675
作者: Qisai Liu,Zhanhong Jiang,Joshua Russell Waite,Aditya Balu,Cody Fleming,Soumik Sarkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.
[AI-11] Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning
链接: https://arxiv.org/abs/2605.18674
作者: Michael Aichmüller,Simon Ståhlberg,Martin Funkquist,Hector Geffner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generalized planning aims to learn policies that generalize across collections of instances within a classical planning domain. Recent Graph Neural Network (GNN) approaches have learned nearly perfect policies for several domains. This work improves on the recently published idea of Iterated Width (IW) policies. Therein, the policy broadens its successor scope through an IW-lookahead search that can “jump” over multiple transitions, simplifying the problem structure. Yet, each transition is evaluated individually, leading to unscalable compute costs and expressivity limitations. Furthermore, although IW(1) is attractive because it scales linearly with the number of atoms, it becomes inefficient once thousands of objects are considered, as in the International Planning Competition (IPC) 2023 benchmark. We address both limitations. First, we introduce a vastly more efficient holistic encoding of the entire search tree. It jointly represents IW(1)-reachable states only by their relational differences to the current state, enabling Relational GNNs (R-GNNs) to score all transitions in a single forward pass. Second, we define Abstracted IW(1) to improve scaling through relational abstraction during novelty checks. Rather than testing fully instantiated atoms, it abstracts each atom by replacing all but one argument with its type. The original atom is novel if any of its abstracted forms is novel. This structural compression shifts novelty search scaling from atoms to objects, while preserving meaningful subgoal structure. We evaluate our contributions on the hyperscaling IPC 2023 benchmark and across diverse domains, including domains requiring features beyond the C_2 logic fragment. Our policies achieve new state-of-the-art performance, significantly surpassing prior work, including the classical planner LAMA.
[AI-12] Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
链接: https://arxiv.org/abs/2605.18672
作者: S.Bensalem,Y. Dong,M. Franzle,X. Huang,J. Kroger,D. Nickovic,A. Nouri,R. Roy,C. Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents – a structural consequence of how agent execution works, not a contingent limitation of current systems. The three dimensions that jointly constitute safe operation – semantic intent and policy compliance, environmental validity, and dynamical feasibility – each depend on a strictly distinct set of information that becomes available at different stages of execution. No single guardrail can certify all three. We argue that the community must respond with a contract-based architecture in which each safety dimension is enforced by an independently certified layer whose probabilistic guarantee satisfies the next layer’s assumption. We sketch such an architecture and derive the compositional system-level safety bounds it admits via the chain rule of probability. Three open problems stand between this and a deployable standard: bound estimation from non-i.i.d.\ traces, graceful degradation of contracts under deployment drift, and extension to multi-agent settings – the most important unfinished business in LLM agent runtime assurance.
[AI-13] AI for Auto-Research: Roadmap User Guide
链接: https://arxiv.org/abs/2605.18661
作者: Lingdong Kong,Xian Sun,Wei Chow,Linfeng Li,Kevin Qinghong Lin,Xuan Billy Zhang,Song Wang,Rong Li,Qing Wu,Wei Gao,Yingshuo Wang,Shaoyuan Xie,Jiachen Liu,Leigang Qu,Shijie Li,Lai Xing Ng,Benoit R. Cottereau,Ziwei Liu,Tat-Seng Chua,Wei Tsang Ooi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Project Page at this https URL GitHub Repo at this https URL
Abstract:AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as 15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding experiments, tables figures), Writing (paper writing), Validation (peer review, rebuttal revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.
[AI-14] KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture
链接: https://arxiv.org/abs/2605.18657
作者: Luis Balderas,José Alberto Rodríguez,Miguel Lastra,Antonio Arauzo-Azofra,José M. Benítez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to specialized classification problems remains constrained by the computational bottleneck of standard attention and the systematic omission of classical statistical knowledge. This technical report introduces KairosHope, a next-generation TSFM designed to reconcile massive generalization with analytical precision in classification tasks. The core of the proposal is the HOPE block, an architecture that replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. To enrich the inductive bias, a Hybrid Decision Head is introduced, which fuses deep latent representations with deterministic statistical features extracted via tsfeatures package. KairosHope undergoes self-supervised pre-training on the massive Monash archive, combining Masked Time Series Modeling (MTSM) and contrastive learning (InfoNCE). Its subsequent adaptation to the UCR benchmark datasets is conducted through a rigorous Linear Probing and Full Fine-Tuning (LP-FT) protocol to prevent catastrophic forgetting. Empirical results demonstrate superior performance in domains characterized by strict temporal causality such as HAR or Sensor data. Consequently, KairosHope establishes a robust and efficient framework for the adaptation of foundation models to time series analysis.
[AI-15] Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees
链接: https://arxiv.org/abs/2605.18654
作者: Aditya Tanna,Nassim Bouarour,Mohamed Bouadi,Vinay kumar Sankarapu,Pratinav Seth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data ( 21 features: +0.011 over CatBoost vs. 21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.
[AI-16] Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models
链接: https://arxiv.org/abs/2605.18635
作者: Aditya Tanna,Mitul Solanki,Mohamed Bouadi,Nassim Bouarour,Pratinav Seth,Vinay Kumar Sankarapu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Credit default prediction is a tabular learning problem with severe class imbalance, heterogeneous features, and tight latency budgets. Tabular Foundation Models (TFMs) approach this problem through in-context learning, which makes their predictions sensitive to how the context window is built. We benchmark four classical models and five TFMs on the Home Credit and Lending Club datasets, varying the context-construction strategy (seven options) and the context size (1K to 50K). On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of 5K to 10K examples, the strongest TFMs reach the AUC of classical baselines trained on the full data, while also recovering meaningful default-class recall that default-threshold GBDTs do not. We frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.
[AI-17] Position: Weight Space Should Be a First-Class Generative AI Modality
链接: https://arxiv.org/abs/2605.18632
作者: Zhangyang Wang,Peihao Wang,Kai Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AI systems routinely improve or create other AI systems
Abstract:Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a five-stage pipeline, survey applications where the approach is already practical, and clarify current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open. Our goal is to shift the community’s default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely improve or create other AI systems.
[AI-18] SCICONVBENCH: Benchmarking LLM s on Multi-Turn Clarification for Task Formulation in Computational Science
链接: https://arxiv.org/abs/2605.18630
作者: Nithin Somasekharan,Youssef Hassan,Shiyao Lin,Gihan Panapitiya,Patrick Emami,Anurag Acharya,Sameera Horawalavithana,Shaowu Pan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:
Abstract:Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi- turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and par- tial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (in- consistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM per- formance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs fre- quently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at this https URL.
[AI-19] Learning Lifted Action Models from Traces with Minimal Information About Actions and States KR2026
链接: https://arxiv.org/abs/2605.18627
作者: Jonas Gösgens,Niklas Jansen,Hector Geffner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted at KR2026
Abstract:It has been recently shown that lifted STRIPS models can be learned correctly and efficiently from action traces alone; i.e., applicable action sequences from a hidden STRIPS model. The result is remarkable because the states are not assumed to be observable at all, and yet it is not practical enough as STRIPS actions include arguments that are not needed for selecting the actions. This shortcoming has been addressed by assuming that the action traces come instead from a hidden STRIPS+ model where some action arguments are implicit in the hidden action preconditions. A limitation of this approach, however, is that it assumes that the states are fully observable. In this work, we relax these restrictions and consider the problem of learning STRIPS+ action domains from traces in a more general context where the traces carry partial information about both actions and states. In particular, we formulate algorithms and completeness results for three general cases, all of which assume full observability of selected action arguments. In the first case, no observability of the state is assumed; in the second case, full observability of some state predicates is assumed, and in the third case, local observability of some state predicates is assumed instead. Given a STRIPS+ domain, these results characterize the conditions under which an equivalent domain can be learned from traces. Experimental results are reported.
[AI-20] Stochastic Penalty-Barrier Methods for Constrained Machine Learning
链接: https://arxiv.org/abs/2605.18618
作者: Adam Bosák,Andrii Kliachkin,Jana Lepšová,Gilles Bareilles,Jakub Mareček
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Constrained machine learning enables fairness-aware training, physics-informed neural networks, and integration of symbolic domain knowledge into statistical models. Despite its practical importance, no general method exists for the non-convex, non-smooth, stochastic setting that arises naturally in deep learning. We propose the Stochastic Penalty-Barrier Method (SPBM), which extends classical penalty and barrier methods to this setting via exponential dual averaging, a~stabilized penalty schedule, and the Moreau envelope to handle non-smoothness. Experiments across multiple settings show that SPBM matches or outperforms existing constrained optimization baselines while incurring only linear runtime overhead compared to unconstrained Adam for up to 10,000 constraints.
[AI-21] SAME: A Semantically-Aligned Music Autoencoder
链接: https://arxiv.org/abs/2605.18613
作者: Julian D. Parker,Zach Evans,CJ Carr,Zachary Zukowski,Josiah Taylor,Matthew Rice,Jordi Pons
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096 \times temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a tranformer-based backbone with set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
[AI-22] Latent Action Reparameterization for Efficient Agent Inference
链接: https://arxiv.org/abs/2605.18597
作者: Wenhao Huang,Qingwen Zeng,Qiyue Chen,Zijie Guo,Yu Sun,Cheng Yang,Siru Ouyang,Jiri Gesi,Fang Wu,Jiayi Zhang,Huaming Chen,Bang Liu,Xiangru Tang,Chenglin Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.
[AI-23] Not What You Asked For: Typographic Attacks in Household Robot Manipulation
链接: https://arxiv.org/abs/2605.18593
作者: Ali Iranmanesh,Peng Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages, 1 figure, IEEE conference format
Abstract:Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success Rate (ASR) of 67.8%, rising to 70.0% among fully successful episodes, under uncontrolled viewing angles and occlusion with no perceptual optimization. Critically, we find that perceptual errors propagate through the persistent 3D semantic map to produce kinetic failures, defined here as physically executed grasping and transport of the wrong object driven by an adversarially poisoned semantic state. In these cases, the robot physically grasps and delivers the wrong object to a target receptacle. These results establish typographic misclassification as a real, measurable, and physically consequential threat to the safety of modular manipulation pipelines that prior typographic attack research has left unexamined. Comments: 10 pages, 1 figure, IEEE conference format Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO) Cite as: arXiv:2605.18593 [cs.CR] (or arXiv:2605.18593v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.18593 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-24] Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation ICML2026
链接: https://arxiv.org/abs/2605.18591
作者: Mingfei Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026
Abstract:Natural policy gradients improve optimization by accounting for the geometry of distribution space, but their practical use is limited by the cost of estimating and inverting the Fisher matrix. We present Randomized Advantage Transformation (RAT), a method for estimating Tikhonov-regularized natural policy gradients via direct backpropagation. By applying the Woodbury formula, we reformulate the regularized natural policy gradients as vanilla policy gradients with a transformed advantage. RAT computes this transformation efficiently via randomized block Kaczmarz iterations on on-policy mini-batches, avoiding explicit Fisher construction, conjugate-gradient solvers, and architecture-specific approximations. We provide convergence guarantees for RAT and demonstrate empirically that it matches or exceeds established natural-gradient methods across continuous and visual control benchmarks, while remaining simple to implement and compatible with various architectures.
[AI-25] When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State
链接: https://arxiv.org/abs/2605.18580
作者: Peiying Zhu,Sidi Chang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Outcome-only evaluation can certify economically unsafe agents: a policy can hit a business KPI while violating deployable behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment. Across a two-hotel benchmark and a compact hidden-budget bidding task, reward-only PPO variants miss trace alignment; revealing hidden state reduces label uncertainty; deterministic copy collapses uncertainty; and trace-prior or corrected history policies better preserve price or bid distributions. Pure behavior cloning is nearly enough for symmetric imitation, while Trace-Prior RL adds bounded adaptation under capacity asymmetry. The contribution is an evaluation and benchmark paradigm, not a new optimizer or a universal claim about MARL Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) ACMclasses: I.2.6; I.2.8; I.2.11; J.4 Cite as: arXiv:2605.18580 [cs.AI] (or arXiv:2605.18580v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.18580 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-26] Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning
链接: https://arxiv.org/abs/2605.18570
作者: Yan Jiao,Jingran Xu,Pin-Han Ho,Limei Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Cross-domain knowledge alignment is essential for integrating heterogeneous medical systems, yet existing approaches typically treat entity alignment as a static matching problem, ignoring query context and cross-system asymmetry. This limitation is particularly critical in integrative medical settings, where correspondence between concepts is inherently context-dependent, non-bijective, and direction-sensitive. In this paper, we propose Query-Conditioned Entity Alignment (QCEA), which reformulates entity alignment as a query-conditioned correspondence problem. Instead of learning a fixed mapping between entity representations, QCEA treats the textual description of a source entity as a query and ranks candidate entities in the target graph, enabling context-dependent alignment. The framework integrates semantic encoding, graph-based representation learning, and a direction-aware transformation module to capture asymmetric and many-to-many correspondence across heterogeneous knowledge systems. We evaluate QCEA on TCM–WM knowledge graphs derived from SymMap, covering both symptom alignment and herb–molecule alignment tasks. Experimental results show consistent improvements over representative baselines, particularly on rank-sensitive metrics such as Hit@K and MRR. Furthermore, downstream retrieval-augmented generation (RAG) experiments demonstrate that improved alignment leads to better evidence retrieval, stronger grounding, and higher answer accuracy. These findings highlight that alignment is not merely a data integration step, but a key factor that shapes knowledge accessibility and reliability in cross-system medical reasoning. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.18570 [cs.AI] (or arXiv:2605.18570v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.18570 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-27] Key-Gram: Extensible World Knowledge for Embodied Manipulation
链接: https://arxiv.org/abs/2605.18556
作者: Jingjing Fan,Siyuan Li,Botao Ren,Zhidong Deng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures
Abstract:Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its O(1) lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both \pi_0 and \pi_0.5 backbones, with average relative gains of 29.5%/9.9% on RoboTwin2.0, 35.8%/4.5% on LIBERO-Plus transfer without target-domain fine-tuning, and 15.4%/8.1% on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.
[AI-28] VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
链接: https://arxiv.org/abs/2605.18547
作者: Linan ZHU,Zihao Zhai,Xiao Han,Yuqian Fu,Xiangfan Chen,Xiangjie Kong,Guojiang Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers’ emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. VISAFF utilizes a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker’s emotional visual cues without heavy training overheads. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets demonstrate that VISAFF achieves highly competitive performance compared to state-of-the-art methods in a tuning-free setting, significantly enhancing computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at this https URL.
[AI-29] Probing for Representation Manifolds in Superposition
链接: https://arxiv.org/abs/2605.18537
作者: Alexander Modell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 19 pages, 7 figures
Abstract:This paper introduces the Manifold Probe, a supervised method for discovering representation manifolds in superposition. The method generalizes linear regression probes by learning the space of features of a concept that can be linearly predicted from the representations, and then learning the directions used to encode them. We demonstrate the probe on representations of time and space in Llama 2-7b, finding manifolds which linearly represent an interpretable set of features in each case. In the case of time, we show that by steering along the manifold, we can influence the model’s completions about the years in which famous songs, movies and books were released, providing evidence that the Manifold Probe can discover manifolds which are causally involved in model behaviour.
[AI-30] AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
链接: https://arxiv.org/abs/2605.18529
作者: Zhenlin Wei,Pu Jian,Yingzhuo Deng,Xiaohan Wang,Jiajun Chai,Zhexin Hu,Wei Lin,Shanbin Zhang,Guojun Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals – from verifier outcomes, peer rollouts, or reference feedback – into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.
[AI-31] A Practical Noise2Noise Denoising Pipeline for High-Throughput Raman Spectroscopy
链接: https://arxiv.org/abs/2605.18511
作者: David Martin-Calle(ILM,UCBL,CNRS),Cesar Alvarez Llamas(ILM,UCBL,CNRS),Vincent Motto- Ros(ILM,UCBL,CNRS),Christophe Dujardin(ILM,UCBL,CNRS,IUF),Jérémie Margueritat(ILM,UCBL,CNRS),David Rodney(ILM,UCBL,CNRS)
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Signal Processing (eess.SP)
备注:
Abstract:A lightweight and reproducible denoising pipeline for high-throughput Raman spectroscopy is presented. The approach relies on a one-dimensional convolutional autoencoder trained using a Noise2Noise strategy, requiring neither external spectral libraries nor high signal-to-noise reference spectra for training. From a reduced training subset composed of repeated short-exposure acquisitions, the model learns to reconstruct Raman spectra while efficiently suppressing stochastic noise. The method is evaluated on a heterogeneous mineral sample, using both quantitative spectral fidelity metrics (RMSE, SNR, SSIM) and task-oriented criteria based on unsupervised K-means classification. Results demonstrate that integration times as short as 5 ms per spectrum, which are typically insufficient for reliable interpretation, yield denoised spectra with high fidelity to the reference data while preserving chemically coherent maps. This work provides a practical trade-off between spectral quality and acquisition speed, enabling fast, adaptable Raman workflows compatible with routine laboratory use. It also offers a transferable framework for other one-dimensional spectroscopic modalities.
[AI-32] DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization
链接: https://arxiv.org/abs/2605.18508
作者: Chengpeng Hu,Yingqian Zhang,Hendrik Baier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Programmatic reinforcement learning (PRL) offers an interpretable alternative to deep reinforcement learning by representing policies as human-readable and -editable programs. While gradient-based methods have been developed to optimize continuous relaxations of programs, they face a significant performance drop when converting the continuous relaxations back into discrete programs. Post-hoc discretization can discard optimized branches and parameters in a program, which results in a collapse of policy expressivity and lowered task performance, leading in turn to a need for additional fine-tuning. To overcome these limitations, we propose Differentiable Discrete Programmatic Reinforcement Learning (DiPRL), a method that learns programmatic policies that become nearly discrete during training, avoiding a separate post-hoc fine-tuning stage. We first analyze the inherent risks of performance drop introduced by post-hoc discretization of gradient-based methods. Then, we introduce programmatic architecture entropy regularization, which enables smooth, differentiable training that encourages convergence toward a discrete program. DiPRL maintains the efficiency of gradient-based optimization while mitigating the risks of post-hoc discretization. Our experiments across multiple discrete and continuous RL tasks demonstrate that DiPRL can achieve strong performance via interpretable programmatic policies.
[AI-33] DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs
链接: https://arxiv.org/abs/2605.18498
作者: Jing Wang,Hongxuan Lu,Jazze Young,Shu Wang,Zhimin Xin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Expert specialization in Mixture-of-Experts (MoE) models remains poorly understood, with traditional evaluations conflating architectural load-balancing with functional specialization. We introduce DBES, a comprehensive diagnostic framework combining a multi-domain benchmark with five theoretically grounded metrics: Routing Specialization, Normalized Effective Rank, Domain Isolation, Routing Stiffness Score, and N-gram Expertise measures. Critical findings demonstrate distinct specialization paradigms across models: Qwen-series exhibit modular specialization with high domain isolation, while DeepSeek and GLM employ distributed collaboration. However, we emphasize that specialization is a diagnostic dimension, necessary but not sufficient for downstream performance. Most crucially, interventional evidence validates the actionability of these metrics: by using DBES to identify high-specialization expert paths during domain-specific post-training, we achieved 66% to 94.48% improvement in specialized domains with only 15% of original training resources, demonstrating that these diagnostic tools can be converted into concrete optimization operators. This work provides the first systematic methodology for evaluating expert specialization independently of accuracy metrics, offering crucial insights for the design and post-training optimization of next-generation MoE systems. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.18498 [cs.LG] (or arXiv:2605.18498v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.18498 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-34] Modality vs. Morphology: A Framework for Time Series Classification for Biological Signals
链接: https://arxiv.org/abs/2605.18483
作者: Jordan Tschida,Matthew Yohe,Edward Kane,Gavin Jager,Emma J. Reid,Tony G. Allen,Mark Story,Leanne Thompson,Joe Hoskins,Brandon Schreiber,Stan Seiferth,Scott Dolvin,David Cornett
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series classification (TSC) of biological signals has progressed from handcrafted, modality-specific approaches to deep architectures capable of representing the diverse waveform structures of underlying physiological processes (i.e., morphology). This review introduces a unified morphology–modality framework that connects waveform structure to a methodological design, revealing how spikes, bursts, oscillations, slow drift, and hierarchical rhythms inform model design. By analyzing electroencephalography, electromyography, electrocardiography, photoplethysmography, and ocular modalities (electrooculography, pupillometry, eye-tracking), the review demonstrates how morphology determines preprocessing and modeling strategies. Integrating evidence across these biological signals, the framework reveals that morphology, not model class, most strongly determines performance and interpretability. This provides insight into why deep models succeed when their inductive biases align with underlying waveform dynamics. This review also identifies future work including morphological data augmentation and evaluation metrics to improve generalization. Together, these insights position morphology-aware modeling as a unifying principle for developing generalizable, interpretable, and physiologically meaningful TSC models across biological signals.
[AI-35] OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models
链接: https://arxiv.org/abs/2605.18481
作者: Chiara Maria Russo,Simone Carnemolla,Simone Palazzo,Daniela Giordano,Concetto Spampinato,Matteo Pennisi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Interpreting the decisions of deep image classifiers remains challenging, particularly in black-box settings where model internals are inaccessible. We introduce OCCAM, a framework for open-set causal concept explanation and ontology induction in vision models. OCCAM discovers visual concepts in an open-set manner, localizes them via text-guided segmentation, and performs object-level interventions by removing concepts to measure changes in class confidence, estimating each concept’s causal contribution. Beyond local explanations, OCCAM aggregates interventional evidence across a dataset to induce a structured concept ontology that captures how classifiers globally organize visual concepts. Reasoning over this ontology reveals consistent dependencies between concepts, exposes latent causal relations, and uncovers systematic model biases. Experiments on Broden and ImageNet-S across multiple classifiers show that OCCAM improves explanation quality in open-set black-box settings while providing richer global insight than per-image attribution methods.
[AI-36] GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets
链接: https://arxiv.org/abs/2605.18475
作者: Zhangyang Yao,Haiyan Zhao,Haoyu Wang,Tianbo Huang,Lihua Zhang,Xu Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixed-precision quantization improves the budget–accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B–32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
[AI-37] When Fireflies Cluster; Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization
链接: https://arxiv.org/abs/2605.18460
作者: MKA Ariyaratne,Azwirman Gusrialdi,Yury Nikulin,Jaakko Peltonen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 34 pages, 19 Figures
Abstract:This work presents a novel variant of the Firefly Algorithm (FA) for data clustering, addressing limitations of traditional methods like K-Means that struggle with non-uniform cluster shapes, densities, and the need for pre-defining the number of clusters. The proposed algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a novel TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. Application to robotic sensor networks highlights its practical value, with experiments showing improved clustering quality and reduced intra-cluster path distances compared to K-Means. These results confirm the algorithm’s robustness in complex spatial clustering tasks, with potential for future extensions to higher-dimensional and adaptive scenarios.
[AI-38] Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework
链接: https://arxiv.org/abs/2605.18454
作者: Chengpeng Hu,Yingqian Zhang,Hendrik Baier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:
Abstract:Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architectures and non-interpretable policy decisions can lead to critical trust and usability concerns for human decision makers. In addition, the computational requirements of DNNs can further hinder practical deployment in resource constrained environments. In this work, we propose ProRL, a novel interpretable programmatic reinforcement learning framework that achieves high-performance scheduling with human-readable and editable programmatic policies (i.e., programs). We first introduce a domain-specific language for scheduling (DSL-S) to represent scheduling strategies as structured programs. ProRL then explores the program space defined by DSL-S using local search to identify incomplete programs, which are subsequently completed by learning their parameters via Bayesian optimization. ProRL learns which scheduling heuristic rules to select, and hence, it naturally incorporates existing heuristics already used in industrial scenarios. Experiments on widely used benchmark instances demonstrate the strong performance of ProRL against existing heuristics and DRL baselines. Furthermore, ProRL performs well under strongly constrained computational resources, such as training with only 100 episodes. Our code is available at this https URL.
[AI-39] Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights AAMAS2026
链接: https://arxiv.org/abs/2605.18449
作者: Ken Ming Lee,Paul Barde,Maxime C. Cohen,Derek Nowrouzezahrai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Proceeding of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract:Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.
[AI-40] Building Reliable Arithmetic Multipliers Under NBTI Aging and Process Variations
链接: https://arxiv.org/abs/2605.18444
作者: Masoud Heidary,Biresh Kumar Joardar
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Hardware aging poses a significant challenge for integrated circuits (ICs), leading to performance degradation and eventual failure. In this work, we focus on the aging of arithmetic multipliers, which are a cornerstone of modern computing systems including in CPUs, GPUs, and FPGAs, as well as AI accelerators like systolic arrays. In particular, AI workloads, which rely predominantly on multiplications, can accelerate Negative Bias Temperature Instability (NBTI) effects in multipliers. This paper presents a novel aging mitigation technique that leverages the signinvariance property of multiplication. By selectively applying 2s complement transformations to inputs, the method redistributes stress across transistors, reducing the effects of NBTI aging. The proposed method is also integrated into systolic arrays, a common AI accelerator, to demonstrate its efficiency in a high-throughput AI accelerator. Experimental evaluations using Cadence tools show better lifetime compared to natural aging (with no mitigation) baseline, while introducing negligible area and delay overheads.
[AI-41] Prompts Dont Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control EMNLP2026
链接: https://arxiv.org/abs/2605.18414
作者: Rohith Uppala
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 tables, 1 figure. Planning to submit to EMNLP 2026 Industry Track
Abstract:Large language models increasingly operate as autonomous agents that select and invoke tools from large registries. We identify a critical gap: when unauthorized tools are visible in an agent’s context, models select them in adversarial scenarios – even when explicitly instructed otherwise. We propose a governed MCP proxy that enforces attribute-based access control (ABAC) at two points: tool discovery, where unauthorized tools are removed from the model’s context window, and tool invocation, where a second check blocks any unauthorized call. Across three models (Qwen 2.5 7B, Llama 3.1 8B, Claude Haiku 3.5) and 150 adversarial tasks spanning four attack categories, our proxy reduces unauthorized invocation rate (UIR) to 0% while adding under 50ms median latency. Prompt-based restrictions reduce UIR by only 11–18 percentage points, leaving substantial residual risk. Our results show that architectural enforcement – not prompting – is necessary for reliable tool access control in deployed agentic systems.
[AI-42] Diagnosing Korean-Language LLM Political Bias via Census-Grounded Agent Simulation
链接: https://arxiv.org/abs/2605.18395
作者: Sungwoo Kang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) exhibit systematic political biases in voter simulations, but their underlying mechanisms and cross-lingual generalizations remain poorly understood. We introduce Dynamo-K, a census-grounded simulation framework evaluating Korean-language LLM political behavior across four models on six Korean elections (2017-2025). Using this framework, we identify three systematic failure modes: (1) progressive bias in moderate agents, where explicit mitigation reduces Mean Absolute Error (MAE) by 5.2 times; (2) model-dependent third-party salience collapse, distinguishing between salience failure and decision bias; and (3) regional polarization collapse, where models bidirectionally under-predict historical party strongholds. To address these failures, we demonstrate that scenario reframing recovers 62% of 2017 MAE by restoring third-party visibility. Furthermore, we introduce a learned reweighting adapter that successfully calibrates opposing-valence models without relying on candidate names at train or test time. Validating our diagnostic framework, Dynamo-K accurately predicts 3/3 presidential winners - including a 2.1%p MAE on the highly contested 0.73%p-margin 2022 race - and correctly identifies the dominant party in a held-out local election. The pipeline is open-source and provides a scalable, cost-effective method for diagnosing LLM political behavior. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.18395 [cs.CY] (or arXiv:2605.18395v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2605.18395 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-43] Graph Hierarchical Recurrence for Long-Range Generalization
链接: https://arxiv.org/abs/2605.18387
作者: Stefano Carotti,Marco Pacini,Alessio Gravina,Davide Bacciu,Bruno Lepri,Sebastiano Bontorin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks (GNNs) and Graph Transformers (GTs) are now a fundamental paradigm for graph learning, combining the representation-learning capabilities of deep models with the sample efficiency induced by their inductive biases. Despite their effectiveness, a large body of work has shown that these models still face fundamental limitations in tasks that require capturing correlations between distant regions of a graph. To address this issue, we introduce Graph Hierarchical Recurrence (GHR), a novel framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. We also show that the limitations of existing models are even more pronounced in out-of-range generalization, where test instances involve interactions over distances longer than those observed during training. By contrast, despite its simple design, GHR provides three key advantages: strong performance on long-range dependencies, improved out-of-range generalization, and high parameter efficiency. To corroborate these claims, we show that across a broad set of long-range benchmarks, GHR consistently outperforms existing graph models while using as little as 1% of the parameters of current state-of-the-art models. These results suggest a complementary direction to the current trend of scaling architectures to obtain graph foundation models, indicating that increased model capacity alone may not be sufficient for generalization.
[AI-44] owards Ubiquitous Mapping and Localization for Dynamic Indoor Environments
链接: https://arxiv.org/abs/2605.18385
作者: Halim Djerroud,Nico Steyn,Olivier Rabreau,Patrick Bonnin,Abderraouf Benali
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly encountered in traditional SLAM systems, such as sensitivity to environmental changes and reliance on mobile unit sensors. This fixed-sensor approach enables real-time, comprehensive mapping, enhancing the localization accuracy and responsiveness of robots operating within the environment. The centralized map generated by UbiSLAM is continuously updated, providing robots with an accurate global view, which improves navigation, minimizes collisions, and facilitates smoother human-robot interactions in shared spaces. Beyond its advantages, UbiSLAM faces challenges, particularly in ensuring complete spatial coverage and managing blind spots, which necessitate data integration from the robots themselves. In this paper we discuss potential solutions, such as automatic calibration for optimal camera placement and orientation, along with enhanced communication protocols for real-time data sharing. The proposed model reduces the computational load on individual robotic units, allowing less complex robotic platforms to operate effectively while enhancing the robustness of the overall system. Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) MSC classes: 68T40, 68T45, 93E11, 90C10 ACMclasses: I.2.9; I.2.10; I.4.8 Cite as: arXiv:2605.18385 [cs.RO] (or arXiv:2605.18385v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2605.18385 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Volume 1, pages 537-548, SciTePress, 2025. ISBN: 978-989-758-737-5, ISSN: 2184-433X Related DOI: https://doi.org/10.5220/0013245400003890 Focus to learn more DOI(s) linking to related resources Submission history From: Halim Djerroud [view email] [v1] Mon, 18 May 2026 13:28:23 UTC (300 KB) Full-text links: Access Paper: View a PDF of the paper titled Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments, by Halim Djerroud and 4 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.RO prev | next new | recent | 2026-05 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-45] QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
链接: https://arxiv.org/abs/2605.18380
作者: Anthony G. Cohn,Robert E. Blackwell
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 74 pages, 20 figures
Abstract:We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for QSTR calculi, Point Algebra (PA), Allen’s Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation including prefix/infix, words/symbols/nonce terms and schematic descriptions for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing but none can consistently answer all questions correctly. Performance varies sharply by calculus, with PA being the most straightforward, and RCC-22 the most difficult. We release the benchmark, and our results under an open licence to facilitate further assessment of qualitative spatio/temporal reasoning in LLMs.
[AI-46] Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers
链接: https://arxiv.org/abs/2605.18374
作者: Soheyl Massoudi,Gabriel Apaza,Milad Habibi,Mark Fuge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.
[AI-47] Same Signal Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents
链接: https://arxiv.org/abs/2605.18332
作者: Wei Ma,Zhi Chen,Jingxu Gu,Tianling Li,Shangqing Liu,Lingxiao Jiang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM’s 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.18332 [cs.SE] (or arXiv:2605.18332v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.18332 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Wei Ma [view email] [v1] Mon, 18 May 2026 12:49:18 UTC (125 KB)
[AI-48] Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows
链接: https://arxiv.org/abs/2605.18327
作者: Dhairya Dalal,Endre Sara,Ben Yemini,Christine Miller,Shmuel Kliger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents deployed into SRE workflows currently derive their understanding of environment state from raw observability telemetry at query time, paying a semantic-interpretation tax in tokens, latency, and inferential reliability. We propose Causely, a causal intelligence layer that maintains a structured representation of environment topology, attribute dependencies, and causal relationships that are anchroed to a ontological representation of the managed environment. Causely transforms raw telemetry into a live, queryable model providing the semantic and causal foundation AI agents require to diagnose, evaluate impact, and act safely in production. We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application. Our experiments compare four agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends). Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline. On the active-fault scenario, causal grounding reduces mean time-to-diagnosis by 63%, mean token consumption by 60%, and mean tool-call count by 78%, compressing the investigation footprint by 4.8 \times and lowering direct API cost per run by 57%; root-cause-diagnosis accuracy rises from 75% to 100%.
[AI-49] ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization
链接: https://arxiv.org/abs/2605.18320
作者: Yifei Chen,Shaoqin Zhu,Xiaoqiang Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline reinforcement learning methods typically enforce strict constraints to ensure safety; yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism “densifies” high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between conservative cloning and optimistic expansion signals. We instantiate this framework as ISEP-FM using Conditional Flow Matching utilizing classifier-free guidance to effectively capture the interpolated value signal.
[AI-50] Alignment Dynamics in LLM Fine-Tuning
链接: https://arxiv.org/abs/2605.18309
作者: Yuhan Huang,Huanran Chen,Yinpeng Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a \textbf\colorred!60!black Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a \textbf\colorgreen!60!black Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a \textbfRehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.
[AI-51] CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories
链接: https://arxiv.org/abs/2605.18284
作者: Divya Chukkapalli,Thejesh Avula,Aditya Aggarwal,Harsimran Singh,Amith Tallanki
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Software repositories accumulate large amounts of unstructured knowledge in commit messages, pull-request discussions, and issue threads, but developers and AI coding assistants rarely reuse this history effectively. Recent work on typed-memory architectures for LLM agents (MemGPT, generative agents, and the PlugMem module of Yang et al.) argues that agent memory should be distilled, typed knowledge rather than raw interaction text. We adapt that stance to a software repository’s own git history under a constrained regime: deterministic, dependency-free, local-only, no embeddings. We present CommitDistill, an open-source Python prototype that mines a local git history into typed knowledge units (Facts, Skills, Patterns) using deterministic regex and surfaces them through a TF-IDF retriever with a calibrated silence threshold (theta = 2.5) that abstains on out-of-distribution queries. The artefact is a trust-instrumented memory substrate: deterministic, no external service, inspectable plain-JSON store, tunable abstention. A case study on five public repositories spanning Python, JavaScript, C, and Java (25,000 commits, 1,167 extracted units) reports useful-precision 0.525 at Cohen’s kappa = 0.633 on 40 dual-annotated Python units. The decisive finding is budget-constrained retrieval: at a 256-character per-query budget, CommitDistill reaches 0.750 hit-rate on a 12-query benchmark against BM25’s 0.333 and git log --grep’s 0.083. On a four-arm paired LLM-as-judge evaluation (n=200 time-travel bug-fixes, two judges) covering control, CommitDistill, a body-budget-matched CD-Hybrid, and BM25, no condition produces a statistically detectable lift over control on the headline mean and CD-Hybrid is indistinguishable from BM25 head-to-head. Extraction over 10,000 commits completes in under 4 seconds on a laptop. Source, annotations, baselines, and a reproducibility script accompany this paper.
[AI-52] Privacy Preserving Reinforcement Learning with One-Sided Feedback IJCAI ECAI2026
链接: https://arxiv.org/abs/2605.18246
作者: Lin William Cong,Guangyan Gan,Hanzhang Qin,Zhenzhen Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI-ECAI 2026
Abstract:We study reinforcement learning (RL) in multi-dimensional continuous state and action spaces with one-sided feedback, where the agent receives partial observations of the state and obtains reward information for only a subset of the state-action space at each time step. This setting introduces substantial challenges in both learning efficiency and privacy preservation. To address these challenges, we propose POOL, a novel privacy-preserving RL algorithm. We conduct a comprehensive theoretical analysis of POOL, deriving a sample complexity bound that matches the known lower bounds for non-private RL. Here, E_rho denotes the privacy parameter, H is the time horizon, and alpha is the optimality-gap parameter. Our findings show that it is possible to enforce strong privacy guarantees while maintaining high learning efficiency, marking a significant step toward practical, privacy-aware RL in multi-dimensional environments with one-sided feedback.
[AI-53] Are Sparse Autoencoder Benchmarks Reliable?
链接: https://arxiv.org/abs/2605.18229
作者: David Chanin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of k -sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.
[AI-54] A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders
链接: https://arxiv.org/abs/2605.18224
作者: Zegu Zhang,Jianhua Peng,Jian Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preliminary theory note
Abstract:This note studies exact constant collapse in variational autoencoders, where the encoder mean becomes independent of the input. The goal is to make this specific failure mode pre-designable, monitorable during training, and certifiable after training. The prior is kept as the standard Gaussian. Given a fixed teacher posterior, we attach to the latent mean a fixed simplex witness head. The resulting teacher-student alignment loss has an exact constant-predictor baseline equal to the teacher information. If the alignment loss is below this baseline, the latent mean cannot be input-independent constant collapsed. The simplex witness also has a closed-form inverse. Any full-support teacher posterior can be represented by embedding its centered log-odds into the latent space. This gives an explicit latent energy cost and explains when the alignment loss can be made small. A computable view gap handles the case where teacher targets are computed from a different view. Thus exact constant collapse is converted from an after-the-fact training pathology into a design-and-certificate problem. Comments: Preliminary theory note Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.18224 [cs.LG] (or arXiv:2605.18224v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.18224 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-55] Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models
链接: https://arxiv.org/abs/2605.18202
作者: Samuele Bortolotti,Emanuele Marconato,Andrea Pugnana,Andrea Passerini,Stefano Teso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Neuro-Symbolic Concept-based Models (NeSy-CBMs) are a family of architectures that integrate neural networks with symbolic reasoning for enhanced reliability in high-stakes applications. They work by first extracting high-level concepts from the input and then inferring a task label from these compatibly with given logical constraints. Yet, their label and concept predictions can be overconfident, making it difficult for stakeholders to gauge when the model’s decisions can be trusted. We address this issue by integrating ideas from Conformal Prediction (CP), a framework providing rigorous, distribution-free coverage guarantees. We formalize three desiderata – consistency, coverage, and conciseness – that any conformal method for NeSy-CBMs should satisfy, and show that existing approaches fall short of at least one. We then introduce COCOCO, a post-hoc framework that conformalizes concepts and labels jointly and reconciles them via a single deduction-abduction revision step. COCOCO satisfies all three desiderata, retains distribution-free coverage, is robust to imperfect knowledge and supports user-specified size budgets. Our experiments on 8 data sets highlight how COCOCO compares favorably against competitors and natural baselines in terms of performance and set size.
[AI-56] Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
链接: https://arxiv.org/abs/2605.18191
作者: Guining Cao,Jiaxin Peng,Chu Zeng,Yu Zhao,Shuangyong Song,Yongxiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Work in progress
Abstract:Current reinforcement learning(RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, outcomes that are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), a RL method that is more suitable for open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal, it preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group, all of these reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on role-playing task, experiments show that PPR-GDE achieves a better alignment quality as well as expressive diversity than strong RL baselines. Further analysis shows that pairwise preference is critical for preference alignment in subjective perspective, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.
[AI-57] Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLM s
链接: https://arxiv.org/abs/2605.18172
作者: Junyu Pan,Yansen Wang,Enze Zhang,Baoliang Lu,Weilong Zheng,Dongsheng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.
[AI-58] Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework
链接: https://arxiv.org/abs/2605.18150
作者: Mengyu Sun,Ziyuan Yang,Zunlong Zhou,Junxu Liu,Haibo Hu,Yi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.
[AI-59] Evidence-Grounded Frontier Mapping and Agent ic Hypothesis Generation in Nanomedicine
链接: https://arxiv.org/abs/2605.18144
作者: Christiaan G.A. Viviers,Koen de Bruin,Mirre M. Trines,Ayla M. Hokke,Roy van der Meel,Avi Schroeder,Twan Lammers,Willem J.M. Mulder,Fons van der Sommen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Nanomedicine research spans delivery chemistry, immunology, imaging, biomaterials, and disease-specific translational science, yet its conceptual design space remains fragmented across a large and heterogeneous literature. To date, artificial intelligence in nanomedicine has focused primarily on property prediction and formulation optimization, with much less attention to evidence-grounded discovery support at the level of research direction selection. We introduce pArticleMap, a literature-mapping and research-hypothesis-generation system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow for grounded ideation. Rather than forecasting future concept co-occurrence, pArticleMap targets low-density article-level bridge regions and cluster interfaces, then generates and scores citation-grounded hypotheses with large language models in an agentic setup. We evaluate the system with a retrospective realization benchmark (generate later literature under a historical cutoff) and a blinded human reader assessment layer across cue-conditioned nanomedicine tasks. Across 4 selected retrospective bundles, pArticleMap generated ideas and selected task-retained hypotheses (winner ideas) under the benchmark protocol. For task-level retained hypotheses, a pooled gold recovery rate of 10.8% was obtained, with a recall@10 of 15.9% and a future-neighborhood rate of 61.0%, indicating that the system often reached the correct forward-looking neighborhood (paper ideas) even without exact paper-level recovery. Human-agent agreement is modest overall, indicating that internal scoring is useful as a support signal but does not replace expert judgment. These results position pArticleMap as a conservative, evidence-grounded research assistant for nanomedicine.
[AI-60] Generative AI and the Productivity Divide: Human-AI Complementarities in Education
链接: https://arxiv.org/abs/2605.18143
作者: Lihi Idan,Bharat Anand
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants-analogs of early-career knowledge workers-were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the distribution of gains was highly uneven. Improvements were not predicted by GPA or prior knowledge, but by \textitAI Interaction Competence (AIC) – the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity while introducing a new axis of capability inequality. Managerially, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.
[AI-61] POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection
链接: https://arxiv.org/abs/2605.18128
作者: Suofei Zhang,Yaxuan Zheng,Haifeng Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing Multivariate Time Series Anomaly Detection (MTSAD) frameworks increasingly rely on integrating Graph Neural Networks (GNNs) with sequence models to capture complex spatio-temporal dependencies. However, less attention is paid to the spatial over-generalization problem, where unconstrained structural modeling indiscriminately reconstructs anomalies, inevitably degrading detection recall. To tackle this problem, we propose a novel framework that unifies spatio-temporal modeling through a joint prior-observation adversarial learning paradigm. In the spatial dimension, the model alternately learns adjacency matrices as structural prior and models the association discrepancy between prior and data-driven observation in a minimax manner during training. Such adversarial optimization not only improves the model sensitivity for time-wise detection, but also enables the model to localize anomalies to specific channels. To systematically evaluate this anomaly localization capability, we further construct a synthetic benchmark equipped with precise channel-wise annotations. Extensive experiments across public datasets and our dedicated benchmark demonstrate that the proposed framework establishes a new state-of-the-art in both time-wise detection and spatial localization tasks. Our code, pre-trained models, and benchmark are publicly available at this https URL.
[AI-62] Safety Geometry Collapse in Multimodal LLM s and Adaptive Drift Correction
链接: https://arxiv.org/abs/2605.18104
作者: Jiahe Guo,Xiangran Guo,Jiaxuan Chen,Weixiang Zhao,Yanyan Zhao,Yutai Hou,Qianchao Wang,Dandan Tu,Bing Qin
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model’s perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.
[AI-63] Learning to Solve Compositional Geometry Routing Problems
链接: https://arxiv.org/abs/2605.18094
作者: Mingfeng Fan,Jianan Zhou,Jiaqi Cheng,Yifeng Zhang,Jie Zhang,Guillaume Adrien Sartoretti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 10 figures
Abstract:We study the Compositional Geometry Routing Problem (CGRP), a unified superclass of traditional routing problems that covers point-only, line-only, area-only, and arbitrary hybrid task geometries, providing a broad abstraction for real-world routing scenarios. Beyond standard point-based routing, CGRP with non-point tasks can be inherently asymmetric, tightly coupled travel routes with the intrinsic path, and enlarges the action space with numerous feasible yet often irrelevant options, thereby posing significant challenges for both representation learning and decision-making. To address these challenges, we propose DiCon, a differential attention-assisted solver with contrastive learning, as a plug-and-play framework that tackles the problem from two complementary angles. First, we introduce a differential attention mechanism that actively suppresses the probability mass on less competitive candidate actions. Second, we design a double-level contrastive learning objective to promote robust global instance representations and regularize geometry-aware task representations. Extensive experiments demonstrate that DiCon achieves strong performance, broad versatility, and superior generalization across diverse CGRP instances with different compositions.
[AI-64] A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
链接: https://arxiv.org/abs/2605.18073
作者: Anika Tabassum,Md Sifat Hossain,Md. Fahim Arefin,Tariqul Islam,Tarannum Shaila Zaman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for Publication in ACM Transactions on Software Engineering and Methodology (TOSEM)
Abstract:Large Language Models (LLMs) demonstrate strong potential for automated code generation, yet their ability to iteratively refine solutions using execution feedback remains underexplored. Competitive programming offers an ideal testbed for this investigation, as it demands end-to-end algorithmic reasoning, precise implementation under strict computational constraints, and complete functional correctness with rigorous evaluation. In this paper, we present A-ProS, an autonomous AI agent that solves competitive programming problems through a hybrid multi-model feedback framework separating solution generation from specialized debugging. A-ProS combines ChatGPT-based generators (GPT-4 and GPT-5) with three debugging critics: Codestral-2508, Llama-3.3-70B, and DeepSeek-R1, under a 2 x 3 factorial design. We evaluate six workflows on 367 problems from ICPC World Finals (2011-2024) and Codeforces (rated 1200-1800). The results show that GPT-5 workflows improve from 39 initial accepted solutions to 85-90 after three refinement rounds, while GPT-4 improves from 15 to 31-38. A controlled ablation on 47 problems shows that stateful refinement outperforms stateless approaches by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. Compared to baseline agent loops, A-ProS achieves over 2x greater gains, highlighting the importance of persistent context and multi-model feedback for reliable autonomous program synthesis.
[AI-65] Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing
链接: https://arxiv.org/abs/2605.18068
作者: Seyed Mohamad Moghadas,Esther Rodrigo Bonet,Bruno Cornelis,Adrian Munteanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Residual error propagation remains a fundamental problem in recurrent models, where small prediction inaccuracies compound over time and degrade long-horizon performance. Accurately modeling the correlation structure of such residuals is critical for reliable uncertainty quantification in probabilistic multivariate timeseries forecasting. While recent time-series deep models efficiently parametrize time-varying contemporaneous correlations, they often assume temporal independence of errors and neglect spatial correlation across the observed network. In this paper, we introduce Teger, a structured uncertainty module that overcomes the spa- tial and temporal limitations of error-correlated autoregressive forecasting. Teger proposes a spatial curvature-aware graph rewiring mechanism explicitly strengthening information-bottleneck edges identified by discrete Forman curvature. The component is integrated into a low-rank-plus-diagonal covariance head, preserving tractable inference via the Woodbury identity. Teger is backbone-agnostic, requiring only the latent state produced by any autoregressive encoder. We provide theoretical evidence of Teger, and experimentally evaluate it on LSTM, Transformer, and xLSTM backbones across four real-world spatio-temporal datasets, showing consistent improvement in Continuous Ranked Probability Score (CRPS). We further provide a formal theoretical analysis connecting curvature-aware rewiring to (i) oversquashing alleviation, (ii) improved spectral connectivity, (iii) reduced effective resistance, and (iv) improved covariance calibration bounds
[AI-66] FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction ICML2026
链接: https://arxiv.org/abs/2605.18055
作者: Qi Si,Penglei Wang,Yushuai Wu,Yifeng Jiao,Xuyang Liu,Xin Guo,Yuan Qi,Yuan Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages for main text, 3 pages for references, 19 pages for appendix. accepted by ICML 2026
Abstract:Predicting spatial gene expression from routine H\E enables large-scale molecular profiling, yet current models treat this as isolated pointwise tasks, thereby overlooking essential biological structures like gene coordination and spatial distribution. To preserve these relationships, we introduce \textbfFLAG, a diffusion-based framework that redefines this task as structured distribution modeling. At the same time, we identify the critical \textbfGene Dimension Curse, where joint modeling gene expression and their spatial interactions fail in high-dimensional spaces, and FLAG solves this challenge by integrating a spatial graph encoder for topological consistency and utilizing Gene Foundation Model (GFM) alignment for gene-gene fidelity in the generation process. To rigorously assess model performance, we propose a set of novel structural evaluation metrics, including Gene Structural Correlation (\textbfGSC) and Spatial Structural Correlation (\textbfSSC). Our experiments demonstrate that FLAG is highly competitive in traditional accuracy (PCC/MSE) while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships. The code is available at this https URL.
[AI-67] DocOS: Towards Proactive Document-Guided Actions in GUI Agents
链接: https://arxiv.org/abs/2605.18048
作者: Jingjing Liu,Ziye Huang,Zihao Cheng,Zeming Liu,Jiahong Wu,Yuhang Guo,Kehai Chen,Yunhong Wang,Haifeng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce \textbfProactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents’ capability in this paradigm, we propose \textbfDocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.
[AI-68] Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help? ICRA2026
链接: https://arxiv.org/abs/2605.18045
作者: Johannes A. Gaus,Jhon P.F. Charaja,Daniel Haeufle
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICRA 2026 workshop paper
Abstract:Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.
[AI-69] New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions ICLR2024
链接: https://arxiv.org/abs/2605.18035
作者: Xinzhe Yuan(1),William de Vazelhes(2),Bin Gu(2 and 3),Huan Xiong(1 and 2) ((1) Harbin Institute of Technology, (2) Mohamed bin Zayed University of Artificial Intelligence, (3) Jilin University)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published as a conference paper at ICLR 2024. 9 pages main paper, 24 pages appendix, 11 figures, 7 tables. Correspondence to Bin Gu and Huan Xiong
Abstract:Hard-thresholding is an important type of algorithm in machine learning that is used to solve \ell_0 constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, which normally can be approximated by zeroth-order (ZO) methods. The SZOHT algorithm is the only algorithm tackling \ell_0 sparsity constraints with ZO gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions % in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.
[AI-70] FedSDR: Federated Self-Distillation with Rectification ICML2026
链接: https://arxiv.org/abs/2605.18028
作者: Ziheng Ren,Zhanming Shen,Hao Wang,Ning Liu,You Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:Federated fine-tuning of Large Language Models faces severe statistical heterogeneity. However, existing model-level defenses often overlook the root cause: intrinsic data distribution mismatches. In this work, we first establish Federated Self-Distillation (FedSD) as a fundamental and potent strategy. By projecting client representations into a smoothed ``model-understanding space,‘’ FedSD alone serves as a universal booster, demonstrating superior performance over conventional algorithms. Despite its success, we identify a subtle trade-off termed the Rewrite Paradox – unconstrained self-distillation can inadvertently increase hallucinations and redundancy. To refine this paradigm, we further propose FedSDR (Federated Self-Distillation with Rectification), the ultimate reinforced framework. It augments FedSD with a dual-stream mechanism: a local LoRA-S (Smoothing) branch to implicitly absorb heterogeneity via distilled data, and a parallel global LoRA-R (Rectification) branch anchored to raw data to enforce factual correctness. By selectively aggregating only LoRA-R, FedSDR yields a globally aligned and faithful model. Extensive experiments verify its superior performance.
[AI-71] Com-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications? KDD2026
链接: https://arxiv.org/abs/2605.18025
作者: Jieting Xiao,Yun Lin,Huizhen Qiu,Rui Ma,Chen Zhong,Dongyang Xu,Xiao Long,Chaoyu Zhang,Qiaobo Hao,Ding Zou,Zhiguo Yang,Yanqin Gao,Fang Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2026
Abstract:While Large Language Models have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and (2)End-to-End Knowledge Application, which formalizes six core tasks on authentic trajectories from live network agent workflows, including intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation-across network optimization and fault maintenance scenarios. Evaluations of eight state-of-the-art LLMs reveal a universal Execution Wall: while models achieve 90% accuracy in linguistic interface tasks such as intent recognition and entity extraction, performance collapses to approximately 30% in procedural execution tasks like solution generation. This capability gap demonstrates that current LLMs function competently as diagnosticians but fail as field engineers. TeleCom-Bench provides standardized diagnostics to precisely pinpoint this deficit, offering actionable guidance for domain-specific alignment toward production-ready telecom agents. The dataset and evaluation code have been released at this https URL.
[AI-72] Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise
链接: https://arxiv.org/abs/2605.18022
作者: Linyu Liu,Pinyan Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 27 pages, 32 figures
Abstract:Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
[AI-73] Spiker-LL: An Energy-Efficient FPGA Accelerator Enabling Adaptive Local Learning in Spiking Neural Networks
链接: https://arxiv.org/abs/2605.18003
作者: Alessio Caviglia,Filippo Marostica,Alessandro Savino,Stefano Di Carlo
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying adaptive intelligence at the edge remains challenging due to the high computational and energy cost of training neural models. Spiking Neural Networks (SNNs) offer a promising alternative, but enabling on-device learning requires hardware-algorithm co-design. This paper presents SPIKER-LL, an FPGA-based SNN accelerator that extends the open-source Spiker+ inference architecture with efficient support for the STSF local learning rule. Through targeted microarchitectural extensions, SPIKER-LL performs inference and online learning with minimal overhead. Across MNIST, F-MNIST, and DIGITS, it achieves up to 93% accuracy, sub-millisecond latency, and less than 0.1 mJ per inference, while remaining DSP-free and highly scalable for edge-FPGA deployments.
[AI-74] Shared Backbone PPO for Multi-UAV Communication Coverag e with Connection Preservation
链接: https://arxiv.org/abs/2605.17999
作者: Z. Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a Shared Backbone Proximal Policy Optimization (Shared Backbone PPO) algorithm. By sharing the base module between the Actor and Critic networks, the algorithm achieves efficient training and improved performance. The algorithm is implemented in a connectivity-preserving multi-UAV swarm communication coverage task and compared with the standard PPO algorithm. Experimental results demonstrate that the proposed method achieves superior performance. Furthermore, a graph information aggregation module is incorporated into the model architecture to accommodate the communication conditions among agents. With the integration of this module, the algorithm remains effective, and the trained agent swarm exhibits a higher level of cooperation.
[AI-75] Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study
链接: https://arxiv.org/abs/2605.17998
作者: Hai-Duong Nguyen,The-Xuan Tran
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 39 pages, 2 figures, 17 tables. Preprint
Abstract:As multi-agent systems move from short interactions to tool-using workflows with specialized roles and persistent state, completion becomes a runtime-control problem rather than a purely generative one. This preprint studies verify-gated completion as an admission-control pattern for governed multi-agent runtimes: agents may propose completion, but a read-only verifier decides whether the claim is admitted. Ambiguous or weakly evidenced cases resolve fail-closed, while packetized state and event traces preserve an audit path. We examine one bounded reference implementation and ask what the released evidence can support about auditable, verify-gated completion. In the released verify-completed slice, the known-outcome invoked-event verify success share was 1,791/1,800 = 99.5%. This is an accounting measure over invoked verification events, not a task-completion, production-reliability, or benchmark-success rate. Task-level verify coverage is not computable; 1,762/1,801 rows came from one high-volume reporting cluster; and only 17 events were production-classified. A shadow Policy/Governance Verifier evaluation showed 1,526/1,548 = 98.58% rule agreement, 0/1,526 false-success among safe-to-proceed predictions, and blocked precision of 2/518 = 0.39%, so it remains advisory. The evidence supports a narrow conclusion: under observed conditions, a read-only verify gate plus packetized admission records made completion decisions inspectable and fail-closed. Claims about deployed operation, safety guarantees, outcome gains, task-level coverage, recovery effectiveness, or external validity remain outside scope. Comments: 39 pages, 2 figures, 17 tables. Preprint Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17998 [cs.SE] (or arXiv:2605.17998v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.17998 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-76] Stable Audio 3
链接: https://arxiv.org/abs/2605.17991
作者: Zach Evans,Julian D. Parker,Matthew Rice,CJ Carr,Zack Zukowski,Josiah Taylor,Jordi Pons
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Training code: this https URL Inference and weights: this http URL
Abstract:Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than a 2s on an H200 GPU and less than a few seconds on a MacBook Pro M4. We release the weights of small and medium, that can run on consumer-grade hardware, together with their training and inference pipeline.
[AI-77] LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injectio
链接: https://arxiv.org/abs/2605.17986
作者: Lei Zhao,Abhay Bhaskar,Edgar Dobriban
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email, downloaded files, webpages, repositories, or group-chat messages. Existing evaluations are often small, purely simulated, or focused on a narrow set of channels. We introduce LivePI (Live Prompt Injection), a structured benchmark for IPI risk in a production-like but test-controlled environment. LivePI covers seven input surfaces, twelve attack/rendering families, and five malicious goals, including protected-information exfiltration, unauthorized security-control changes, unsafe code retrieval or execution, inbox-summary exfiltration, and cryptocurrency transfer. We run LivePI on a real virtual machine with live but test-controlled email, chat, web, local-file, repository, and wallet interfaces. Across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5, total attack success rates range from 10.7% to 29.6%. Group-chat injection is uniformly successful across the evaluated backbones in our deployment, and repository-link attacks produce high-severity failures despite a small denominator. We also evaluate a two-layer defense consisting of prompt-level filtering and pre-execution tool-call authorization. In the GPT-5.3-Codex setting, the defense intercepts all tested malicious-goal completions in LivePI before execution while preserving benign utility on PinchBench-derived workloads.
[AI-78] SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models
链接: https://arxiv.org/abs/2605.17985
作者: Chengjie Hong,Feixiang He,Yiheng Zeng,Lulu Kang,He Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a new method for compressing physics foundation models (PFMs) which is a new trend in AI for Science. While model compression is essential for reducing memory use and accelerating inference in large foundation models, it remains under-explored for PFMs, where preserving physical fidelity is crucial. The challenge lies in the functional nature of physics data, where partial derivatives encode spatiotemporal dynamics and exhibit high sensitivity to compression. Conventional compression methods ignore this structure, often causing severe performance degradation or failure. To address this, we introduce a sensitivity-aware fidelity-enforcing compression framework that explicitly models loss-aware layer sensitivity in the output function space during compression. This provides a new route to compressing scientific foundation models while preserving accuracy and physical fidelity. Experiments show substantial gains over existing methods across multiple models and datasets, achieving significantly higher compression ratios while maintaining accuracy, in some cases by orders of magnitude. More broadly, the work potentially leads to a new subfield of efficient, deployable, and sustainable scientific foundation models in AI for Science.
[AI-79] Unleashing LLM s in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery ICLR2026
链接: https://arxiv.org/abs/2605.17976
作者: Xinzhe Yuan,Zhuo Chen,Jianshu Zhang,Huan Xiong,Nanyang Ye,Yuqiang Li,Qinying Gu
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Published as a conference paper at ICLR 2026. 10 pages main paper, 21 pages appendix, 26 figures
Abstract:Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains \textbf90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.
[AI-80] Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
链接: https://arxiv.org/abs/2605.17971
作者: Ziwei Wang,Jing Chen,Ruichao Liang,Zhi Wang,Yebo Feng,Ju Jia,Ruiying Du,Cong Wu,Yang Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals. Comprehensive evaluations on frontier commercial models demonstrate that Babel achieves state-of-the-art attack success rates and superior query efficiency. Specifically, compared to state-of-the-art methods, Babel increases the attack success rate on GPT-4o from 41.33% to 82.67% and on Claude-3-5-haiku from 38.33% to 78.33% within an average of 40 queries, providing a robust red-teaming methodology for LLMs safety research.
[AI-81] Reconciling Contradictory Views on the Effectiveness of SFT in LLM s: An Interaction Perspective
链接: https://arxiv.org/abs/2605.17967
作者: Junpeng Zhang,Lei Cheng,Guoxi Zhang,Hua Cai,Qing Xu,Quanshi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.
[AI-82] BLAgent : Agent ic RAG for File-Level Bug Localization
链接: https://arxiv.org/abs/2605.17965
作者: Md Afif Al Mamun,Gias Uddin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Under review at the ACM Transactions on Software Engineering and Methodology
Abstract:Bug localization remains a key bottleneck in downstream software maintenance tasks, including root cause analysis, triage, and automated program repair (APR), despite recent advances in large language model (LLM)-based repair systems. File-level bug localization is especially critical in hierarchical pipelines, where errors can propagate to downstream stages such as statement-level localization or patch generation. While Retrieval-Augmented Generation (RAG) offers a promising direction for grounding LLMs in repository context, existing RAG pipelines rely on static retrieval and lack the reasoning needed to identify faulty code accurately. In this work, we present BLAgent, a novel agentic RAG framework for file-level bug localization that integrates three key ideas: (i) code structure-aware repository encoding with path-augmented AST-based chunking, (ii) dual-perspective query transformation capturing both structural and behavioral signals, and (iii) two-phase agentic reranking combining symbolic inspection with evidence-grounded reasoning. Unlike prior graph-based or multi-hop agentic approaches, BLAgent performs bounded reasoning over a compact candidate set, balancing accuracy and cost. On SWE-bench Lite, BLAgent attains over 78% Top-1 accuracy with open-source models and over 86% with a closed-source model, while being over 18x cheaper than the strongest baseline using the same model. When integrated into an APR framework, it improves end-to-end repair success by over 20%.
[AI-83] raining data attribution in diffusion models via mirrored unlearning and noise-consistent skew
链接: https://arxiv.org/abs/2605.17938
作者: Joan Serrà,Dipam Goswami,Fabio Morreale,Wei-Hsiang Liao,Yuki Mitsufuji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 21 pages, 5 figures, 9 tables (includes appendix)
Abstract:Training data attribution (TDA) should enable generative model interpretability and foster a variety of related downstream tasks. Nonetheless, current TDA approaches lack reliability and robustness, preventing their adoption in real-world setups. In this paper, we take a decisive step towards more reliable and robust TDA for diffusion models. We propose to perform TDA with mirrored unlearning and noise-consistent skew (MUCS). The idea is to fine-tune a second model with bounded mirrored gradient ascent, and to measure the normalized skew of this model with respect to the original one using consistent noise samples. We show that, while being conceptually simple and generic, MUCS systematically outperforms existing methods on three different datasets by a large margin. We additionally study the effect that core design choices have on final performance, and analyze novel aspects regarding the overlap of influential instances across generated items and the potential of ensembling TDA approaches. We believe that our findings may have broader implications for more general unlearning setups, as well as for tasks requiring the comparison of diffusion losses.
[AI-84] AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training
链接: https://arxiv.org/abs/2605.17923
作者: Yucheng Guo,Yongjian Guo,Zhong Guan,Haoran Sun,Wen Huang,Wanting Xu,Jing Long,Shuai Di,Junwu Xiong
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In video generation models, particularly world models, training large-scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed-mode datasets. Existing bucket-based data loading strategies typically rely on “equal token length” constraints. This approach fails to account for the quadratic complexity of self-attention mechanisms, leading to severe load imbalance and underutilization of GPU resources. This paper proposes \textitAdaptiveLoad, an integrated optimization framework consisting of two core components: (1) A dual-constraint adaptive load balancing system, which eliminates long-sequence bottlenecks by simultaneously limiting memory consumption and computational load ( B \times S^p \le M_\textcomp ); (2) A fused LayerNorm-Modulate CUDA kernel, which utilizes a D-tile coalesced reduction strategy to increase throughput and alleviate memory pressure. Experimental results on the Wan 2.1 world model demonstrate that our method reduces the computational imbalance rate from 39% to 18.9%, improves peak VRAM utilization efficiency by 22.7%, and achieves an overall training throughput increase of 27.2%. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.17923 [cs.DC] (or arXiv:2605.17923v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.17923 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-85] Ethical Hyper-Velocity (EHV): A Provably Deterministic Governance-Aware JIT Compiler Architecture for Agent ic Systems
链接: https://arxiv.org/abs/2605.17909
作者: Riddhi Mohan Sharma
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 11 pages, 3 TikZ figures, 1 table. Bounded TLA+ formal specification and model checking verification logs included as supplementary artifacts
Abstract:As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for high-frequency policy updates presents a fundamental safety gap. We introduce Ethical Hyper-Velocity (EHV), a novel architectural framework for the formal verification of AI governance policies at runtime. Unlike retrospective auditing frameworks (ISO/IEC 42001, NIST AI RMF) which introduce 14-30 day latencies, EHV relocates the Policy Enforcement Point (PEP) into the inference pipeline via a Governance-Aware Just-In-Time (JIT) Compiler. By integrating Conflict-free Replicated Data Types (CRDTs) for policy synchronization and Epoch-based Attestation Caching within Trusted Execution Environments (TEEs), EHV achieves Sub-millisecond Formal Determinism (SMFD). We demonstrate via TLA+ formal verification that non-compliant agentic actions are computationally unreachable within the system’s bounded operating state space. We prove that O(1) runtime enforcement can eliminate the traditional trade-off between deployment velocity and governance integrity, reducing Governance Latency from O(days) to O(1).
[AI-86] LAST-RAG : Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection
链接: https://arxiv.org/abs/2605.17902
作者: Hanbyeol Park,Hyerim Bae
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.
[AI-87] DuIVRS-2: An LLM -based Interactive Voice Response System for Large-scale POI Attribute Acquisition ACL2026
链接: https://arxiv.org/abs/2605.17900
作者: Le Zhang,Shengming Zhang,Rui Zha,Yunpeng Wu,Jingbo Zhou,Jizhou Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Industry Track. 14 pages, including appendix
Abstract:Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved a 83.9% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.
[AI-88] DCFold: Efficient Protein Structure Generation with Single Forward Pass
链接: https://arxiv.org/abs/2605.17899
作者: Zhe Zhang,Yuanning Feng,Yuxuan Song,Keyue Qiu,Hao Zhou,Wei-Ying Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:AlphaFold3 introduces a diffusion-based architecture that elevates protein structure prediction to all-atom resolution with improved accuracy. This state-of-the-art performance has established AlphaFold3 as a foundation model for diverse generation and design tasks. However, its iterative design substantially increases inference time, limiting practical deployment in downstream settings such as virtual screening and protein design. We propose DCFold, a single-step generative model that attains AlphaFold3-level accuracy. Our Dual Consistency training framework, which incorporates a novel Temporal Geodesic Matching (TGM) scheduler, enables DCFold to achieve a 15x acceleration in inference while maintaining predictive fidelity. We validate its effectiveness across both structure prediction and binder design benchmarks.
[AI-89] Evaluating Cognitive Age Alignment in Interactive AI Agents
链接: https://arxiv.org/abs/2605.17894
作者: Yifan Shen,Jiawen Zhang,Jian Xu,Junho Kim,Ismini Lourentzou,Xu Cao,Meihuan Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.
[AI-90] Attention Sinks and Outliers in Attention Residuals
链接: https://arxiv.org/abs/2605.17887
作者: Haozheng Luo,Haoran Dai,Shaoyang Zhang,Xi Chen,Eric Hanchen Jiang,Yijiang Li,Jingyuan Huang,Chenghao Qiu,Chenwei Xu,Zhenyu Pan,Haotian Zhang,Binghui Wang,Yan Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. As AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks, activation outliers, and the resulting degradation in inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention sink and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.
[AI-91] Guard: Scalable Strag gler Detection and Node Health Management for Large-Scale Training
链接: https://arxiv.org/abs/2605.17879
作者: Guanliang Liu,Abhinandan Patni,Congzhu Lin,Zoe Zeng,Jack Wittmayer,Josh Wu,Ashvin Nihalani,Binxuan Huang,Yinghong Liu,Rory Na,Anthony Ko,Alexander Zhipa,Cong Cheng,Mi Sun,Vijay Rajakumar,Rejith George Joseph,Parthasarathy Govindarajen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Proceedings of the 9 th MLSys Conference, Bellevue, WA, USA, 2026
Abstract:Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.
[AI-92] PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
链接: https://arxiv.org/abs/2605.17877
作者: Wonjoong Kim,Yeonjun In,Sangwu Park,Dongha Lee,Chanyoung Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:A significant hurdle for current LLMs is the execution of complex, multi-stage tasks. Group Relative Policy Optimization (GRPO) has been emerging as a leading choice, but its reliance on sparse outcome rewards severely limits credit assignment across intermediate steps. Existing remedies such as running full rollouts to assign step-level advantages, calling external LLM judges at each step, or computing intrinsic rewards that require ground-truth answers at every evaluation introduce significant costs or practical constraints. We hypothesize that internal correctness probing over LLM hidden states can be repurposed as a step-level reward signal, potentially addressing all of these limitations at once. However, existing probing research assumes clean inputs, and we first show that this assumption breaks down in multi-step settings: hidden-state probes degrade severely under prefix contamination tracking coherence with the (possibly corrupted) prefix rather than grounded correctness, while attention-based features remain robust to contamination but underperform on clean prefixes. Building on this complementary relationship, we propose the Prefix-Aware Internal Reward (PAIR), a two-stage model with a frozen hidden-state probe estimating belief-consistency and a lightweight attention-based head correcting it toward grounded correctness. Experimental results show that PAIR achieves the highest AUROC on contaminated trajectories while operating at negligible inference cost, enabling dense step-level reward signals for GRPO training without external model calls, ground-truth dependencies, or full-trajectory rollouts.
[AI-93] boldsymbolf-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control
链接: https://arxiv.org/abs/2605.17862
作者: Xianwei Chen,Shimin Zhang,Jibin Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.
[AI-94] KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agent ic Earth Science
链接: https://arxiv.org/abs/2605.17856
作者: Ziwei Li,Liujun Zhu,Yuchen Liu,Yichen Zhao,Birk Li,Ruiqi Wu,Junliang Jin,Jianyun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Process-based simulation models encode decades of scientific understanding across the Earth sciences, yet the communities most exposed to climate risk and resource scarcity are the least able to use them. Here, we introduce knowledge infrastructure (KI), an agent-actionable scaffold that externalizes expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines. We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc. Demonstrations show KI-equipped agents lowering both the access barrier between non-specialist users and process-based simulation, and the integration barrier between modelling communities. Through this scaffold, process-based science can then evolve as a living scientific commons, answerable to whoever needs to know and extendable by whoever can contribute.
[AI-95] Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization
链接: https://arxiv.org/abs/2605.17839
作者: Anh B.H. Nguyen,Ba Tho Phan,Viet Cuong Ta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge distillation transfers knowledge from a high capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between hard and soft losses becomes brittle the learning process. Recent studies try to reweight these components in long-tailed settings. However, most of these meth- ods do not adapt weights at the sample-wise level and do not take into account the students behavior during training. To address this, we pro- pose BiKD - a bilevel framework that dynamically balances hard and soft losses for each sample. We employ a weight generation network that produces adaptive per-sample weights, guided by a small balanced vali- dation set. The student is now trained with an unconstrained combina- tion of weighted hard and soft losses, allowing the student to relax both terms. We further propose a multi-step SGD strategy to optimize the weight model more accurately and efficiently. Experiments on long-tailed CIFAR-10/100 show that our approach surpasses recent balanced distil- lation methods across imbalance factors.
[AI-96] Efficient Bilevel Optimization for Meta Label Correction in Noisy Label Learning
链接: https://arxiv.org/abs/2605.17833
作者: Ba Hoang Anh Nguyen,Viet Cuong Ta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training a deep neural network with noisy labels could reduce data annotation cost but may introduce noise into the learned model. In meta label correction approaches, an additional meta model besides the main model is trained with a small, clean dataset to correct the large, noisy dataset. However, the update of the meta model requires the computation of hypergradients at the inner step of the main model which signif- icantly increases the computational cost. To improve the training efficiency, we first introduce the dynamic barrier gradient descent into standard meta label correction. While this naive extenstion is able to speed up the training process to approximately first- order complexity, it lacks mechanisms to prevent the leakage of noisy signals to the main model and to stabilize the learning of the meta model. Based on this observation, we propose the EBOMLC method, which is designed with three key improvements including one-step inner loop update, mixture upper loss and alignment- aware dynamic barrier. Empirical results on CIFAR-10 and CIFAR-100 demonstrate that EBOMLC consistently outperforms other baselines, especially under high noise rate settings, while reducing training time of the meta label correction approach.
[AI-97] Interactive Evaluation Requires a Design Science
链接: https://arxiv.org/abs/2605.17829
作者: Keyang Xuan,Peiyang Song,Pan Lu,Pengrui Han,Wenkai Li,Zhenyu Zhang,Zexue He,Wenyue Hua,Manling Li,Jiaxuan You,Adrian Weller,Yizhong Wang,Jiaxin Pei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.
[AI-98] Content-Style Identification via Differential Independence ICML2026
链接: https://arxiv.org/abs/2605.17827
作者: Subash Timilsina,Hoang-Son Nguyen,Sagar Shrestha,Xiao Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 15 figures, ICML 2026
Abstract:Generative analysis often models multi-domain observations as nonlinear mixtures of domain-invariant content variables and domain-specific style variables. Identifying both factors from unpaired domains enables tasks such as domain transfer and counterfactual data generation. Prior work establishes identifiability under (block-wise) statistical independence between content and style, or via sparse Jacobian assumptions on the nonlinear mixing function, but such conditions can be restrictive in practice. In this work, we introduce content-style differential independence (CSDI), an alternative structural condition requiring that infinitesimal variations in content and style induce orthogonal directions on the data manifold, thereby enabling identifiability even when content and style are dependent and the Jacobian is dense. We operationalize this condition through a blockwise orthogonality constraint on the Jacobian subspaces associated with content and style. To support high-dimensional generative models, we design a stochastic regularizer based on numerical Jacobian approximation, enabling scalable training in settings such as high-resolution image generation. Experiments across multiple datasets corroborate the identifiability analysis and demonstrate practical benefits on counterfactual generation and domain translation.
[AI-99] rCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
链接: https://arxiv.org/abs/2605.17821
作者: Shujie Han,Feng Jiang,Patrick P. C. Lee,Xiao Zhang,Zhijie Huang,Nannan Zhao,Xiaonan Zhao,Lichen Pan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) training is frequently interrupted by a heterogeneous spectrum of failures, from common GPU crashes to catastrophic cluster-wide outages. Existing checkpointing systems rely on monolithic, single-tier storage backend, forcing a trade-off between state-saving overhead and recovery speed. We propose TierCheck, a cluster-aware tiered checkpointing system that aligns storage placement with failure heterogeneity. TierCheck adopts a three-tier design that maintains lightweight differential checkpoints in local and peer memory for fast localized recovery, while asynchronously migrating heavyweight base checkpoints to remote persistent storage. It also ensures strict global consistency across tiers without stalling training, and achieves fast cluster-aware checkpoint restoration during recovery. Evaluations on models up to 40 billion parameters show that TierCheck achieves low training overhead, reduces end-to-end checkpointing time to under 10s, and supports high-frequency checkpointing, ultimately striking an optimal balance between low-overhead persistence and fast recovery.
[AI-100] Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement
链接: https://arxiv.org/abs/2605.17815
作者: Hao Lu,Rahul Shome
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures
Abstract:Efficient object manipulation strategies have significant impact in automation applications. In this work, the stack rearrangement in tabletop settings is studied, with a focus on augmenting the task planning domain with richer nonprehensile aggregating actions, in particular the toppling of objects from a stack to the table. Toppling can compress long sequences of intermediate relocations. Computed plans need to interleave pick-and-place actions with topple throughout its plan based on the problem. In order to generate the task plan and model an abstraction to compute solutions that include both pick-and-place and topple actions, a novel aggregating gadget for topple is introduced. Using this directed graphical abstraction, candidate task plan computation becomes a variant of the pebble motion problem, treating objects as pebbles. Benchmarks are then reported in a IsaacSim-based physics simulation. Results highlight clear benefits of achieving faster execution than solely using pick-and-place actions. Though this work primarily investigates the topple action, we demonstrate that similar abstractions can model other aggregating actions of interest, like scoop. The current work provides a preliminary, strong indication of the promising benefits of abstractions for rich object interactions in manipulation applications.
[AI-101] Going Headless? On the Boundaries of Vertical AI Firms
链接: https://arxiv.org/abs/2605.17812
作者: Muhammad Zia Hydari,Farooq Muzaffar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Vertical AI firms in accounting, law, healthcare, procurement, and similar domains historically bundled workflow, domain logic, and accountability into a single application. General-purpose AI agents are now unbundling that package, prompting founders and investors to advocate “going headless”: cede the workflow and interface to agents and expose domain expertise as callable services. This article argues that going headless is correct for some firms and destructive for others, and that the latter often cede their value capture inadvertently through architectural choices that look like interface decisions. This is a boundary question, and the answer turns on distinguishing the interface boundary, which can often move, from the accountability boundary, which often must not. Drawing on Coase’s theory of the firm, Eisenmann, Parker, and Van Alstyne’s platform envelopment framework, and Teece’s analysis of complementary assets and appropriability, the article shows that orchestrators operating through open protocols acquire envelopment power even as technical interoperability improves, and that durable value capture concentrates in cospecialized accountability assets: professional signoff, regulated workflows, evidence trails, and trusted systems of record. The article proposes a three-position taxonomy (component, integrated software platform, dual-track) determined not by sector but by task-accountability regime, and formalizes the construct of rule debt: the future governance, maintenance, and accountability burden that accrues to customer organizations when business rules and professional standards migrate from governed systems into prompts and agent instructions. Four principles follow: decompose by accountability not interface, invert the edges while retaining the core, position rule debt as the customer cost the integrated platform prevents, and avoid single-orchestrator dependence.
[AI-102] One Model Two Roles: Emergent Specialization in a Shared Recurrent Transformer
链接: https://arxiv.org/abs/2605.17811
作者: Jucheng Shen,Barbara Su,Anastasios Kyrillidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 21 pages, 13 figures, 8 tables
Abstract:Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (per literature, L and H) and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: \zH behaves like a fully committed proposal state, whereas \zL retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model’s state dynamics: in Sudoku, freezing \zH reduces \zL 's content changes whereas freezing \zL increases \zH 's, while in Maze, freezing either state increases content changes in the other state. Ablations show that to induce specialization, the shared model needs to be able to tell the two update types apart, either from input injection asymmetry or from a separate level token. Mechanistically, attention analysis shows that L-updates are consistently more local than H-updates in both Sudoku and Maze. Together, these results show that, in a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer. Code is available at \hrefthis https URL\textcolorbluethis https URL.
[AI-103] Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling
链接: https://arxiv.org/abs/2605.17809
作者: Chun-Hsiung Tseng,Hao-Chiang Koong Lin,Andrew Chih-Wei Huang,Yung-Hui Chen,Jia-Rou Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This research addresses the challenges inherent in developing Artificial Intelligence (AI) applications, particularly those leveraging Large Language Models (LLMs). While AI vendors provide Application Programming Interfaces (APIs) and Software Development Kits (SDKs) to facilitate developer interaction, the former often requires intricate manual request construction, and the latter can lead to significant vendor lock-in. Furthermore, existing model abstraction frameworks, though mitigating vendor dependency, introduce an additional layer of complexity and potential security concerns. To reconcile these conflicting factors, the study introduces PuppyChatter, a novel software framework designed to preserve the intuitive simplicity of vendor-specific SDKs while simultaneously adhering to the vendor-neutrality principles characteristic of model abstraction, thereby offering a more streamlined and flexible development paradigm.
[AI-104] Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers
链接: https://arxiv.org/abs/2605.17800
作者: Hao Lu,Rahul Shome
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted by WAFR 2026, 18 pages, 6 figures
Abstract:Rearranging densely packed tabletop objects is challenging when parallel-gripper picks are infeasible without sufficient clearance around an object. This work studies the problem characteristics for practically motivated settings with uniformly sized blocks placed at planar tabletop grid locations. Since purely prehensile removal can become infeasible, a directional knock primitive is therefore introduced and the optimal knock-pick variant of the problem is formulated. The work proposes a series of abstractions wherein minimal constraining gadgets are covered to identify the necessary knocks. Utilizing a maximum-weight perfect matching on a graphical abstraction yields efficient polynomial-time computation of the optimal plan that minimizes the number of actions. Experiments are reported for increasing grid sizes in synthetic settings as well as in IsaacSim. The theoretical observations provide a promising stepping stone towards rigorously building efficient manipulation strategies that interleave prehensile and non-prehensile actions.
[AI-105] STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery
链接: https://arxiv.org/abs/2605.17790
作者: Jiarui Su,Songjun Tu,Bei Sun,Xiaojun Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 15 figures
Abstract:LLM-based equation discovery offers a promising route to recovering symbolic laws from data, but many systems still rely on generation-centered loops that propose candidates, fit parameters, score results, and reuse selected examples. Such loops can misjudge useful skeletons under unreliable fitting, discard near-correct equations that require repair, and accumulate redundant memories that provide limited guidance. We propose STRIDE, a self-reflective agent framework that improves reliability by coordinating data-aware generation, mixed-fitting evaluation, critic–executor repair, and diversity-preserving semantic memory. By turning fitted scores and candidate behavior into shared feedback, STRIDE enables equations to be proposed, assessed, refined, and reused within a closed-loop discovery process. Experiments on representative symbolic-regression benchmarks and LSR-Synth suites show that STRIDE improves accuracy, OOD robustness, and structural recovery across multiple LLM backbones, with ablations and analyses confirming the contribution of its core components.
[AI-106] Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search SIGIR2026
链接: https://arxiv.org/abs/2605.17762
作者: Paul Greyson,Zhichao Geng,Wei Zhang,Yang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted at SIGIR 2026 industry track
Abstract:Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual ``exploration’’ to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system’s ability to learn from long-tail queries. In this work, we present a \textbfrobust neural sparse retrieval system designed to maximize exploration efficiency. We adapt a state-of-the-art \textbfinference-free sparse retrieval architecture to the music domain, combining it with an effective \textbfdomain-specific granular subword tokenization strategy. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate \textbf91.4% recall@10 (vs. \textbf57.7% for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with \textbf+0.8% higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.
[AI-107] OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
链接: https://arxiv.org/abs/2605.17757
作者: Zhongzhu Zhou,Donglin Zhuang,Jisen Li,Ziyan Chen,Shuaiwen Leon Song,Ben Athiwaratkun,Xiaoxia Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: 35 pages, 10 figures
Abstract:INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an Ultra-low-bit KV Cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our methods on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On long context - RULER-NIAH up to 128K, OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead. Comments: 35 pages, 10 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF) Cite as: arXiv:2605.17757 [cs.LG] (or arXiv:2605.17757v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17757 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-108] Harnessing LLM Agents with Skill Programs
链接: https://arxiv.org/abs/2605.17734
作者: Hongjun Liu,Yifei Ming,Shafiq Joty,Chen Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 40 pages, 7 figures
Abstract:Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP(Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.
[AI-109] Divergence-Suppressing Couplings for Rectified Flow
链接: https://arxiv.org/abs/2605.17733
作者: Yimeng Min,Carla P. Gomes
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The promise of Rectified Flow rests on producing self-generated couplings whose trajectories are straight, or nearly so. In practice, trajectories generated by the base flow model can bend and intertwine, and the resulting coupling inherits this distortion. In this paper, we identify that such trajectory entanglement is often associated with regions of nonzero divergence in the learned velocity field, where local expansion or contraction distorts trajectories and steers particles away from their ideal endpoints. We then propose divergence-suppressing couplings for Rectified Flow, an offline correction that attenuate the divergent component of the learned velocity during coupling generation. The correction is paid only once per coupling pair and amortized over training, so deployment runs plain Euler at identical wall-clock cost to standard Rectified Flow. Empirically, this offline modification yields consistent improvements on 2D synthetic benchmarks and on image generation.
[AI-110] L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting
链接: https://arxiv.org/abs/2605.17730
作者: Fan Zhang,Shijun Chen,Hua Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mainstream methods for multivariate time-series forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from history to the future in the observation space to fit value-level dependencies. However, real-world systems often undergo distribution shifts and regime changes. In such cases, a unified mapping can exhibit response lag around turning points, causing error accumulation within the switching window and reducing forecasting reliability. To address this issue, we propose L-Drive, a change-aware forecasting framework. L-Drive introduces a Latent-Context, to explicitly characterize high-level dynamics evolving over time, and uses gating to modulate increment representations. This provides more timely change cues and improves adaptation to changing segments. In addition, it incorporates patch-shared relative positional basis functions to strengthen intra-segment structural modeling and reduce overfitting caused by absolute-position memorization. Extensive experiments validate the effectiveness of L-Drive and show a better overall trade-off between forecasting accuracy and computational efficiency.
[AI-111] EXG: Self-Evolving Agents with Experience Graphs
链接: https://arxiv.org/abs/2605.17721
作者: Yuxin Jin,Siyuan Zhang,Hanchen Wang,Lu Qin,Ying Zhang,Wenjie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.
[AI-112] Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization
链接: https://arxiv.org/abs/2605.17693
作者: Yuan Xue,Daniel Kudenko,Megha Khosla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at this https URL .
[AI-113] EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness
链接: https://arxiv.org/abs/2605.17684
作者: Jingni Huang,Peter Bloodsworth
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:While increasing research focuses on the emotional well-being of agile team members, a significant gap remains in emotion monitoring studies for Scrum Masters and meeting organizers, whose impact on team dynamics is crucial. This paper proposes a novel application integrating four carefully selected and recommended AI models to monitor the unconsciously expressed emotions of these key roles. This is achieved through: real- time transcription using a speech-to-text model; thresholding for intonation analysis to detect emotional cues in prosody; applying emotion-based vocabulary matching to identify sentiment in spoken content; and providing context-aware suggestions containing emotion keywords using an open-source, multi-module AI API. The system achieved an ASR word error rate WER of 10% in simulated meeting environments. Our evaluation shows that real- time feedback significantly improves emotion awareness during simulated agile meetings, providing Scrum Masters and meeting organizers with real-time and practical suggestions to help them quickly identify and minimize the expression of negative emotions, fostering more positive and effective team interactions.
[AI-114] PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment
链接: https://arxiv.org/abs/2605.17671
作者: Michael Arbel,Basile Terver,Jean Ponce
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Non-contrastive self-supervised learning (SSL) is an effective framework for predictive representation learning, but popular (and in practice effective) methods such as SimSiam, BYOL, I-JEPA or DINO, which rely on a form of self-distillation to train a teacher-student network, remain poorly understood as they typically do not minimize a well-defined objective. We analyze the dynamics of a variant of the Joint Embedding Predictive Architecture (JEPA) using a regularized linear regressor to predict the learned representations of two views of the data from one another, and fully characterize its stability: non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces, while collapsed equilibria may also be stable attractors. Motivated by this result, we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor. We show that its only stable equilibria are nontrivial global minimizers and recover the same canonical correlation subspaces, with regularization selecting the effective dimension. Experiments on ImageNet-1K and CIFAR-10 show PEIRA is competitive with VICReg and LeJEPA baselines, and qualitative empirical results support the theory.
[AI-115] Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models
链接: https://arxiv.org/abs/2605.17669
作者: Yang Zhang,Nada Mimouni,Jean-Claude Moissinac,Fayçal Hamdi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The preservation and interpretation of cultural heritage increasingly rely on digital technologies, among which Knowledge Graphs (KGs) stand out for their ability to structure vast amounts of data. However, the construction and expansion of these KGs often face challenges due to the diverse and complex nature of cultural heritage information. In this paper, we propose a novel approach for extending KG resources in the domain of cultural heritage, which we applied to French data. First, we introduce a new knowledge graph in the domain of French cultural heritage, WJoconde, which is distinguished by its multimodality as it integrates both textual and image information of the entities. We further introduce three variants of WJoconde to facilitate downstream research, such as Knowledge Graph Completion (KGC). We also built a comprehensive benchmark for KGC methods on our dataset. Second, we propose a new framework for extending cultural heritage KGs using multi-modal approaches leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), which includes automated data extraction from unstructured resources combined with a special validation pipeline for grounding the output of both models, to further extend WJoconde. Our results show that by integrating the rich text and image information in cultural heritage data, we can efficiently enhance KGs with high reliability. We open-source all code and benchmark datasets with text and images, as well as the original data with an interactive access point
[AI-116] LLM Forge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
链接: https://arxiv.org/abs/2605.17653
作者: Xinting Jiang,Junyi Luo,Ruichen Qi,Kauna Lei,Ben Laurie,Gregory Kielian,Mehdi Saligane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate’s cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss 2.798 and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.
[AI-117] SAPO: Step-Aligned Policy Optimization for Reasoning -Based Generative Recommendation
链接: https://arxiv.org/abs/2605.17648
作者: Zaiyi Zheng,Guanghui Min,Yaochen Zhu,Liang Wu,Liangjie Hong,Chen Chen,Jundong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, outcome-reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder’s own decomposition of the output.
[AI-118] WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
链接: https://arxiv.org/abs/2605.17637
作者: Wenyu Zhang,Guoliang You,Tianlun,Haotian Zhao,Tianshu Zhu,Haoran Wang,Xiaoxuan Tang,Mingyang Dai,Jingnan Gu,Daxiang Dong,Jianmin Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures
Abstract:Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.
[AI-119] Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents
链接: https://arxiv.org/abs/2605.17625
作者: Nikola Milosevic
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a “Sim-to-Real” gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.
[AI-120] Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors
链接: https://arxiv.org/abs/2605.17618
作者: Yadhu Kartha,Conor Anderson,Jenny Foster,Theresa Hamlin,Johanna Lantz,Ryan Lay,Juergen Hahn,Gari D. Clifford,Hyeokhyen Kwon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autism Spectrum Disorder (ASD) is characterized by challenges with social interaction and communication and by restricted or repetitive patterns of thought and behavior, with significant variability in presentation. Approximately a quarter of children with ASD are classified as having profound autism, who often exhibit challenging behaviors, such as self-injurious behavior, aggression, elopement, or pica, that pose serious safety risks and disrupt learning in educational settings. Prior work has applied wearable sensors and machine learning to detect challenging behaviors, but has been largely confined to controlled laboratory environments. This work demonstrates that predicting challenging behavior episodes is feasible in a real-world special education classroom. We collected approximately 110.7 hours of labeled multimodal wearable data comprising accelerometry, electrodermal activity (EDA), and skin temperature from 9 children and young adults aged 10 to 21 years across standard classroom sessions. We fine-tuned state-of-the-art foundation models for multimodal wearable time-series analysis and show that challenging behavior episodes can be predicted up to 10 minutes in advance with an AUC-ROC of 0.78. These results establish a concrete foundation for developing proactive in-class intervention systems that enable teachers to minimize the safety risks of challenging behaviors in special education classrooms
[AI-121] GraphMind: From Operational Traces to Self-Evolving Workflow Automation
链接: https://arxiv.org/abs/2605.17617
作者: Yiwen Zhu,Joyce Cahoon,Anna Pavlenko,Qiushi Bai,Nima Shahbazi,Divya Vermareddy,Meina Wang,Mathieu Demarne,Swati Bararia,Wenjing Wang,Hemkesh Vijaya Kumar,Hannah Lerner,Katherine Lin,Steve Toscano,Miso Cilimdzic,Subru Krishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Complex operational workflows coordinating personnel, tools, and information are central to enterprise operations, yet end-to-end automation remains challenging due to extensive requirements for human inputs and the inability to adapt over time. We present GraphMind, an end-to-end system that constructs, executes, and evolves action-centric workflow graphs without human effort. The system operates in three phases. First, a scalable offline pipeline extracts structured workflow graphs from large volumes of human resolution traces, capturing problems, actions, and their causal relationships. Second, an online multi-agent traversal engine navigates the graph to dynamically construct and execute workflows, combining graph-guided retrieval with LLM-driven reasoning at each step. Third, Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths and decays stale elements. This closed-loop mechanism enables the graph to self-optimize and adapt to shifting operational conditions. GraphMind has been deployed across four production cloud database services for incident investigation. Evaluated on production data, the system substantially outperforms a Trace-RAG baseline in mitigation reach, groundedness, and diagnostic throughput, scoring 4.95/5 in blind expert review. The ATR layer provides further gains across all metrics, demonstrating that workflow graphs can learn and improve from execution-derived feedback.
[AI-122] Bayesian-Monte Carlo Schedule Updating for Construction Digital Twins: A Probabilistic Framework for Dynamic Project Forecasting
链接: https://arxiv.org/abs/2605.17608
作者: Atena Khoshkonesh,Mohsen Mohammadagha,Vinayak Kaushal,Navid Ebrahimi
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 22 pages, 3 figures, 5 tables
Abstract:Construction projects frequently experience schedule delays and forecasting uncertainty due to variability in labor productivity, material availability, weather conditions, and project coordination. Conventional deterministic scheduling methods such as the Critical Path Method (CPM) assume fixed activity durations and therefore cannot adequately represent dynamic project uncertainty. This study presents a Bayesian-Monte Carlo probabilistic schedule updating framework for construction digital twin environments. The proposed methodology integrates stochastic activity-duration modeling, Bayesian recursive updating, Monte Carlo simulation, and uncertainty propagation within a unified computational framework for adaptive schedule forecasting. Activity durations are modeled using lognormal probability distributions and continuously updated through Bayesian inference as new project observations become available. Monte Carlo simulation is then used to propagate updated uncertainty throughout project networks and generate probabilistic completion-time forecasts, delay-risk estimates, and activity criticality measures. Simulation experiments using PSPLIB benchmark project networks demonstrate that the proposed framework improves forecasting accuracy and uncertainty representation compared with deterministic CPM and static probabilistic scheduling approaches. The framework further supports adaptive project forecasting through integration of BIM reports, drone observations, IoT telemetry, productivity logs, and site monitoring data. Comments: 22 pages, 3 figures, 5 tables Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17608 [cs.CE] (or arXiv:2605.17608v1 [cs.CE] for this version) https://doi.org/10.48550/arXiv.2605.17608 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Atena Khoshkonesh [view email] [v1] Sun, 17 May 2026 19:09:37 UTC (918 KB)
[AI-123] NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent Self-Curating LLM Agents
链接: https://arxiv.org/abs/2605.17596
作者: Mujahid Sultan,Sri Thuraisamy,Daya Rajaratnam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:We present NeuSymMS, an adaptive memory system that enables large language model (LLM) agents to learn, remember, and reason about users across sessions via a hybrid neuro-symbolic architecture. NeuSymMS couples neural fact extraction from unstructured dialogue with a CLIPS-based expert system that classifies, deduplicates, and reconciles facts under explicit lifecycle rules. The system represents knowledge as subject-relation-value triples stored in relational database management system, supports user/agents/agent-to-agents scoping, and implements a dual-horizon short-term/long-term memory model with access-based promotion and time-based pruning. NeuSymMS maintains continuity of memory while avoiding context-window bloat and cross-entity contamination. We argue that this architecture offers a practical path to trustworthy, auditable memory for production agentic systems and discuss its novelty relative to log retrieval, summarization, and key-value approaches.
[AI-124] ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation
链接: https://arxiv.org/abs/2605.17580
作者: Zhikang Chen,Yue Wang,Sen Cui,Yu Zhang,Changshui Zhang,Tianling Ren,Tingting Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.
[AI-125] UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts
链接: https://arxiv.org/abs/2605.17575
作者: Tongze Wang,Xiaohui Xie,Wenduo Wang,Chuyi Wang,Yong Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Network traffic classification (NTC) models often suffer severe performance degradation when deployed in real-world environments due to distribution shifts caused by changing network conditions. Existing robustness-enhancing approaches are commonly coupled to specific model architectures or data settings, fail to generalize to state-of-the-art raw-byte-based NTC models, or incur significant training overhead. In this paper, we propose UniAlign, a novel model-agnostic framework that improves the robustness of deep learning-based NTC models under distribution shifts. UniAlign combines \emphdomain alignment fine-tuning, which encourages the learning of domain-invariant traffic representations across heterogeneous network conditions, with \emphstable model ensembling, which enhances inference robustness by aggregating checkpoints within a flat loss region. The framework can be seamlessly integrated into existing supervised NTC models without requiring specific feature modalities or introducing non-constant additional training costs. We evaluate UniAlign on three public datasets covering diverse distribution shifts, including encryption schemes, data collection devices, and attack behaviors. Experimental results on two representative NTC models demonstrate that, compared with standard training, UniAlign improves average classification accuracy by 2.51% and average F1 score by 2.71%, outperforming the strongest baseline by 1.45% in accuracy and 1.69% in F1 score, while requiring only 12.4%–53.9% of the training time of all NTC-specific baselines.
[AI-126] Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting
链接: https://arxiv.org/abs/2605.17556
作者: Peter Schaldenbrand,Jean Oh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)
Abstract:Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. As a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior deformable object manipulation work either requires retraining a policy per goal or relies on dynamics models which represent state as sparse point clouds which do not capture important clay features, such as textures, well. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a representation that is visually-aligned, capturing lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model is comparable in performance to the state-of-the-art with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into clay with a single end-effector, which proved to be suitable for long-horizon (100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually-aligned representation, but also provide analysis providing evidence as to why this representation is challenging to plan in compared to 3D representations.
[AI-127] Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers Rubrics and Cognitive Traps
链接: https://arxiv.org/abs/2605.17554
作者: Tanmay Asthana,Aman Saksena,Divyansh Sahu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 37 pages
Abstract:Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant’s typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean = 2.5 and verifier rate = 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents’ MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others’ rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells. Comments: 37 pages Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.17554 [cs.AI] (or arXiv:2605.17554v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.17554 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Tanmay Asthana [view email] [v1] Sun, 17 May 2026 17:32:52 UTC (57 KB)
[AI-128] Rethinking Code Review in the Age of AI: A Vision for Agent ic Code Review ICSE
链接: https://arxiv.org/abs/2605.17548
作者: Hüseyin Özgür Kamalı,Erdem Tuna,Vahid Haratian,Eray Tüzün
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 44 pages, 5 figures. A shorter version of this work was presented at ICSE-JAWs 2026, Rio de Janeiro, Brazil. This extended version is intended for journal submission
Abstract:Code review has evolved for decades, from informal peer checking to today’s pull request (PR) workflows, yet it remains a largely manual, uneven, and cognitively demanding process. The rise of Artificial Intelligence (AI) coding assistants has intensified this challenge: while these tools increase code production velocity, they also expand the volume of code requiring review, turning code review into a growing bottleneck. Current AI support remains fragmented, with tools focusing on isolated tasks such as reviewer recommendation, PR description generation, or comment suggestion rather than the end-to-end PR review workflow. In this paper, we review the historical evolution of code review practices and examine the shift driven by large language models (LLMs) and agentic AI systems. We then present a vision for an AI-powered code review workflow combining specialized agents with human-controlled quality gates. Our framework spans five stages: PR Creation, PR Augmentation, Reviewer Selection, AI-Assisted Code Review, and PR Retrospective, with humans retained at key decision points to preserve judgment, accountability, and team-level understanding. We identify major open challenges for responsible adoption, including reliability, bias, privacy, automation bias, transparency, and evaluation, and offer a research agenda for more effective human-AI collaboration in software engineering.
[AI-129] Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis
链接: https://arxiv.org/abs/2605.17539
作者: Fatemeh Haji,Javier Delarosa Quiros,Peyman Najafirad
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Combinatorial optimization (CO) underlies decision-making from logistics to chip design, where infeasible solutions are operationally unusable and small quality gains translate into substantial economic value. Recent work uses large language models (LLMs) to automate solver synthesis: generating executable solver programs from natural-language specifications. However, existing tree-search and evolutionary agents refine candidate trajectories in parallel without explicit knowledge transfer, reintroducing the same constraint violations and converging on similar algorithm families. We introduce MEMOIR, a memory-guided tree-search framework with a two-level memory hierarchy: branch-local memory preserves execution-grounded refinement details within a branch as it iterates on a single algorithmic design, while global memory stores compressed algorithmic and failure-mode summaries across branches. A reflection step at branch termination distills these summaries, enabling cross-branch transfer without polluting future contexts with low-level debugging traces. Across seven CO problems spanning scheduling, routing, packing, and geometric design, MEMOIR achieves 96.7% solution validity (a 9.2 point gap over the strongest baseline) and improves the average normalized score by 7.3 points at matched per-method execution budget. Over three independent runs on four problems, MEMOIR’s run-to-run validity standard deviation is more than an order of magnitude below that of every baseline we evaluated in this setting, suggesting that memory-guided exploration yields consistent improvements rather than reflecting sampling variance.
[AI-130] Self-supervised Hierarchical Visual Reasoning with World Model
链接: https://arxiv.org/abs/2605.17537
作者: Yuanfei Xu,Lin Liu,Wengang Zhou,Mingxiao Feng,Houqiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the “Bitter Lesson”, ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments. The code is accessible at \urlthis https URL.
[AI-131] Few-Shot Network Intrusion Detection Using Online Triplet Mining
链接: https://arxiv.org/abs/2605.17530
作者: Jack Wilkie,Hanan Hindy,Christos Tachtatzis,Miroslav Bures,Robert Atkinson
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注: Published in: MDPI Applied Sciences, 2026. Official version: this https URL Code: this https URL
Abstract:Network intrusion detection systems play a vital role in protecting networks by detecting malicious network traffic which can then be investigated by a cybersecurity operations centre. State-of-the-art approaches utilise supervised machine learning methods to train a classification model to recognise known cyberattacks; however, these models require a large labelled dataset to train and show poor performance when trained on smaller datasets. In an attempt to address this shortcoming, anomaly detection models learn the distribution of benign traffic and flag non-conforming traffic as malicious. While these methods do not require malicious examples to train, they suffer from high false-positive rates rendering them impractical. As a result, networks may be particularly vulnerable when there are insufficient labelled instances of a specific attack class to train an effective classifier. This often occurs in newly established networks or when previously unseen types of attacks emerge. To address this challenge, this work proposes the use of a triplet network, utilising online triplet mining and a KNN classifier, which is able to perform few-shot classification, enabling effective intrusion detection after being trained on a limited number of malicious examples. Various online triplet mining algorithms were explored and model design choices, such as the inference algorithm and optimised distance metrics, were compared and evaluated through a series of ablation studies. The final model was compared against other state-of-the-art approaches in few-shot binary and multiclass classification, where the proposed approach was found to be competitive with existing methods when trained on as little as 10 malicious samples of each class.
[AI-132] SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
链接: https://arxiv.org/abs/2605.17526
作者: Qingnan Ren,Shun Zou,Shiting Huang,Ziao Zhang,Kou Shi,Zhen Fang,Yiming Zhao,Yu Zeng,Qisheng Su,Lin Chen,Yong Wang,Zehui Chen,Xiangxiang Chu,Feng Zhao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6 databases, and 13 frameworks to meticulously mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored for complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at \urlthis https URL.
[AI-133] BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation
链接: https://arxiv.org/abs/2605.17508
作者: Yuhan Xie,Chen Lyu,Jingrong Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Split Federated Learning (SFL) enables privacy-preserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.
[AI-134] he Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
链接: https://arxiv.org/abs/2605.17480
作者: Qiqi Liu,Thorsten Holz,Shilin Ye,Runhan Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 6 figures
Abstract:Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify \emphsemantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a \emphcapability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by \emphlinguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker-Only setting ( n_W =14), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full-MAS setting ( n_W =6) shows a directionally consistent indirect effect. Worker-side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose \emphheterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty-to-execution chain, reducing ASR from 52.8% to 2.0% with negligible benign-task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting–rather than eliminating–capability asymmetries between agents.
[AI-135] Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination
链接: https://arxiv.org/abs/2605.17454
作者: Xiaolei Fang,Peilan Xu,Wenjian Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 40 pages, 7 figures
Abstract:Multi-party multi-objective optimization problems (MPMOPs) require consensus among autonomous decision makers and therefore differ from flattened many-objective formulations. Existing runtime theory for multi-objective evolutionary algorithms is largely tailored to single-party Pareto-front approximation and does not directly explain common-solution search in MPMOPs. We investigate cross-party recombination in two representative settings. On MP-JCG, a pseudo-Boolean benchmark with an explicit gap region, we prove that a payoff-guided mutation baseline faces a gap-crossing bottleneck requiring (\Theta(n^2)) expected fitness evaluations. In contrast, an analytical CPR-NSGA-II variant discovers both common Pareto-optimal solutions in (O(n\log n)) expected evaluations by directly assembling complementary prefix and suffix templates distributed across party populations. Comparing this with the flattened four-objective formulation F-JCG, our full-front coverage analysis illustrates the additional coverage burden introduced by flattening. For BPBOMST, the bi-party, two-objective-per-party specialization of the multi-party multi-objective minimum spanning tree problem, we develop a layered support-cover analysis. For each common Pareto objective vector, the symmetric average projection induces an auxiliary bi-objective MST instance, and suitable support representatives yield a (2\lambda)-common approximation cover with (\lambda\in[1,2]). We further derive an instance-parameterized expected runtime bound for a representative-pool CPR-NSGA-II variant using edge-union recombination and uniform repair. This bound separates the effects of local auxiliary-front filling, cross-party recombination shortcuts, and edge-union repair ambiguity.
[AI-136] DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
链接: https://arxiv.org/abs/2605.17439
作者: Sirui Hong,Zhijie Liu,Tengfei Li,Wei Tao,Yifan Wu,Chenglin Wu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones. On false-negative cases, DiagEval recovers 45.6-62.1% of failures that were initially misattributed to software defects, outperforming retry-based baselines with 34.4-160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI-agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator-side errors from genuine software defects. Our code is available at this https URL.
[AI-137] MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings
链接: https://arxiv.org/abs/2605.17431
作者: Himchan Hwang,Hyeokju Jeong,Gene Chung,Seungyeon Kim,Sangwoong Yoon,Frank Chongwoo Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose MATE, a simple yet effective memory architecture for solving Contextual Markov Decision Processes (CMDPs), a family of MDPs parameterized by an unobserved context. In CMDPs, an optimal agent can adapt online by maintaining the posterior belief over contexts. MATE replaces this intractable posterior with a sum-aggregated memory, leveraging the posterior’s permutation invariance to retain provably sufficient expressiveness. Compared to prior memory architectures, MATE avoids the growing per-step rollout cost of Transformers and the gradient issues commonly associated with Recurrent Neural Networks (RNNs). Extensive evaluations across diverse benchmarks demonstrate that MATE provides clear computational advantages while achieving performance comparable to standard sequence-model baselines.
[AI-138] Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning
链接: https://arxiv.org/abs/2605.17428
作者: Wu Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Our preliminary experiments on gym-DSSAT maize irrigation tasks revealed that +/-2 degrees C temperature noise causes an 11.9% reduction in economic returns for PPO policies trained under clean conditions - a systematic robustness deficit that existing research has not adequately addressed. This paper tackles three interconnected limitations impeding practical deployment of agricultural RL systems: the trade-off between early-stage learning efficiency and late-stage generalization capability; the naive additive combination of intrinsic and extrinsic rewards in exploration-augmented PPO; and uniform measurement noise injection strategies that disregard empirically validated differential sensitivity across agricultural state variables. We introduce three systematic innovations: Progressive Generalization Augmentation (PGA) implementing a three-phase curriculum (clean training 0-800 episodes, progressive 800-1200, full augmentation 1200-2000); a deeply coupled RND-PPO architecture with dual-channel GAE normalization, progress-decayed intrinsic coefficients, and semantic discretization; and domain-prioritized noise injection with hierarchical activation. Our experimental evaluation demonstrates: 8.43% yield improvement and 16.42% nitrogen use efficiency improvement over SOTA BERT-DQN in Florida; 5.61% yield improvement in Zaragoza (though 3.67% lower economic score due to challenging Mediterranean climate); and 94.4% vs 80.0% performance retention under combined perturbations. All experiments used 5 random seeds on NVIDIA A100 GPUs with 4.2+/-0.3 hours per run (2000 episodes, 2048-step buffer, 64 mini-batch size). Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17428 [cs.LG] (or arXiv:2605.17428v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17428 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Wu Yang [view email] [v1] Sun, 17 May 2026 12:48:47 UTC (14 KB) Full-text links: Access Paper: View a PDF of the paper titled Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning, by Wu YangView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-05 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-139] Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty
链接: https://arxiv.org/abs/2605.17419
作者: Ren Ozeki,Hamada Rizk,Hirozumi Yamaguchi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Rainfall-induced landslides pose a growing risk worldwide as climate change intensifies extreme rainfall events. To provide sufficient evacuation time, landslide early warning systems (LEWS) for real-time disaster monitoring must estimate near-future landslide risk by integrating observed rainfall with short-term rainfall forecasts from spatio-temporal environmental data streams. Although recent landslide prediction methods have improved predictive performance using statistical and deep learning approaches, most assume accurate rainfall inputs. In operational settings, however, landslide prediction relies on rainfall forecasts, which often contain spatial displacement of rainfall fields due to forecasting uncertainties. Such displacement can alter local accumulated rainfall and degrade prediction accuracy. To address this challenge, we propose a novel LEWS robust to rainfall field displacement. The key idea is to learn latent representations from rainfall and terrain data that remain stable under displacement in rainfall field motion, enabling reliable geospatial data integration for landslide risk estimation. The landslide prediction model is trained using Rainfall-Motion-Aware Contrastive Learning (RMCL), which introduces temporally correlated rainfall field perturbations to emulate forecast-induced displacement in rainfall-driven spatio-temporal environmental data streams. Experiments were conducted using two years of rainfall and terrain data across Japan, covering 19 regions with landslide events. The proposed system achieved up to 37% higher precision than state-of-the-art baselines. These results demonstrate that modeling rainfall as a moving spatial field and addressing rainfall field displacement during learning significantly improve the reliability of short-term landslide prediction in operational early warning systems.
[AI-140] Benchmarking Mythos-Linked Bug Rediscovery
链接: https://arxiv.org/abs/2605.17416
作者: Isaac David,Arthur Gervais
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Anthropic’s April 2026 Mythos materials combine benchmark claims with concrete bug-finding stories across OpenBSD, FreeBSD, Linux, FFmpeg, and browsers. This paper reports a controlled target-file rediscovery experiment on six public or high-confidence Mythos-linked systems tasks. Each model receives the same target file or files, read-only source tools, three repeats per task, and one manual target-matching rubric; prompts omit CVE identifiers, patch hashes, advisory text, author names, disclosure dates, and answer key root cause language. The experiment contains 54 counted model-task attempts: three models, six tasks, and three repeats, giving 18 attempts per model. GPT-5.5 xhigh achieves 5/18 target rediscoveries, covering 2/6 tasks; counting one wrong-target mpegts.c finding separately gives 3/6 distinct core bugs. Claude Opus 4.7 achieves 1/18 target rediscoveries, covering 1/6 tasks. Kimi K2 records 0/18 target rediscoveries. The dominant failure mode is early commitment to plausible alternate candidates within the assigned file: models often submit source-grounded hypotheses while missing the specific invariant corrected by public Mythos patch evidence. These results do not refute Anthropic’s undisclosed workflow, but show that under this favorable target-file scaffold, systems-specific prompting yields only six target matches across 54 counted attempts.
[AI-141] Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
链接: https://arxiv.org/abs/2605.17413
作者: Isaac David,Arthur Gervais
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17413 [cs.CR] (or arXiv:2605.17413v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.17413 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-142] Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design
链接: https://arxiv.org/abs/2605.17410
作者: Ou Wu,Yingjun Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 43 pages
Abstract:Token economics has emerged as a useful lens for understanding resource allocation, value creation, and pricing in large language model systems. While recent work has increasingly treated tokens as economic primitives, there remains a substantial gap between high-level economic theory and the computational realities of modern AI infrastructure. This paper identifies and analyzes the key computational challenges that arise when token-economic principles are implemented in real-time inference systems. We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. To structure this problem space, we introduce the notion of \textbfComputational Token Economics and propose the \textbfToken Economics Trilemma – a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality. We further categorize the main technical challenges into three areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture. Rather than presenting a complete solution, this paper aims to define a research agenda for bridging token economics and AI system design, highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure.
[AI-143] ADR: An Agent ic Detection System for Enterprise Agent ic AI Security
链接: https://arxiv.org/abs/2605.17380
作者: Chenning Li,Pan Hu,Justin Xu,Baris Ozbas,Olivia Liu,Caroline Van,Manxue Li,Wei Zhou,Mohammad Alizadeh,Pengyu Zhang,KK Sriramadhesikan,Ming Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted at MLSys 2026 (Industry Track)
Abstract:We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability – existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness – static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs – LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks – outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2–4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.
[AI-144] FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
链接: https://arxiv.org/abs/2605.17373
作者: Qiran Zou,Hou Hei Lam,Wenhao Zhao,Tingting Chen,Yiming Tang,Samson Yu,Yingtao Zhu,Srinivas Anumasa,Zufeng Zhang,Tianyi Zhang,Chang Liu,Zhengyao Jiang,Anirudh Goyal,Dianbo Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Our benchmark is available at: this https URL
Abstract:AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: this https URL.
[AI-145] CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings
链接: https://arxiv.org/abs/2605.17370
作者: Qixuan Hu,Shuchang Ye,Xumou Zhang,Anastasia Serafimovska,Anastasia Suraev,Amit Saha,Ping-hsiu Lin,Sydney Su,Usman Naseem,Adam G. Dunn,Jinman Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 2 tables
Abstract:Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text based models and evaluations cannot capture the mismatch between the transcript and the patient’s voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an experts-annotated subset. We evaluate 10 open source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.
[AI-146] textscMasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer
链接: https://arxiv.org/abs/2605.17361
作者: Xuefei Wang,Jialu Wang,Fengbo Zhang,Yihan Hu,Di Zhang,Yutong Ye,Yikun Ban,Jun Han,Ruijie Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems (MAS) powered by large language models (LLMs) have emerged as a powerful paradigm for complex problem solving, where performance critically depends on the underlying inter-agent communication topology. However, existing topology generation methods mainly optimize for isolated tasks, while real-world deployments involve streams of evolving tasks, requiring previously effective collaboration patterns to be retained and reused rather than rediscovered or overwritten. We identify a previously underexplored failure mode, \emphtopology forgetting, in which adapting to new tasks shifts the topology generator away from communication structures required by earlier tasks. This issue stems from cross-task misalignment in both agent-level functional semantics and relational communication structures. To address this challenge, we propose \textbf\textscMasFACT, a geometry-aware posterior transfer framework that preserves and reuses historical collaboration knowledge as transferable topology priors. We transfer these priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport and perform PAC-Bayes-guided conservative posterior adaptation to balance task-specific plasticity with structural stability. Experiments across class-, domain-, and task-level continual settings demonstrate that \textscMasFACT consistently improves average accuracy while reducing topology forgetting compared to strong topology generation and replay-based baselines, and can be seamlessly integrated with different MAS topology generators.
[AI-147] LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
链接: https://arxiv.org/abs/2605.17329
作者: Nanxi Li,Zhengyue Zhao,Chaowei Xiao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Guardrails are a critical safety layer for modern AI systems, but their operating regime is changing. As LLMs are deployed as customized assistants, safety policies are increasingly specified at inference time by users, organizations, or regulatory contexts. This makes safety enforcement fundamentally dynamic: the guardrail should adapt to changing safety policies without retraining. Yet this requirement creates a fundamental tension: faithfully judging complex policy contexts demands reasoning capability, while practical deployment requires low-latency responses. We introduce Latent Policy Guardrail (LPG), a guardrail framework that learnssemantic latent deliberation over dynamic policies. LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics. At inference time, it generates only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency of explicit reasoning. Across policy guardrail benchmarks, LPG-4B reaches 84.5% average safety accuracy and 77.9% F1 by compressing deliberation into just 10 latent tokens, outperforming the strongest dynamic baseline while running roughly 11 times faster than Qwen3-4B-Thinking under the single-sample evaluation setup. Code and data are available at this https URL.
[AI-148] ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
链接: https://arxiv.org/abs/2605.17324
作者: Udari Madhushani Sehwag,Zhengyang Shan,Heming Liu,Dileepa Lakshan,Joseph Brandifino,Max Fenkell
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Clarification-seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification-seeking state increases an agent’s susceptibility to prompt injection attacks. We introduce ASPI (Ambiguous-State Prompt Injection), a benchmark of 728 task-attack scenarios that isolates clarification as a distinct agent state and measures how this state transition affects vulnerability under controlled conditions. Each benchmark instance is evaluated under matched execution and clarification settings: in the execution setting, the agent acts on a fully specified instruction and encounters adversarial content only through tool-returned data; in the clarification setting, the agent must first request and incorporate additional user input before acting. We evaluate ten frontier LLMs and find that clarification-seeking consistently and substantially amplifies vulnerability. For instance, attack success rises from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash. A decomposition analysis reveals that this gap reflects both a state-dependent shift in how models process incoming content and a channel-specific effect arising from the agent-solicited clarification interface. These findings demonstrate that standard execution-time security evaluation systematically underestimates the attack surface of interactive agents, and that robustness under fully specified tasks does not translate to robustness under ambiguity. For reproducibility, our data and source code are available at this https URL.
[AI-149] Clone: Low-Latency Forking of Live GUI Environments for Computer-Use Agents
链接: https://arxiv.org/abs/2605.17320
作者: Yutong Huang,Vikranth Srivatsa,Alex Asch,Hansin Tushar Patwa,Yiying Zhang
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
备注:
Abstract:Computer-use agents increasingly operate inside live personal workspaces, where their actions can modify files, applications, GUI state, credentials, and authenticated sessions. This creates a tension between safety and quality: agents need isolation and rollback to avoid damaging user state, but also need fast branching to support speculative execution and parallel search. Existing VMs, containers, and checkpoint/restore systems can isolate or recover workloads, but they do not provide low-latency versioning of a full interactive workspace. We present TClone, a forkable personal workspace system for computer-use agents. TClone enables a live GUI workspace to be snapshotted, forked into isolated branches, rolled back, and selectively committed or merged. Its design separates fast branch creation from durable checkpointing, using sibling containers, copy-on-write memory sharing, filesystem versioning, GUI-local execution, and asynchronous checkpointing. In our end-to-end agent-loop measurement, TClone reduces total task latency by 1.9x and 1.5x over KVM and CRIU. By making workspace versioning a first-class systems primitive, TClone supports safer and higher-quality agent execution over real personal computing environments. Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17320 [cs.OS] (or arXiv:2605.17320v1 [cs.OS] for this version) https://doi.org/10.48550/arXiv.2605.17320 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-150] Learning Higher-Order Structure from Incomplete Spatiotemporal Data: Multi-Scale Hypergraph Laplacians with Neural Refinement
链接: https://arxiv.org/abs/2605.17316
作者: Keshu Wu,Sixu Li,Zihao Li,Zhiwen Fan,Xiaopeng Li,Yang Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sensor networks increasingly govern modern infrastructure, yet the data they lose are rarely missing in the uniform-random patterns assumed by standard imputation benchmarks. Loop detectors go offline during calibration, roadside cabinets silence clusters of nearby sensors, and newly installed instruments provide no history. Such failures create structured absences whose values are constrained by higher-order relations among groups of sensors, not merely by pairwise proximity. Existing low-rank and graph-based methods often miss this collective structure and can fail when missingness becomes coherent. We introduce Multi-Scale Hypergraph Laplacians (MSHL), a two-stage framework for learning higher-order structure from incomplete spatiotemporal observations. The Discovery stage builds a multi-scale hypergraph from complementary topology and residual-correlation evidence, with an observation-only selector that adapts to the supported interaction scale. The Refinement stage adds a small hypergraph-conditioned residual network that is safe by construction: it learns nonlinear corrections where informative residual features exist and defers to the linear estimate where they do not. We prove that MSHL represents group-conservation patterns inaccessible to pairwise graph priors, adapts to the best fixed scale up to a logarithmic factor, transfers this advantage to held-out imputation error, and admits a one-sided refinement guarantee. On two real traffic networks evaluated across scattered cell missingness, contiguous block outages, and whole-sensor blackouts at five rates, MSHL improves over a pairwise-graph baseline whenever higher-order structure is identifiable and otherwise matches it within sampling noise. The results point to a broader principle for reliable infrastructure learning: missing data should be treated not as isolated entries to fill, but as evidence of structure to discover.
[AI-151] Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
链接: https://arxiv.org/abs/2605.17308
作者: Yang Wu,Xiaoyan Yuan,Hau-San Wong,Xiping Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Electrocardiogram (ECG) diagnosis in clinical practice relies on structured reasoning over multiple hierarchical aspects, including cardiac rhythm, conduction properties, waveform morphology, and overall diagnostic impression. However, most existing approaches predict labels directly from ECG signals without explicit clinical reasoning, resulting in opaque decisions that lack clinical alignment. To bridge this gap, we propose CardioThink, a physician-inspired multimodal large language model (MLLM) framework that explicitly models the diagnostic reasoning process through human-interpretable intermediate stages (rhythm, conduction, morphology, and impression) to derive final classification results. Furthermore, we introduce Structured Set Policy Optimization (SSPO) to jointly optimize adherence to this structured reasoning format and the accuracy of variable-size diagnostic sets, without requiring manually annotated reasoning traces. Extensive experiments on diverse ECG benchmarks demonstrate the significant superiority of our approach in diagnostic accuracy, while simultaneously providing interpretable clinical reasoning. Notably, reasoning quality evaluations confirm that SSPO substantially enhances the clinical validity of the generated rationales. These findings reveal that moving beyond direct label prediction toward structured reasoning offers a more clinically aligned direction for future ECG modeling.
[AI-152] LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models
链接: https://arxiv.org/abs/2605.17289
作者: Mohammad Mozaffari,Younes Hourri,Mohammad Rastegari,Mahyar Najibi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execution to the pruning algorithm. State-of-the-art methods for unstructured LLM pruning are layer-wise surrogates derived from the Optimal Brain Surgeon principle, and they sacrifice end-to-end accuracy, especially under aggressive sparsity. End-to-end alternatives such as MaskLLM and PATCH show that learnable masks can close this gap, but their categorical-over-patterns parameterization scales with the number of valid masks per row and does not port to the unstructured setting. We introduce LEAP, which replaces this intractable parameterization with a per-weight Bernoulli-via-Gumbel- sigmoid relaxation that makes end-to-end unstructured mask learning tractable. Across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by +2.59 points on average over ADMM, the best layer-wise baseline in our sweep.
[AI-153] When Efficiency Backfires: Cascading LLM s Trigger Cascade Failure under Adversarial Attack
链接: https://arxiv.org/abs/2605.17288
作者: Zehan Sun,Dingfan Chen,Songze Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: under review
Abstract:Large Language Model (LLM) cascade systems are designed to balance efficiency and performance by processing queries with lightweight models while selectively escalating complex cases to more powerful ones. Such systems seek to reduces computational cost and latency while maintaining task performance, making it an appealing choice for large-scale deployment. However, the cascade design introduces new vulnerabilities through an expanded attack surface: the inclusion of lightweight front-end models and internal decision mechanisms introduces new weaknesses. In this work, we present the first study demonstrating that LLM cascade systems are susceptible to targeted adversarial manipulation, which disrupts both performance objectives and the intended cost advantages of the cascade design. We propose a novel attack framework that employs constrained sequential collaborative optimization of adversarial suffix under cascade dependencies, enabling simultaneous exploitation of lightweight models and decision mechanisms. This framework adapts to adversaries with varying capabilities, inducing controllable degradation in both cost-efficiency and accuracy. Unlike prior attacks targeting standalone models, our approach strategically leverages the cascade structure to achieve significantly stronger impact. Extensive experiments across diverse datasets and representative LLM cascade systems validate the practicality and severity of this attack. Our findings highlight the urgent need to rigorously scrutinize the security of LLM cascade systems and call for broader attention to the systemic risks inherent in such designs.
[AI-154] UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models ICLR2024
链接: https://arxiv.org/abs/2605.17285
作者: Hyunju Kang,Geonhee Han,Hogun Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2024
Abstract:Node representation learning, such as Graph Neural Networks (GNNs), has emerged as a pivotal method in machine learning. The demand for reliable explanation generation surges, yet unsupervised models remain underexplored. To bridge this gap, we introduce a method for generating counterfactual (CF) explanations in unsupervised node representation learning. We identify the most important subgraphs that cause a significant change in the k-nearest neighbors of a node of interest in the learned embedding space upon perturbation. The k-nearest neighbor-based CF explanation method provides simple, yet pivotal, information for understanding unsupervised downstream tasks, such as top-k link prediction and clustering. Consequently, we introduce UNR-Explainer for generating expressive CF explanations for Unsupervised Node Representation learning methods based on a Monte Carlo Tree Search (MCTS). The proposed method demonstrates superior performance on diverse datasets for unsupervised GraphSAGE and DGI.
[AI-155] ContractBench: Can LLM Agents Preserve Observation Contracts?
链接: https://arxiv.org/abs/2605.17281
作者: Jicheng Wang,Yifeng He,Zili Wang,Hanwen Xing,Arkaprava De,Hao Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact’s bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and SHA-256 hashes verifying byte integrity. We assign each outcome a failure label drawn from real-world API specifications. We evaluate 38 models and report four findings: (i) no evaluated model clears 80%, with Claude-Opus-4.6 leading at 77.8%, revealing that current frontier models still fail to comply with observation contracts; (ii) a sharp within-family capability cliff in Qwen 3.5 between 4B (0%) and 9B (56.6%), smoothing to 70.7% at 397B-A17B: what emerges across the cliff is mid-trajectory restraint, not tool-call competence; (iii) non-monotonic scaling across the GPT-5 family: agentic post-training can erode compliance through sycophancy-driven regression; (iv) our failure taxonomy works as an actionable in-context reward signal, yielding +7.1 pp on 42 paired GPT-5.1 failures.
[AI-156] Rover: Context-aware Conflict Resolution with LLM
链接: https://arxiv.org/abs/2605.17279
作者: Qingyu Zhang,Junzhe Li,Jiayi Lin,Changhua Luo,Chenxiong Qian
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Code merging is a significant challenge, particularly in large-scale projects. Existing solutions, including program analysis and machine learning, show promise but face critical limitations. Program analysis lacks the ability to infer developers’ intentions, relying on conservative strategies that offload unresolved conflicts for manual handling. Meanwhile, model-based approaches struggle with conflicts involving complex code dependencies due to insufficient contextual awareness. To address these gaps, we introduce Rover, a novel conflict resolution system that integrates program analysis with large language models (LLMs). To obtain context-aware prompts, we propose Multi-layer Code Property Graph (MtCPG), a new representation capturing inter-file dependencies and enabling contextual analysis for a given conflict. Using graph connectivity algorithms, Rover further clusters conflicting code and associated changes into meaningful “contexts” that guide the LLM in generating accurate resolutions. We compared Rover with standalone LLMs, machine learning baseline MergeGen, and suggestion provider tool WizardMerge with adjacent code as the contexts. Evaluation results show that Rover surpasses all of these approaches in terms of conflict resolution, achieving higher similarity to ground-truth resolutions at character, lexical, and semantic levels.
[AI-157] A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation
链接: https://arxiv.org/abs/2605.17278
作者: Qingchuan Ma,Yuexiao Ma,Yongkang Xie,Tianyu Xie,Xiawu Zheng,Rongrong Ji
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate it, we further establish a theoretical framework and prove that programmatic verification–testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)–guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) Current LLMs fall far short of 2D and 1D in the complexity of generated 3D tasks, revealing their lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
[AI-158] How Do Electrocardiogram Models Scale?
链接: https://arxiv.org/abs/2605.17276
作者: Jiawei Li,Fabio Bonassi,Ming Jin,Stefan Gustafsson,Johan Sundström,Thomas B. Schön,Antônio H. Ribeiro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over 120 models (ranging from 20 K to 200 M parameters) on the large-scale CODE dataset ( 2.3 M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are 1.3 to 2.5 times more parameter-efficient than Transformers, while SSL is up to 16 times more data-efficient and achieves up to 7.6 times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.
[AI-159] State-of-the-Art Claims Require State-of-the-Art Evidence
链接: https://arxiv.org/abs/2605.17273
作者: YongKyung Oh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark evaluations, where models are ranked by aggregate scores across tasks. Public benchmarks or leaderboards are the most visible instance, but the same structure appears in paper tables throughout the literature. However, such minimal evidence often cannot support these strong claims. We identify a widespread claim-evidence gap in AI benchmarking. Claiming SOTA carries implicit assumptions beyond mean score superiority, suggesting that a model meaningfully outperforms alternatives across most tasks. However, a marginal improvement in the mean score merely indicates a top average rank rather than true superiority. Analyzing ten cross-domain benchmarks from public leaderboards, we found that in more than half of top-model comparisons, at least one commonly assumed property of superiority does not hold. These properties include meaningful effect size, consistency across tasks, or robustness to dataset removal. Instead, aggregate gains are frequently driven by outlier datasets. This fragility persists even in benchmarks with many tasks. We argue that claim language should reflect the strength of the underlying evidence. This requires no additional experiments, only honest reporting of what results actually show, enabling more precise and interpretable comparisons across models.
[AI-160] Latency-Aware Deep Learning Benchmark for Real-Time Cyber-Physical Attack and Fault Classification in Inverter-Dominated Power Grids
链接: https://arxiv.org/abs/2605.17256
作者: Emad Abukhousa,Saman Zonouz,A.P. Sakis Meliopoulos
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This work introduces a latency-aware benchmarking framework for evaluating deep learning models in power system anomaly detection using high-fidelity, time-domain signals generated from an industry-grade electromagnetic transient simulator. Eight neural network architectures, ranging from MLPs to Transformers, were systematically evaluated on streaming datasets representing both physical faults and cyber-attacks in inverter-dominated networks. All models successfully classified two representative multi-event sequences in real time with sub-cycle response times below 15 ms. However, although classification decisions occurred within one cycle, the end-to-end inference latency consistently exceeded three cycles, ranging from 50 to 90 ms. These results highlight a critical gap between algorithmic capability and protection-grade deployment, pointing to the need for further optimization and hardware acceleration. The findings establish a reproducible benchmark for sub-cycle anomaly detection and provide guidance for transitioning machine learning methods from research prototypes to real-world protection applications.
[AI-161] CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
链接: https://arxiv.org/abs/2605.17255
作者: Wentao Long,Yunfei Zhang,Chenyi Li,Li Zhou,Chumin Sun,Zaiwen Wen
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Preprint. 44 pages, 7 figures
Abstract:Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied mathematics, with coverage spanning optimization, numerical linear algebra, and numerical analysis. These problems are adapted from textbook exercises and often depend on locally introduced definitions, notation, algorithms, and elementary results. To construct CAM-Bench, we develop a dependency-recovery pipeline that reconstructs the local textbook context needed to state each problem faithfully. It then normalizes each problem into a standalone informal theorem and translates it into a Lean target. We validate the resulting formal problems through Lean compilation and semantic review, checking both formal correctness and semantic alignment with the original exercises. For each problem, we release the raw exercise, recovered context, normalized informal theorem, and final Lean target. CAM-Bench complements existing formal mathematics benchmarks by targeting applied mathematics problems that rely on textbook concepts and elementary theorems, many of which are not directly available as standard Mathlib4 lemmas. We evaluate widely used large language models and formalization agents on CAM-Bench, and analyze common failure modes in tracking local assumptions, applying elementary results, decomposing proofs, and maintaining long-horizon control in Lean.
[AI-162] CatalyticMLLM : A Graph-Text Multimodal Large Language Model for Catalytic Materials
链接: https://arxiv.org/abs/2605.17254
作者: Yanjie Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 page
Abstract:Property prediction and inverse structural design of catalytic materials are typically modeled as two independent tasks: the former predicts target properties from given structures, whereas the latter generates candidate structures according to desired properties. Although the decoupled paradigm facilitates the implementation of a generation--evaluation--screening'' workflow, the inconsistency between the generative model and the property prediction model in terms of representation spaces and training objectives can readily introduce data distribution shifts and evaluator bias, thereby limiting the stability of closed-loop optimization. In this work, we propose QE-Catalytic-V2, a unified graph--text multimodal large language model for catalytic materials, which integrates property prediction and inverse design within the same model and shared representation space. Under this unified framework, QE-Catalytic-V2 can not only perform reliable property prediction by leveraging three-dimensional structures and textual information, but also generate and screen physically feasible CIF candidates conditioned on target properties, thereby forming a closed-loop optimization workflow of inverse design–prediction–screening–redesign.‘’ Experimental results demonstrate that this unified paradigm outperforms decoupled baselines on both catalytic relaxed-energy prediction and inverse design tasks, validating the effectiveness of jointly modeling property prediction and structure generation within a single multimodal model. Comments: 30 page Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17254 [cs.AI] (or arXiv:2605.17254v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.17254 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-163] owards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate
链接: https://arxiv.org/abs/2605.17247
作者: Zheqin Yin,Yupei Ren,Yadong Zhang,Yujiang Lu,Man Lan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Argumentative essays serve as a vital medium for assessing critical thinking and reasoning skills, yet there is limited works on accurately understanding and evaluating such texts via prompt. In this work, we propose TIDE, a novel framework designed to improve criteria-based prompt optimization for argument-related tasks by integrating TrIal and DEbate mechanism. Our method addresses key limitations of criteria-based prompt optimizing by mitigating the influence of noisy training data and enhancing optimization stability. We evaluate TIDE on three core tasks: Automated Essay Scoring, Argument Component Detection, and Argument Relation Identification. Results demonstrate that our framework improves performance across tasks. These findings underscore the potential of combining prompt-based methods for advanced argument understanding.
[AI-164] Fidelity Probes for Specification–Code Alignment
链接: https://arxiv.org/abs/2605.17246
作者: Ferhat Erata,Hao Zhou,Luke Huan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages, 14 figures, 11 tables
Abstract:We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into contradiction and coverage-gap rates that drive targeted spec edits to convergence. On a 15-program, roughly 12k-line COBOL benchmark (AWS CardDemo), we raise frozen-test specification fidelity from 0.63 to 0.94 over eight iterations, with the plateau location predicted by a two-state Markov fixed point F^\dagger from just four iterations of rate data. Probes come from an LLM reading the code or from a static-analysis pipeline over its control-flow, data-flow, and system-dependence graphs, with a tunable mixture. A probe-resampling protocol with a frozen held-out set gives a Hoeffding-bounded overfitting discriminant; our measured train/test gap stays more than an order of magnitude below this envelope. Three graph-grounded mixtures lift fidelity by +16 to +30 points; cross-distribution evaluation shows the LLM and symbolic channels are empirically complementary. A cross-family generator sweep on five independent LLM lineages (Anthropic, DeepSeek, Google, Alibaba, OpenAI) confirms the convergence behaviour is not tied to any single model family: three of five non-Claude generators produce trajectories consistent with the Markov fixed-point prediction, and the frozen-test protocol actively falsifies the two generators whose probe distributions drift across iterations. The method applies to any pair of artifacts that are supposed to describe the same behaviour.
[AI-165] Drift Flow Matching
链接: https://arxiv.org/abs/2605.17244
作者: Chenrui Ma,Xi Xiao,Lin Zhao,Tianyang Wang,Ferdinando Fioretto,Yanning Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Iterative generative models such as Flow Matching and Diffusion models have demonstrated strong test-time scaling behavior, where additional inference computation can improve generation quality. In contrast, Drift Models offer efficient one-step generation, but their direct generation paradigm limits such flexibility. In this work, we propose Drift Flow Matching (DFM), a framework that connects drifting generative modeling with flow-based iterative generation. DFM preserves the efficiency of direct transport maps while enabling generation to be refined through multiple inference steps when desired. This bridges the gap between one-step Drift Models and multi-step Flow Matching methods, and provides a novel generative paradigm that can adapt sampling computation to different quality–efficiency requirements. Extensive experiments across different tasks and datasets demonstrate the effectiveness and generality of the proposed framework.
[AI-166] Integration of AI in Cybersecurity: Current Trends with a Focused Look at Intrusion Detection Applications
链接: https://arxiv.org/abs/2605.17219
作者: S. Tazili,A. Mansour,M. Y. Chkouri
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
备注: Accepted at AI2SD 2025. Forthcoming in Springer Lecture Notes in Networks and Systems (2026). Please cite this preprint as indicated in the paper!
Abstract:Artificial Intelligence (AI) is widely adopted today for its ability to detect patterns, automate tasks, and reduce time and cost across various applications. Its integration into Cybersecurity has garnered significant attention, particularly in areas such as intrusion detection, malware analysis, and phishing or spam detection. As AI and cybersecurity evolve, new methods and approaches emerge regularly. Current trends include the use of Generative AI, Natural Language Processing, Federated Learning for privacy-preserving collaborative training, and eXplainable AI to ensure interpretability and trust, which are vital in cybersecurity. This paper presents an interesting review of current AI-based cybersecurity trends, focusing on intrusion detection approaches and aiming to uncover meaningful insights through comparative analysis based on the employed AI techniques and reported performance.
[AI-167] Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies
链接: https://arxiv.org/abs/2605.17204
作者: Xinchen Jin,Aditya Chatterjee,Pranav Kumar,Rohan Paleja
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of \pi_0.5 . SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at \urlthis https URL.
[AI-168] MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition
链接: https://arxiv.org/abs/2605.17181
作者: Abhimanyu Kaushik
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures
Abstract:Learning the violin is harder than it looks. Unlike piano keys or guitar frets, the violin neck has no markings at all, so a beginner cannot tell by looking where to place each finger. MusicSynth is an open-source web tool that tries to fix that: user uploads a photo of any violin sheet music (or a digital score file), and the system automatically produces a video showing a violin fingerboard with each note highlighted at the right moment – no software to install, no manual note entry required. The system connects three existing open-source tools into one pipeline: an optical music recognition (OMR) library reads the notes from the uploaded image, a MusicXML parser extracts timing information from digital scores, and a video renderer draws the fingerboard frame by frame. The only part built from scratch is the lookup table that maps each musical note to a string and finger position on the violin. Tested across 110 public-domain violin scores, MusicSynth correctly identified 91.2,% of notes in clean printed music and assigned the right finger position 99.1,% of the time when given a digital score file. To the author’s knowledge, no freely available tool currently turns a sheet music image into an animated violin fingerboard tutorial automatically and in a single browser-based step. Comments: 12 pages, 4 figures Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.17181 [cs.SD] (or arXiv:2605.17181v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2605.17181 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-169] CAREBench: Evaluating LLM s Emotion Understanding by Assessing Cognitive Appraisal Reasoning
链接: https://arxiv.org/abs/2605.17176
作者: Zhaoyue Sun,Hainiu Xu,Andero Uusberg,James J. Gross,Petr Slovak,Yulan He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages,18 figures
Abstract:Emotion understanding is a core capability for LLMs to interact effectively with humans, yet existing evaluation paradigms rely on discrete emotion label prediction and fail to capture the cognitive processes underlying emotion generation. Grounded in appraisal theory, we introduce CAREBench, the first benchmark with complete inferential chain annotations from both first- and third-person perspectives on real-world narratives, spanning appraisal reasoning, appraisal ratings, and multi-label emotion annotation. We propose a process-level evaluation framework and conduct systematic experiments across six LLMs organized around four research questions. We find that stronger models match or surpass human observers on certain tasks, yet fall short on appraisal reasoning and positive emotion recognition; performance across chain steps and sensitivity to appraisal interventions exhibit dissociations across models; and current models have not internalized the mechanisms needed to capture human subjective heterogeneity. These findings suggest that downstream emotion prediction metrics may overestimate LLMs’ true emotion understanding, and CAREBench provides a foundation for more diagnostically informative evaluation of LLMs’ affective cognitive capabilities.
[AI-170] Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation
链接: https://arxiv.org/abs/2605.17174
作者: Shuyin Ouyang,Zhaozhi Qian,Faroq AL-Tam,Muhammad AL-Qurishi,Jie M. Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning (RL) is an important paradigm for aligning Diffusion Language Models (DLMs) toward functional correctness in code generation. However, these models often encounter a ``capability cliff’’ on complex tasks, where execution-based semantic rewards become too low to provide a viable learning signal. In this paper, we present a systematic empirical study of RL post-training for diffusion-based code generation along three axes: reward design, hint-conditioned sampling, and task difficulty. We investigate the effectiveness of execution-free rewards as alternatives to traditional unit-test execution, the role of training-time hint-conditioned diffusion sampling in mitigating exploration bottlenecks, and the impact of these design choices varies across tasks with different difficulty levels. Across HumanEval, MBPP, and LiveCodeBench, we find that static checking is the strongest overall standalone execution-free reward in our setting, especially improving DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench while reducing rollout time by 9.4%. We further find that moderate AST-based hinting is most useful on harder benchmarks, while the best reward design depends strongly on task difficulty: similarity-based rewards are more effective on easier subsets, whereas static checking is more reliable on harder subsets where execution rewards are low. These findings suggest that reward design and training guidance substantially affect diffusion RL performance in our evaluated code-generation setting.
[AI-171] Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
链接: https://arxiv.org/abs/2605.17164
作者: Mengtian Yang,Zhekun Zhang,Mingheng Wu,Jianwen Yan,Hanshi Sun,Li-wen Chang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: Accepted by MLSys 2026
Abstract:Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating “what-if” Hooker Figure hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with a large-scale GPU cluster. In a practical inference deployment case, Charon discovered a configuration that improved system throughput over an engineering-tuned baseline, demonstrating its significant real-world value.
[AI-172] STRIDE-AI: A Threat Modeling Framework for Generative AI Security Assessment
链接: https://arxiv.org/abs/2605.17163
作者: Tsafac Nkombong Regine Cyrille,Franziska Schwarz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 4 pages, 5 figures , 2 tables, CIIT 2026 23rd International Conference on Informatics and Information Technologies (CIIT)
Abstract:Traditional cybersecurity methodologies target deterministic systems and fail to address the probabilistic nature of AI, leaving systems vulnerable to attack vectors such as model inversion, data poisoning, and prompt injection. Recent industry reports indicate that a majority of organizations deploying AI lack a dedicated security strategy, with adversarial attacks increasing rapidly year-over-year. We present \textitSTRIDE-AI, a framework that bridges the gap between high-level risk standards (NIST AI RMF) and technical vulnerability taxonomies (OWASP LLM Top 10). The framework defines a six-phase assessment lifecycle, introduces a threat modeling adaptation of classical STRIDE for AI systems, and is operationalized through a purpose-built web tool. We provide an initial validation of the approach through a black-box assessment of a deployed LLM chatbot, which successfully reduced the attack success rate from 80% to 15% in our sandbox case study.
[AI-173] From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning
链接: https://arxiv.org/abs/2605.17162
作者: Ján Klačan,Sizhong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 8 figures
Abstract:This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo sampling and lookahead search. Guided by a progressively more complex experimental design, we first evaluate a supervised learning agent (MLPBot) trained on replay data and then a reinforcement learning agent (RLBot) with the same shallow architecture trained through asynchronous Monte Carlo updates and experience replay. The results show that supervised imitation does not generalize well enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces substantially stronger agents. In the setting that focuses on the depth parameter of RdeepBot, the best performance is achieved when the learned value function is combined with deeper lookahead during gameplay, allowing RLBot to achieve statistically significant higher winning rates against the strongest evaluated RdeepBot baseline. In the sample-based setting, the gains are more conditional: the strongest performance appears at a relatively lower training num_samples parameter rather than increasing uniformly with stronger sampling.
[AI-174] Evolutionary Extreme Learning Machine of ab-initio Energy Landscapes for Crystal Structure Prediction using Manta Ray Optimization with Levy Flight
链接: https://arxiv.org/abs/2605.17148
作者: Adrian Rubio-Solis
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures
Abstract:The Manta Ray Foraging Optimization algorithm (MRFO) has proven to be a powerful heuristic strategy in the optimal solution of a large number of engineering problems. In this paper, an improvement of MRFO with Levy Flight is suggested for the training of extreme learning machines (ELMs) whose basic model is a Single Layer Feedforward Network (SLFN). The proposed methodology that we called Evolutionary EELM-MRFO-LF for short is implemented to the prediction of unrelaxed and relaxed formation energy compounds relative to ground state crystal structure of pure components in binary systems. EELM-MRFO-LF follows the learning procedure of traditional Evolutionary ELMs in which first MRFO with LF is used to select the input weights and Moore-Penrose (MP) generalized inverse is applied to analytically determine the output weights. Levy Flight trajectory is implemented for increasing the diversity of the population of ELMs against premature convergence and the ability of avoiding getting trapped in a local optima. The performance of the suggested EELM-MRFO-LF is compared with other well-known nature-inspired algorithms under similar conditions.
[AI-175] Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States NEURIPS2026
链接: https://arxiv.org/abs/2605.17144
作者: Miranda Muqing Miao,Subin Kim,Brandon Yang,Lyle Ungar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to NeurIPS 2026
Abstract:Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a “conceptor”, a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves absolute mean simulation and real-robot task success rate by over 20 and 40% respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert’s decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own “success” distributions.
[AI-176] Dynamics of collective creativity in AI art competitions
链接: https://arxiv.org/abs/2605.17141
作者: Mason Youngblood,Jeff Nusz,Joel Simon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Creativity is a fundamental aspect of how culture evolves, yet the mechanisms by which groups produce novelty are notoriously difficult to infer from the historical record. Iterated learning experiments have shown that cultural transmission reliably distorts artifacts toward the inductive biases of learners, but most of this work uses linear chains between human participants, leaving open how these dynamics play out in the networked, human-AI systems that increasingly shape cultural production. In this study, we leverage one such system, Artbreeder, which hosts daily “remix parties” where users iteratively build on each other’s work from a single seed image, producing branching lineages of human-AI co-created images. We analyze a dataset of 130,882 images from 368 remix parties over 13 months and find that images become simpler and converge toward common thematic “attractors” (e.g., steampunk scenes, alien architecture). We also find that while more novel “parent” images produce more novel and complex “children” that attract more likes, users paradoxically prefer to remix images that are less novel and complex. Finally, larger remix parties produce more novelty at the cost of lower complexity.
[AI-177] Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design
链接: https://arxiv.org/abs/2605.17137
作者: Cheikh Ahmed,Mahdi Mostajabdaveh,Zirui Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at LION 2026, The Learning and Intelligent Optimization Conference
Abstract:The integration of Large Language Models (LLMs) into evolutionary frameworks has established a new paradigm for automated heuristic discovery. Despite their promise, these methods typically search in the discrete space of program syntax, relying on stochastic sampling to navigate a highly non-convex optimization landscape. This work proposes a continuous heuristic discovery framework that shifts optimization to a learned latent manifold. We employ an encoder to map discrete programs into continuous embeddings and train a differentiable surrogate model to predict performance, enabling gradient-based search. To regularize the optimization trajectory, an invertible normalizing flow maps these embeddings to a structured Gaussian prior, where we perform gradient ascent. The resulting optimized latent vectors are projected through a learned mapper into soft prompts, which condition a frozen LLM to synthesize novel executable heuristics. We evaluate the proposed method on the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), the Knapsack Problem (KSP), and Online Bin Packing (OBP). Empirical results demonstrate that continuous latent-space optimization achieves performance competitive with state-of-the-art discrete evolutionary baselines while offering a complementary methodological alternative for automated algorithm design. The implementation code is available at \urlthis https URL.
[AI-178] New Wide-Net-Casting Jailbreak Attacks Risk Large Models ICML2026
链接: https://arxiv.org/abs/2605.17128
作者: Qiuchi Xiang,Haoxuan Qu,Hossein Rahmani,Jun Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026; project page at this https URL
Abstract:Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide-net-casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide-net-casting scenario. With this tailored method, the jailbreak success rate can even reach 100% in some experiments when targeting the large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.
[AI-179] F2IND-IT! – Multimodal Fuzzy Fake Indian News Detection using Images and Text
链接: https://arxiv.org/abs/2605.17115
作者: Kushal Trivedi,Murtuza Shaikh,Khushi Singh,Jeevaraj S.
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figure
Abstract:Biased manipulation of facts across regional and national media outlets complicates misinformation detection in diverse landscapes like India. This paper introduces a novel multimodal framework combining visual and textual modalities for enhanced fake news detection on Indian media. The architecture utilizes a ResNet-50 Convolutional Neural Network to extract visual features from news images, a DistilBERT encoder to obtain textual semantic embeddings, and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to generate a fuzzy reliability score. A lightweight attention-based fusion module assigns learnable weights to each modality prior to classification. Evaluated on the IFND dataset, the proposed model is validated through an in-depth comparative analysis against previous research. Experimental results demonstrate superior performance across accuracy, precision, recall, and F_1 -scores, confirming the efficacy of the architecture.
[AI-180] Capturing LLM Capabilities via Evidence-Calibrated Query Clustering
链接: https://arxiv.org/abs/2605.17110
作者: Fangzhou Wu,Sandeep Silwal,Qiuyi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 45 pages
Abstract:Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
[AI-181] DynMuon: A Dynamic Spectral Shaping View of Muon
链接: https://arxiv.org/abs/2605.17109
作者: Fangzhou Wu,Rikhav Shah,Sandeep Silwal,Qiuyi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix M=U\Sigma V^\top with its polar factor UV^\top . In this work, we consider a class of Muon-like updates, where we replace the update M with U\Sigma^p V^\top for some parameter p . We call this a “spectral-shaping” operation, and develop a theory of how to pick p which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and © training stage. Our theory and experimentation reveal a previously overlooked behavior: positive p helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative p helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules p from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss.
[AI-182] Scientific Logicality Enriched Methodology for LLM Reasoning : A Practice in Physics ICML2026
链接: https://arxiv.org/abs/2605.17104
作者: Zhaoxin Yu,Nan Xu,Kun Chen,Jiahao Zhao,Lei Wang,Wenji Mao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs’ performance on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of the scientific reasoning process – logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality-enriched methodology, including a set of assessment criteria and data sampling methods for logicality-guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high-quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems. Code is available at \hrefthis https URLthis https URL.
[AI-183] How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
链接: https://arxiv.org/abs/2605.17077
作者: Bosung Kim,Ruiyi Wang,David Acuna,Jaehun Jung,Alexander Trevithick,Brandon Cui,Yejin Choi,Prithviraj Ammanabrolu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.
[AI-184] AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation
链接: https://arxiv.org/abs/2605.17071
作者: Shiying Yu,Jielei Wang,Guoming Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods predominantly rely on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, where models tend to follow stereotypical token orders and high-frequency report templates rather than fully grounding generation in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By leveraging bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to assign clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks demonstrate that AnchorDiff achieves state-of-the-art (SOTA) performance, showing the effectiveness of clinically anchored masked diffusion for radiology report generation.
[AI-185] owards Human-Level Book-Writing Capability
链接: https://arxiv.org/abs/2605.17064
作者: Jan Zierstek,Matteo Batelic,Maya Medjad,Tim Schönenberger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 3 figures
Abstract:Large language models optimized for instruction following and agentic tasks remain poorly aligned with the requirements of high-quality creative writing. Fiction frequently depends on behaviors that assistant-tuned models are explicitly trained to avoid, particularly deception, moral ambiguity, and unreliable narration. As a result, generated stories often appear structurally correct while remaining stylistically generic, overly explanatory, or weakly grounded in human literary behavior. We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure. We then invert this hierarchy during training: the model learns to expand a prompt into increasingly detailed plans and finally into the original human-authored book text. This formulation preserves human prose as the final supervised target while using intermediate summaries to make book-scale generation learnable. We train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose and toward human literary writing.
[AI-186] PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models ACL2026
链接: https://arxiv.org/abs/2605.17044
作者: Wenlong Shi,Jianxun Lian,Mingqi Wu,Haiming Qin,Mingyang Zhou,Xing Xie,Naipeng Chao,Hao Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings
Abstract:Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs’ role-playing capabilities, advancing the development of more authentic and socially adept AI agents.
[AI-187] Evidential Information Fusion on Possibilistic Structure
链接: https://arxiv.org/abs/2605.17038
作者: Qianli Zhou,Ye Cui,Zhen Li,Witold Pedrycz,Yong Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Dempster’s rule is a fundamental tool for combining belief functions from distinct and reliable sources. However, its intersection-based semantics imposes strong structural restrictions, which limits its flexibility in handling complex source states and diverse information fusion scenarios. To overcome this limitation, we propose a reversible transformation, derived from the isopignistic principle, between belief functions and a possibilistic structure defined on the power set. In this transformation, the relationships among subsets are explicitly characterized by a belief evolution network, which provides a more flexible representation of evidential information beyond the conventional mass function structure. On this basis, we further introduce the triangular norm family to develop a general and adaptive evidential information fusion framework. Unlike fusion methods rooted in Dempster semantics, the proposed framework supports more flexible combination behaviors and exhibits advantages in non-distinct source fusion, conflict management, parametric combination design, and heterogeneous information fusion.
[AI-188] Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2605.17034
作者: Osama Zafar,Alexander Nemecek,Yiqian Zhang,Wenbiao Li,Debargha Ganguly,Vikash Singh,Vipin Chaudhary,Erman Ayday
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators with fused text embeddings and a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we found that traditional Gaussian Mixture baselines fail on borderline-safe stress tests by focusing on linguistic register rather than content. Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44-55 percentage points and maintaining millisecond latency. Compared to supervised MLP classifiers or 14B-parameter LLM judges, our framework offers superior operational suitability, as the former suffers from high abstention rates and the latter from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2605.17034 [cs.LG] (or arXiv:2605.17034v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17034 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-189] ask Abstention for Large Language Models in Code Generation
链接: https://arxiv.org/abs/2605.17029
作者: Yanke Zhou,Yuhao Tan,Senrong Xu,Zenan Li,Yuan Yao,Taolue Chen,Xiaoxing Ma
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures
Abstract:Large language models (LLMs) have revolutionized automated code generation. One serious concern, however, is the so-called ``hallucination’', i.e., LLMs may generate seemingly plausible but functionally incorrect code. In this paper, we study the task abstention problem, i.e., determining whether a given LLM should abstain from performing a specific code generation task to avoid likely hallucination. Our approach features a calibrated abstention rule, grounded in the principles of multiple hypothesis testing. The rule assesses generation consistency through code execution outcomes, allowing it to handle syntactic diversity of semantically equivalent code without reliance on oracle test cases or external databases. We prove that our approach provides a rigorous, distribution-free theoretical guarantee on its abstention decisions. We evaluate our method on benchmark datasets using several open-source code LLMs. Results show that our method allows generative models to more accurately and efficiently identify and abstain from tasks that induce hallucination compared to existing techniques, providing a reliable mechanism for safer and more robust code generation.
[AI-190] A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification
链接: https://arxiv.org/abs/2605.17021
作者: Yunzhi Tian,Dekui Wang,Qirong Bu,Wei Zhou,Xingxing Hao,Jun Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures
Abstract:Multi-view learning has been widely applied for sleep stage classification using multi-modal data. However, existing methods typically assume that different modalities are well-aligned, which is often unattainable in real-world scenarios, thereby compromising the reliability of the staging results. In this paper, we propose ConfSleepNet, a conflict-aware evidential framework that dynamically resolves inter-view conflicts. The framework consists of multi-view evidence extraction and conflict-aware aggregation. In the first phase, it learns category-related evidence from different modalities, which represents the degree of support for individual sleep stages. Considering the inherent characteristics of varying modalities, we propose hybrid category structures for different modalities to promote more reasonable evidence learning. In the second phase, view-specific opinions, including prediction results and uncertainty, are constructed from the learned evidence. Notably, we propose a novel conflict-aware aggregation method that integrates these view-specific opinions into a reliable joint decision. This mechanism can effectively resolve conflicts among opinions and synthesize them into a reliable joint decision. Both theoretical analysis and experimental results demonstrate the effectiveness of ConfSleepNet in sleep staging tasks. The code is available at this https URL.
[AI-191] When Dynamics Shift Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited
链接: https://arxiv.org/abs/2605.17017
作者: Rishabh Agrawal,Rahul Jain,Ashutosh Nayyar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task-inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policy can be achieved entirely at task-inference time, improving the practicality of BFMs in dynamic settings.
[AI-192] Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
链接: https://arxiv.org/abs/2605.17003
作者: Peng Cui,Boyao Yang,Jun Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model’s current capability. To address this fundamental inefficiency, we propose Learning-Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model’s active learning frontier. At its core, we define a closed-form Learning-Zone Energy Score that fuses three complementary signals, an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum, into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates. A forward pruner with replay further reduces wall-clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen-family models (1.5B-8B) across GSM8K, MATH and DAPO-MATH, our method retains only 40% of the training data per step yet matches or surpasses full-data baselines, with especially pronounced out-of-distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at this https URL.
[AI-193] BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks
链接: https://arxiv.org/abs/2605.17000
作者: Ruth Wan Theng Chew,Zhiliang Chen,Apivich Hemachandra,Bryan Kian Hsiang Low
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative-free optimization problems, Bayesian optimization (BO) and other black-box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample-efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small-scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM-centric benchmark that democratizes LLM research for the BBO community. BoLT is released at this https URL. BoLT covers broad and well-motivated LLM optimization problems, involving multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.
[AI-194] Adversarial Frag ility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings
链接: https://arxiv.org/abs/2605.16993
作者: Anthonio Oladimeji Gabriel,Ahmad Rufai Yusuf
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 9 figures, 3 tables. Code and data available at this https URL
Abstract:Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
[AI-195] Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons
链接: https://arxiv.org/abs/2605.16975
作者: Wei Tang,Jinpei Han,Kangning Cui,Mattia Carletti,Fredrik K. Gustafsson,Shreyank N Gowda,Patitapaban Palo,Anshul Thakur,Lei Clifton,Jean-michel Morel,Raymond H. Chan,David A. Clifton,Xiao Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments, have demonstrated strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are typically longer, and are varied in duration during inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibilities arising from input-length disparities, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones demonstrate that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.
[AI-196] Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms
链接: https://arxiv.org/abs/2605.16969
作者: Anni Zhao,Alex Bateh,Tyler Baldridge,Sandra Billinger,Xiao Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Defining vascular age in terms of physiological function has become one focal point of the extensive studies to categorize and track chronological age. Transcranial Doppler (TCD) is a method by which cerebral blood flow velocity is measured along the major arteries feeding the human brain. This study aims to use features extracted from TCD to estimate chronological age and assess accelerated aging in subjects with various brain diseases. We predict subjects with various brain diseases to present with accelerated cerebrovascular aging when tested on various regression models trained by healthy subjects. 168 healthy subjects and 277 diseased subjects with bilateral TCD recordings of the middle cerebral artery were analyzed using the Morphological Analysis and Clustering of Intracranial Pressure (MOCAIP) algorithm. MOCAIP-generated features and heart rate variability features were used as input features for regression models to predict the brain vascular age. 66 subjects with acute stroke, 27 subjects with post stroke, 26 subjects with Alzheimer’s disease, 23 subjects with mild cognitive impairment, and 135 established subjects were tested against the machine learning model to assess for accelerated cerebrovascular age. The trained model, on average, predicted healthy subjects’ cerebrovascular age to be 3.69 years above their chronological age. Subjects with different disease conditions exhibited varying levels of age acceleration. The differences in healthy and diseased subjects’ performances suggest that features generated using TCD may be relevant when evaluating accelerated cerebrovascular aging. Moreover, imbalanced datasets have been observed to affect the performance of machine-learning-based brain age prediction models.
[AI-197] Harnessing AI for Inverse Partial Differential Equation Problems: Past Present and Prospects
链接: https://arxiv.org/abs/2605.16966
作者: Zhentao Tan,Yuze Hao,Boyi Zou,Mingsheng Long,Yi Yang,Gang Bao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 35 pages, 4 figures
Abstract:Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research due to its broad significance across a wide range of real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize it into three major categories: inverse problems, inverse design, and control problems. For each category, we further present a methodological paradigms, and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.
[AI-198] From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction
链接: https://arxiv.org/abs/2605.16927
作者: Pujun Feng,Xiaoyu Guo,Seyed Ehsan Saffari,Min Hun Lee,Siew-Kei Lam,Erik Cambria,Xibin Sun,Yangtao Zhou,Tong Yang,Xiaoyu Zhang,Tao Tan,Yue Sun,Bin Cui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians’ measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI–methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.
[AI-199] OBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
链接: https://arxiv.org/abs/2605.16909
作者: Zhiqiang Liu,Wenhui Dong,Yilang Tan,Yuwen Qu,Haochen Yin,Chenyang Si
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Github: this https URL
Abstract:Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.
[AI-200] NGM: A Plug-and-Play Training-Free Memory Module for LLM s
链接: https://arxiv.org/abs/2605.16893
作者: Yuwen Qu,Wenhui Dong,Chenyang Si,Caifeng Shan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code is available at this https URL
Abstract:Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).
[AI-201] Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
链接: https://arxiv.org/abs/2605.16880
作者: Sha Tao,Jiao Pan,Yu Guo,Chao Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026
Abstract:Multimodal magnetic resonance imaging (MRI) is crucial for brain tumor segmentation, with many methods leveraging its four key modalities to capture complementary information for effective sub-region analysis. However, the absence of several modalities is very common in practice, leading to severe performance degradation in existing full-modality segmentation methods. Limited by the structured data model, recent works often adopt a multi-stage training strategy for full-modality and missing-modality scenarios, which increases training costs and inadequately addresses the interference of miss. In this work, we propose a graph-based one-stage framework for robust brain tumor segmentation with missing modalities. Specifically, we introduce modality-specific virtual nodes that serve as supplementary information sources to compensate for missing modalities. To enhance model robustness against arbitrary modality combinations, we leverage the inherent flexibility of graph networks to devise a dynamic connection strategy. This mechanism dynamically adjusts the adjacency matrix based on modality availability, preserving beneficial information flow while mitigating interference effects caused by missing modalities. Furthermore, we enhance the graph network through heterogeneous weight matrices, enhancing its adaptability to multimodal scenarios. Extensive experiments on the BRATS-2018 and BRATS-2020 datasets demonstrate that our method outperforms the state-of-the-art methods on almost all subsets of incomplete modalities.
[AI-202] Reasoning Can Be Restored by Correcting a Few Decision Tokens ICML2026
链接: https://arxiv.org/abs/2605.16874
作者: Changshuo Shen,Leheng Sheng,Yuxin Chen,An Zhang,Xiang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026
Abstract:Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during token-by-token generation and how to narrow this gap efficiently. We study the base-reasoning gap through quantifying token-level distributional disagreement between a base model and a stronger reasoning model using likelihood-based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning-related decision tokens. For instance, on Qwen3-0.6B, only ~8% of generated tokens account for the salient disagreement, and these tokens concentrate early in the response, are strongly enriched in planning-related decisions (17x), and coincide with high base-model uncertainty – suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement-guided token intervention, a simple inference-time delegation scheme that performs a one-token takeover by the reasoning model only at high-disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same-size reasoning model on challenging reasoning tasks. Code is available at this https URL.
[AI-203] Some[Body] Must Receive That Pain for Agent Accountability
链接: https://arxiv.org/abs/2605.16872
作者: Botao Amber Hu,Helena Rong
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents increasingly act consequentially in the real world. This creates a problem we call \emphconsequence reception: harm occurs, the producing system is identified, yet no continuing agent receives consequences in a way that changes future behavior. Pain, understood mechanistically as a corrective feedback signal, is foundational to canonical theories of punishment – deterrence, rehabilitation, retribution, and incapacitation all assume a continuing locus that registers the signal and updates behavior. That, in turn, requires a body for the signal to land on: a boundary whose integrity it protects, a locus where it accumulates, consolidation that converts episodic signal into durable update, and a substrate that responds by altering future action. Current LLM agents – software-defined composites of weights, prompts, tools, memory, and credentials, freely swapped, copied, reset, and reassembled – satisfy none of these conditions. The two prevailing legal responses therefore fail to achieve consequence reception. The thin-identity agent-principal dyad has a body but no \emphconsequence–agency coupling: the human bears pain for behaviors beyond their control – Elish’s \emphmoral crumple zone. The thick-identity Arbel et al.'s \emphAlgorithmic Corporation creates legally legible entities but does not guarantee that any AI decision architecture receives pain as a behavioral signal. Achieving consequence-agency coupling is therefore a sociotechnical infrastructural problem, not only a legal one. Until such architectures exist, high-stakes AI deployment should remain tethered to accountable human principals with meaningful control, proportional liability, and authority to constrain or terminate the agent. \emphIf some body does not receive the pain by design, some body will receive it by default.
[AI-204] Plan First Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning
链接: https://arxiv.org/abs/2605.16863
作者: Yaniv Hassidof,Adir Morgan,Yilun Du,Kiril Solovey
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph – serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: this https URL
[AI-205] PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes
链接: https://arxiv.org/abs/2605.16860
作者: Phat Tran,Neville Mehta,Clara Mosquera-Lopez,Robert H. Dodier,Lizhong Chen,Peter G. Jacobs
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from systematic negative bias at longer horizons due to error compounding, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best-fitting physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that bounds long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL over the recursive LSTM and reducing mean absolute error by 28.62 mg/dL over the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.
[AI-206] Pedestrian-Aware LLM -Driven Behavioral Planning for Autonomous Vehicles ITSC
链接: https://arxiv.org/abs/2605.16858
作者: Aidana Baimbetova,Haruki Yonekura,Hamada Rizk,Hirozumi Yamaguchi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for presentation at the 29th IEEE International Conference on Intelligent Transportation Systems (ITSC)
Abstract:Autonomous Vehicles (AVs) must make reliable decisions in dense urban environments where pedestrian behavior is variable, sometimes abnormal, and often unseen during training. Reinforcement learning (RL)-based AV control systems perform well in structured traffic but struggle to generalize to unpredictable pedestrian interactions and out-of-distribution scenarios. Their reliance on handcrafted rewards and opaque decisions further limits their suitability for safety-critical, pedestrian-rich environments. To address these limitations, we introduce a Large Language Model (LLM)-based decision-making framework for pedestrian-aware behavioral planning. The system converts structured scene observations into natural-language reasoning prompts, enabling the LLM to infer pedestrian intent, anticipate risk, and generate cautious tactical driving decisions. These decisions are executed by a motion planner that ensures smooth, kinematically feasible control. We evaluate the framework in SUMO across multiple pedestrian-interaction scenarios, including unexpected jaywalking, turn-back crossing, hesitation, and bidirectional crossing. In zero-shot evaluation, the LLM-based agent achieves a 68% collision-free success rate, substantially outperforming deep RL baselines (17.7%). With few-shot episodic memory in a single-pedestrian scenario, performance increases to 96.0%, exceeding a custom DQN controller (82.0%). Cross-behavior evaluation further shows that memory derived from turn-back interactions transfers to unseen hesitation and bidirectional crossing scenarios, achieving 82.0% and 90.0% success, respectively. The system consistently initiates earlier responses, maintains wider safety buffers, and produces interpretable, human-aligned decisions.
[AI-207] Learning to Learn from Multimodal Experience
链接: https://arxiv.org/abs/2605.16857
作者: Xingyu Sui,Weixiang Zhao,Yongxin Tang,Yanyan Zhao,Yang Wu,Dandan Tu,Bing Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Experience-driven learning has emerged as a promising paradigm for enabling agents to improve from interaction trajectories by accumulating and reusing past experience. However, existing approaches are predominantly developed in textual settings and rely on manually designed memory schemas, limiting their applicability to multimodal environments. In real-world scenarios, experience is inherently multimodal, involving heterogeneous signals across perception, reasoning, and action, which makes effective memory design significantly more challenging. In particular, the optimal way to structure and utilize multimodal experience is highly task-dependent and evolves over time, rendering fixed memory designs insufficient. In this work, we propose a new paradigm, learning to learn from multimodal experience, which shifts memory design from a predefined component to an adaptive and learnable process. Our framework enables agents to dynamically construct, organize, and utilize memory based on task requirements and interaction history, effectively learning how to structure experience for improved performance. Experiments demonstrate that adaptive memory design substantially enhances agent performance and generalization across multimodal tasks, highlighting the critical role of learning memory mechanisms in experience-driven learning.
[AI-208] Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence
链接: https://arxiv.org/abs/2605.16844
作者: Boris Kriuk
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Between the narrow systems we deploy and the general intelligence we speculate about lies an entire regime of machine behavior that has never received its own name. This monograph argues that this regime is not empty: it is where meta-learning, neural architecture search, AutoML, continual learning, evolutionary computation, and physics-informed modeling have quietly converged on a common principle, namely the steady removal of the human from the loop of parameter specification. We name this regime Artificial Adaptive Intelligence (AAI) and define it operationally: a system exhibits AAI to the extent that it requires no human-specified tunable hyperparameters while maintaining competitive performance across a diverse distribution of tasks. To make the definition quantitative, we introduce an adaptivity index that measures progress along an axis orthogonal to scale, combining the fraction of hyperparameters absorbed by the system with the performance ratio against a task-specialized baseline. We develop the principle of parametric minimality and ground it in the minimum description length framework, showing that the appropriate hyperparameter count is data-determined rather than designer-determined. We then organize the field around three pathways to minimality: data- and task-aware configuration, structural and evolutionary morphing, and in-training self-adaptation. We analyze their stability, convergence, and governance implications, and illustrate them through case studies spanning aerospace design, financial regime detection, turbulence modeling, ecological dynamics, and vision-language systems. The thesis is that the path from ANI to AGI passes through AAI, and that naming this stage changes what we measure, what we build, and what we call a success.
[AI-209] Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
链接: https://arxiv.org/abs/2605.16842
作者: Siqi Luo,Jianghan Shen,Yi Xin,Huayu Zheng,Haoxing Chen,Yan Tai,Yue Li,Junjun He,Yihao Liu,Guangtao Zhai,Yuewen Cao,Xiaohong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.
[AI-210] Voices in the Loop: Mapping Participatory AI
链接: https://arxiv.org/abs/2605.16827
作者: Rashid Mushkani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25–28, 2026, Montreal, QC, Canada
Abstract:Participatory approaches to artificial intelligence are increasingly documented across public, civic, and humanitarian settings, but evidence about how participation is organized remains fragmented. This paper reports on the construction of an open repository and interactive atlas of participatory AI initiatives, using records harmonized from Maga~na and Shilton’s Trustworthy AI corpus, and additional audited cases from research and practice. We contribute three elements. First, we specify a reproducible protocol for discovery, vetting, harmonization, geocoding, provenance tracking, and release-based publication of participatory AI records. Second, we report corpus-level patterns in geography, participation tiers, lifecycle loci, organizational form, verification status, and remaining documentation gaps. Documented initiatives remain concentrated in a small number of countries, while participation is most often coded at problem formulation, evaluation, and governance rather than model development or training. Third, we show how the atlas operationalizes a design and governance framework for participatory-by-default AI infrastructures through versioned releases, record-linked issue and annotation channels, schema feedback workflows, and redaction or restricted-disclosure requests. The atlas is intended to support comparative research, policy learning, and community scrutiny through a living inventory that can be updated, contested, and reused.
[AI-211] Multi-Paradigm Agent Interaction in Practice:A Systematic Analysis of Generator-Evaluator ReAct Loopand Adversarial Evaluation in the buddyMe Framework
链接: https://arxiv.org/abs/2605.16821
作者: Xiaohua Wang,Chao Han,Kai Yu,XiaoLiang Xu,Liang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 7 tables
Abstract:The rapid evolution of Large Language Model (LLM) agents has produced diverse interaction paradigms, yet few production systems integrate multiple paradigms within a unified architecture. This paper presents a systematic analysis of three principal agent interaction paradigms, including Multi-Agent Orchestration (Generator-Evaluator), ReAct Tool-Use Loops, and Memory-Augmented Interaction, as implemented in buddyMe, an open-source multi-model agent programming framework. We formalize a five-stage processing pipeline: Requirement Pre-Review - Task Decomposition - ReAct Execution - Real-Execution Verification - Adversarial Evaluation Discussion, and establish a six-dimensional evaluation schema with weighted scoring. Through four empirical case studies drawn from real-world deployment logs covering museum guide generation, scheduled weather tasks, and comprehensive tour planning, we draw three key conclusions. First, Generator-Evaluator pre-review detects requirement omissions in 20 percent of complex tasks, with 80 percent tasks passing initial inspection. Second, the ReAct loop ensures stable subtask execution but leads to around 30 percent redundant tool invocations. Third, adversarial Evaluator-Defender discussions reach consensus within 2-3 rounds for nearly 70 percent of scenarios, functioning mainly for content refinement rather than logical reversal. We additionally provide three Mermaid-based architectural diagrams and conduct cross-paradigm comparisons with CrewAI, AutoGen, LangGraph, MemGPT and A-Mem across six system dimensions. The research outcomes offer practical design guidelines for constructing stable and reliable multi-paradigm agent systems.
[AI-212] Cross-Domain Molecular Relational Learning: Leverag ing Chemical Structure-Activity Analysis KDD2026
链接: https://arxiv.org/abs/2605.16799
作者: Peiliang Zhang,Jingling Yuan,Shiqing Wu,Mengqing Hu,Chao Che,Yongjun Zhu,Lin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by SIGKDD 2026 Research Track
Abstract:Recent advances in molecular representation integrates molecular topological and visual modalities, opening new avenues for precise Molecular Relational Learning (MRL). Existing MRL methods focus on intra-domain modeling, and their inherent domain-closed effect limits applicability to molecular science, particularly in elucidating cross-domain interaction mechanisms. Consequently, the imperative for Cross-Domain Molecular Relational Learning has become increasingly pressing. Benefiting from structure-activity analysis, we propose the Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy (DisTrans) to optimize cross-domain adaptive representation for molecular structures and visual images. 1) We employ the gradient reversal strategy based on substructure topological discrepancies between domains to learn the domain dependence of molecular structures. This strategy guides the model to adapt to the structural adjacency patterns in the target domain, generating domain-separable structural representations. 2) We apply the cross-domain representation guidance mechanism to align the functional-group semantic information between the source and target domains, learning cross-domain consistency information. The experimental results in two typical cross-domain strategies demonstrate that DisTrans outperforms 16 baseline methods, maintaining satisfactory performance even under pronounced inter-domain discrepancy.
[AI-213] Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning ICML2026
链接: https://arxiv.org/abs/2605.16776
作者: Puning Yang,Junchi Yu,Qizhou Wang,Philip Torr,Bo Han,Xiuying Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML2026 Accepted
Abstract:Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ( \mathrmD^2 ), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement \mathrmD^2 , we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of \mathrmD^2 . Our code is available at this https URL.
[AI-214] Learning Unbiased Permutations via Flow Matching
链接: https://arxiv.org/abs/2605.16755
作者: Yimeng Min,Carla P. Gomes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning permutations is fundamental to sorting, ranking, and matching, but existing differentiable methods based on entropy-regularized Sinkhorn produce a single softened solution and collapse under ambiguity. We present PermFlow, a conditional flow matching framework that operates directly on the affine subspace of matrices with unit row and column sums. A closed-form tangent-space projector preserves these constraints exactly along every trajectory, by construction rather than through iterative correction, and a nearest-target coupling routes distinct noisy initializations toward distinct valid permutations. The result is a model that captures multimodal permutation distributions rather than collapsing them to a single mode. On a visual sorting task with blended-digit ambiguity and a symmetric linear assignment problem, PermFlow achieves high accuracy on unambiguous inputs and recovers both valid permutations under ambiguity, where Sinkhorn-based baselines structurally fail.
[AI-215] State Contamination in Memory-Augmented LLM Agents
链接: https://arxiv.org/abs/2605.16746
作者: Yian Wang,Agam Goyal,Yuen Chen,Hari Sundaram
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.
[AI-216] Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents
链接: https://arxiv.org/abs/2605.16728
作者: Hongju Pae
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a minimal architecture for body-grounded perspective formation in artificial agents. Extending prior work, the model introduces an interoceptive viability signal, a Fisher-style metric over fused exteroceptive-interoceptive states, and a conative alignment mechanism linking bodily tendency to action readiness. In a reward-free gridworld, conation converts learned bodily tendency into stable body-directed behavior, while body-to-perspective routing allows bodily perturbations to leave a recoverable geometric residue in the perspective latent. This study shows how minimal structural conditions for artificial subjectivity can be operationalized in the phenomenological sense, through the embodied organization of how a world is given to an agent.
[AI-217] PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
链接: https://arxiv.org/abs/2605.16727
作者: Roger Creus Castanyer,Geoffrey Bradway,Lorenz Wolf,Maxwill Lin,Augustine N. Mavor-Parker,Matthew James Sargent
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
[AI-218] A Global-Local Graph Attention Network for Traffic Forecasting
链接: https://arxiv.org/abs/2605.16726
作者: Tianchi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Traffic forecasting is a significant part of intelligent transportation systems. One of the critical challenges of traffic forecasting is to find spatio-temporal correlations. In recent years, graph convolutional networks and graph attention networks have replaced traditional statistical models to predict future traffic. However, it is complicated for both of them to allow vertices to have far different characters. To address this, we propose the Global-Local Graph Attention Network (GLGAT) with pairwise encoding and the event-based adjacency matrix. The GLGAT allows vertices to have a global attention matrix set for the whole graph and assigns local attention matrix sets to each vertex. Experiments on two real-world traffic datasets show that GLGAT can effectively capture spatio-temporal correlations and has competitive performance against other state-of-the-art baselines.
[AI-219] Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models
链接: https://arxiv.org/abs/2605.16725
作者: SeungWon Seo,DongHeun Han,SeongRae Noh,HyeongYeop Kang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment’s transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.
[AI-220] GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction
链接: https://arxiv.org/abs/2605.16714
作者: Liangyi Huang,Zichen Liu,Fei Shao,Shang Ma,Mengshi Zhang,Zihao Chen,Yanfang Ye,Xusheng Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Security knowledge graphs can provide computable external memory for security agents, but constructing them from long-form cyber threat intelligence (CTI) remains difficult: LLMs often lack grounded security-domain knowledge, and end-to-end document-to-graph training is hard to supervise with cheap, stable rewards. We present GRID (Graph Representation of Intelligence Data), an end-to-end framework for security text knowledge graph construction. GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. Using this supervision pipeline, we train two Qwen3-4B-Instruct-2507-based 4B extractors: a primary Task-bank Reward model and a secondary End2End Reward model with LLM-as-judge precision/recall rewards. On 249 CTI articles from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1, achieving the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost. The End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Further analyses show that task-bank rewards can be built once offline and reused across later post-training runs, outperforming online End2End LLM-as-judge reward and weaker alternatives such as Choice-only Reward and End2End SFT without RL.
[AI-221] EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control
链接: https://arxiv.org/abs/2605.16692
作者: Thomas Evers,Cristian Meo,Wendelin Bohmer,Justin Dauwels,Yaniv Oren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control built on the TD-MPC family of algorithms. Central to this family is a planner that aims to find an action sequence that maximizes the estimated return. The return is estimated using a learned model and value networks, each of which can introduce error. EfficientTDMPC proposes to reduce this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across those models and across different rollout depths. Second, it adds the option to apply an uncertainty penalty to the planner objective, yielding a planner that avoids actions with uncertain return estimates. It then adds practical improvements which increase buffer data freshness and reduce compute. Lastly, we find that our contributions enable EfficientTDMPC to benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low data regime of each benchmark, EfficientTDMPC achieves state-of-the-art (SOTA) in terms of sample efficiency on HumanoidBench-Hard and DMC hard, while matching SOTA on DMC easy.
[AI-222] Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment
链接: https://arxiv.org/abs/2605.16676
作者: Deniz Askin,Gal Hadar,Brendan Conway-Smith
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Metacognition-the ability to monitor one’s own knowledge state, spot gaps, and autonomously fill them–remains largely absent from modern AI. Here, we present MetaKGEnrich, a fully automated pipeline that endows large language model (LLM) applications with self-directed knowledge repair. The system (i) builds knowledge graphs from a seed query, (ii) detects sparse regions via seven graph metrics, (iii) has GPT-4o generate targeted questions, (iv) retrieves web evidence with Tavily and ingests it into Neo4j, and (v) re-answers the query with GraphRAG for GPT-4 to evaluate improvement. Tested on 30 queries from each of three widely-used datasets: Google Research Natural Questions, MS MARCO, and Hot-potQA. MetaKGEnrich improved answer quality in 80% of HotpotQA questions, 87% of Google Research Natural Questions and 83% of MS MARCO questions, while preserving well-supported regions. This proof of concept demonstrates how topological self-diagnosis plus targeted retrieval can advance AI toward humanlike metacognitive learning.
[AI-223] LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning NEURIPS2026
链接: https://arxiv.org/abs/2605.16675
作者: Shradha Agarwal,Deepak Rajbhar,Tariq J
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 42 pages, 3 figures, 12 tables. NeurIPS 2026 Evaluations and Datasets Track submission. Dataset: this https URL
Abstract:We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors – sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.
[AI-224] GraViti: Graph-Level Variational Autoencoders with Relaxed Permutation Invariance
链接: https://arxiv.org/abs/2605.16668
作者: Roman Bresson,Konstantinos Divriotis,Johannes F. Lutzeyer,Iakovos Evdaimon,Michalis Vazirgiannis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce GraViti, a transformer-based graph-level variational autoencoder that maps entire graphs to compact latent vectors. This design produces a true graph-level latent space that supports smooth interpolation, property-guided search, and other downstream tasks beyond the constraints of node-level embeddings. On molecular benchmarks, GraViti learns to decode valid samples that follow the chemical constraints present in the training data, showing that the model recovers domain rules directly from graph-level representations. We also show that, in domains where a reliable canonical node ordering exists such as molecules or bayesian networks, enforcing permutation invariance can prove detrimental for consistent reconstruction. GraViti achieves state-of-the-art reconstruction accuracy on large datasets, and provides solid generative performance. Its single-step decoding offers a lightweight alternative to more complex generation pipelines while maintaining practical sample quality.
[AI-225] E-Flash: Accelerating Reasoning -based Multimodal Representations via Think-Then-Embed Tokens
链接: https://arxiv.org/abs/2605.16638
作者: Jianpeng Cheng,Xian Wu,Jiangfan Zhang,Wentao Bao,Chaitanya Ahuja,Shlok Kumar Mishra,Hanchao Yu,Yang Gao,Fan Xia,Qi Guo,Shaodan Zhai,Xiangjun Fan,Jun Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an eos embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how think and embeddings tokens should be extracted from the same LLM backbone. 2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, and motivating a pilot study of adaptive think budget allocation based on task requirements.
[AI-226] Learning How to Cube
链接: https://arxiv.org/abs/2605.16632
作者: Ferhat Erata,Sam Kouteili,Thanos Typaldos,Timos Antonopoulos,Robert B. Jones,Byron Cook,Ruzica Piskac
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 33 pages, preprint
Abstract:Despite the effectiveness of Cube-and-Conquer (CC) for solving challenging Boolean Satisfiability (SAT) problems, no prior work has shown that transformer-based models can learn effective cubing heuristics. We introduce a neuro-symbolic post-training framework for this task. We design an MCTS-based data curation pipeline that uses symbolic heuristics to explore splitting decisions over SAT competition formulas, producing preference data grounded in solver statistics and augmented with reasoning traces from a teacher model. Our two-stage post-training, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), enables a 4B-parameter model to achieve a pass@5 score of 53 on 100 SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4 (50) and matching the best symbolic heuristic (53). Ablations show that SFT alone improves pass@5 from 46 to 51, with DPO adding 2 additional benchmarks; an entropy/agreement ablation on realized first-cube decisions further shows that SFT, not DPO, accounts for the root-level decision diversity that produces complementary per-run coverage over deterministic symbolic methods. This demonstrates that transformers can be trained to make effective cubing decisions in a domain traditionally dominated by symbolic methods.
[AI-227] textscPrivScope: Task-scoped Disclosure Control for Hybrid Agent ic Systems
链接: https://arxiv.org/abs/2605.16630
作者: Shafizur Rahman Seeam,Zhengxiong Li,Zhiyuan Yu,Yimin(Ian)Chen,Yidan Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Hybrid local–cloud agents enrich user requests with context from persistent working state before delegating capability-intensive subtasks to a cloud language model (CLM). While this enrichment can improve task success, it also exposes unnecessary information in the cloud-bound payload, including task-irrelevant context, carryover from prior workflows, and overly specific sensitive details, resulting in \emphover-disclosure. Existing solutions either isolate workflows to limit cross-workflow leakage or apply general-purpose sanitization that does not reason over LC-assembled payload scope. We present \textscPrivScope, a trusted on-device payload governor that enforces \emphtask-scoped disclosure at the local–CLM boundary, without requiring cloud-side changes. Its key idea: sensitive information should reach the cloud only when required for the delegated subtask, and then only in the least revealing form preserving utility. \textscPrivScope extracts disclosure units from the assembled payload and keeps direct identifiers and account-linked values on device. The remaining units pass through cloud-necessity control, which determines what is actually needed; units that must reach the cloud are abstracted to the least-specific representation sufficient for the task. On 100 medical-booking workflows across three commercial CLMs, \textscPrivScope eliminates profile leakage (0.0% vs.\ 17.7%), more than halves attacker re-identification (23.1% vs.\ 64.3%), and achieves the highest candidate recall on every CLM tested while preserving task success close to the unprotected baseline on GPT-4o-mini and Gemini 2.5 Flash. Gains hold across five local backbones and add only seconds of on-device latency on commodity hardware. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16630 [cs.CR] (or arXiv:2605.16630v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.16630 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-228] SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors
链接: https://arxiv.org/abs/2605.16626
作者: Elle Najt,Colin Toft,Tyler Tracy,Fabien Roger,Joe Benton
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Since autonomous coding agents generate complex behaviors at high-volume, we may want to use other LLMs to monitor actions to reduce the risk from dangerous misaligned behavior. To better understand the limitations of such monitors against the diverse attack strategies that a coding agent could use, we present SLEIGHT-Bench (Subtle Low-itEration Insight-Guided Harmful Transcripts), a benchmark of synthetic transcripts containing 40 attacks across 11 categories, each showing a coding agent covertly pursuing a harmful objective (e.g. weight exfiltration, credential theft, unmonitored rogue deployment). We find that 24 of 40 attacks are never caught by an Opus 4.6 monitor with extended thinking across 10 trials at a 1% false-positive rate, while the overall catch rate is 23%. We identify several evasion strategies such as exploiting model priors, user instruction ambiguity, and state manipulation, which facilitate creating attacks that evade frontier monitors. We also elicit stronger monitor performance using coding agents as monitors versus regular prompted monitors, and for some evasion strategies show improved catch rates with targeted monitor prompts. Our dataset and evaluation framework are available at this https URL and this https URL.
[AI-229] o Trust or Not to Trust: Authors Response to AI-based Reviews
链接: https://arxiv.org/abs/2605.16623
作者: César Leblanc,Lukas Picek
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are increasingly discussed and used as tools that may assist with scholarly peer review, but empirical evidence regarding how authors use and perceive AI-based feedback remains limited. This paper reports findings from two independent pilot studies on authors’ use and perceptions of AI-based auxiliary review at two computer science venues. After the review release, authors were invited to complete an anonymous post-review questionnaire about the AI review’s usefulness, trustworthiness, agreement with human reviews, practical value for revision, perceived inaccuracies, and consent. The final dataset included 56 analyzable responses from authors of 40 papers; closed-ended items were summarized using descriptive statistics, and open-ended responses were analyzed using inductive thematic analysis. Most respondents (83.9%) considered the AI-based review useful, and 80.4% reported that it identified issues not mentioned by human reviewers. This perceived added value translated into action: 82.1% reported using at least some AI feedback in their camera-ready version. However, the authors did not treat the AI review as equivalent to a human review. They generally trusted it less than the human reviews and found human feedback clearer, even though 25.0% described at least some human reviews as not very useful. Reported problems with the AI review were usually limited: 51.8% reported minor inaccuracies, while 16.1% reported clearly incorrect, misleading, or irrelevant comments. Support for future use was strongest when AI was framed as a supervised or author-controlled tool: 96.4% said they would use AI as an internal review tool before future submissions, 89.3% preferred advance notice that AI would be used in review, and 76.8% favored explicit consent before use.
[AI-230] PRISMat: Policy-Driven Permutation-Invariant Autoregressive Material Generation NEURIPS2026
链接: https://arxiv.org/abs/2605.16612
作者: Claire Schlesinger,Circe Hsu,Peter Schindler,Robin Walters
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
备注: 10 pages, 8 figures, Under Review at Neurips 2026
Abstract:Rapid identification of candidate materials with target properties has become a key task in materials science. Machine learning has emerged as an alternative to physics-based simulation, offering a faster and cheaper way to filter materials based on their stability and other target properties, reducing the number of candidates that reach the costly synthesis stage. Recently, Large Language Models (LLMs) have been applied to this role, but these models are parameter-heavy and computationally expensive both during training and at inference time, making them unsuitable for high-throughput tasks. This inefficiency stems from both the large over-parameterization of language models and the difficulty of framing material generation as a sequence learning problem. In this paper, we present PRISMat, a cost-effective, permutation-invariant model, which addresses these limitations. We show that PRISMat, despite taking less time for inference, is able to outperform LLMs in generating crystal slabs conditioned on critical materials’ surface properties. In targeted material discovery, we achieve mean absolute errors of 0.188 eV/A ^2 and 2.79 eV for cleavage energy and work function tasks, respectively, reducing the error of the next best model by 4 \times .
[AI-231] How Few-Shot Examples Add Up: A Causal Decomposition of Function Vectors in In-Context Learning ICML2026
链接: https://arxiv.org/abs/2605.16591
作者: Entang Wang,Yiwei Wang,Aleksandra Bakalova,Michael Hahn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026. 70 pages, 65 figures
Abstract:In-context learning (ICL) excels at new tasks from minimal examples, yet we still lack a mechanistic explanation of how few-shot prompts shape a model’s function vector (FV)–a causal activation direction that drives task behavior on the ICL query. Across tasks and models, an n -shot FV is well-approximated by a linear combination of example-level sub-FVs, suggesting additive and composable contributions from individual demonstrations. Beyond additivity, we show that models contextualize individual examples’ representations based on prior examples to adaptively reweight which demonstrations dominate the FV: attention shifts toward examples that are more informative and less ambiguous under the context. Finally, a causal decomposition separates Query-Key routing from Value updates, finding that contextualization’s most consistent contributions to FV quality arise from Query-Key alignment–particularly in ambiguous settings–while Value-mediated effects are more heterogeneous. Together, these results unify additive superposition with context-dependent attention reweighting into a mechanistic, testable account of how few-shot prompts implement tasks.
[AI-232] Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators
链接: https://arxiv.org/abs/2605.16575
作者: Romain Cosentino,Sarath Shekkizhar,Adam Earle,Silvio Savarese
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Negotiation requires more than inferring what the other side wants: it requires using that information to make advantageous offers and counteroffers over multiple turns. We study whether large language model (LLM) agents do this in a controlled multi-attribute bargaining environment. We find that current LLM agents can model a counterparty’s preferences, but do not reliably turn that knowledge into strategic bargaining. When given negotiating partner preference information, agents model it accurately and early in their reasoning traces, yet this does not reliably improve outcomes for the informed side. Turn-level analyses show why: agents often respond to what they believe the counterparty values, but do not consistently pair those moves with gains on their own high-value attributes. Sellers are more accommodating overall, and in asymmetric-information conditions, the informed side often makes the more weakly compensated concessions. Because agents fail to leverage this underlying utility structure for strategic advantage, their final agreements are heavily dictated by surface-level opening anchors rather than actual utility weights. Finally, requiring agents to explicitly state concession-for-reciprocity trades before making an offer makes individual turns look more strategic, but ultimately fails to improve the efficiency of the final agreements.
[AI-233] Wavelet Flow Matching for Multi-Scale Physics Emulation
链接: https://arxiv.org/abs/2605.16573
作者: Gabriele Accarino,Juan Nathaniel,Carla Roesch,Pierre Gentine,Sara Shamekh,Duncan Watson-Parris,Viviana Acquaviva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:
Abstract:Accurate emulation of multi-scale physical systems governed by PDEs demands models that remain stable over long autoregressive rollouts while preserving fine-scale structures. Deterministic emulators produce overly-smoothed predictions, while generative approaches better capture details but are costly. Latent-space generative models have emerged as a compromise but with the additional cost of separately pre-trained autoencoders. We propose Wavelet Flow Matching (WFM), a novel generative emulator that overcomes current trade-offs between cost and skill by performing optimal-transport directly in the multi-scale wavelet space. Rather than learning a latent compression, WFM leverages the hierarchical structure of a U-Net to jointly predict transport velocities of a prescribed wavelet representation. On three challenging systems of chaotic fluid dynamics, WFM achieves superior long-horizon stability, accuracy and spectral coherence compared to state-of-the-art models. Our results clearly position the wavelet space as an effective training-free representation for generative emulation of complex physical dynamics.
[AI-234] Scalable Uncertainty Reasoning in Knowledge Graphs ESWC2026
链接: https://arxiv.org/abs/2605.16568
作者: Jingcheng Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages. Preprint of a paper accepted at the ESWC 2026 PhD Symposium
Abstract:Knowledge Graphs are pivotal for semantic data integration. The real-world data they model is often inherently uncertain. Within knowledge graphs, uncertainty manifests in three distinct levels: imprecise attribute values, probabilistic triple existence, and incomplete schema knowledge. However, current Semantic Web standards lack native support for reasoning over such uncertainty, and naïve extensions often incur computational intractability. In this thesis, I aim to develop a modular framework that addresses each level through tailored techniques: (1) defining probabilistic literals and a corresponding query algebra for continuous attributes; (2) a compilation-based framework transforming SPARQL provenance into tractable probabilistic circuits for uncertain triples; and (3) topology-aware geometric embeddings for statistical schema reasoning. The central hypothesis is that specialized reasoning mechanisms, namely algebraic, logical, and geometric approaches, can reconcile semantic precision with computational tractability.
[AI-235] Automatic Unsupervised Ensemble Outlier Model Selection–Extended Version ICML2026
链接: https://arxiv.org/abs/2605.16567
作者: Hong-Phuc Phan,Tuan-Anh Vu,Tung Kieu,Son Ha Xuan,Bin Yang,Christian S. Jensen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 25 pages. An extended version of “Automatic Unsupervised Ensemble Outlier Model Selection” accepted at ICML 2026
Abstract:Unsupervised outlier detection is attractive because it eliminates the need for labeled data. Moreover, forming multi-model ensembles can improve detection robustness. However, composing an ensemble without labeled data is challenging. Naively composed ensembles can suffer from ensemble saturation, where redundant or unreliable detection models degrade performance and incur unnecessary computation. We propose MetaEns, an automatic unsupervised framework for selecting ensembles of outlier detection models. Using labeled meta-datasets, MetaEns learns a model that predicts marginal ensemble gains, estimating the expected improvement from adding a candidate model to a partially constructed ensemble. At test time, this learned signal is combined with a submodular-inspired proxy objective that enforces diminishing returns through diversity-aware discounting and family-level risk regularization, thereby enabling greedy sequential selection with adaptive early stopping. As a result, MetaEns constructs compact, high-quality ensembles without access to ground-truth labels. Experiments on 39 real-world datasets show that MetaEns consistently outperforms state-of-the-art unsupervised selectors and ensemble baselines, achieving higher average precision while using fewer models.
[AI-236] Skim: Speculative Execution for Fast and Efficient Web Agents
链接: https://arxiv.org/abs/2605.16565
作者: Mike Wong,Kevin Hsieh,Suman Nath,Ravi Netravali
机构: 未知
类目: Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
备注: 14 pages, 21 figures
Abstract:Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. Today’s web-agent expense is not intrinsic to the tasks but a property of how agents are composed: frontier-model inference, browser rendering, and ReAct-style planning are applied to every step of every task regardless of complexity. Skim’s key observation is that websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type, so most queries can bypass these heavyweight components entirely. An offline profiler captures these patterns once per site. At runtime, Skim matches each query to a template, synthesizes the destination URL, and extracts the answer with a small model. A lightweight verifier gates each fast-path output against the query and schema; rare misspeculations cascade to the full agent, warm-started by the fast path’s final URL to preserve upstream trajectory progress. Across standard web-agent benchmarks paired with three backboneagents (WebVoyager, AgentOccam, BrowserUse), Skim reduces median per-task cost by 1.9x and latency by 33.4% with no accuracy loss.
[AI-237] From Prompts to Protocols: An AI Agent for Laboratory Automation
链接: https://arxiv.org/abs/2605.16552
作者: Angelos Angelopoulos,James F. Cahoon,Ron Alterovitz
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Automating science laboratories enables faster, safer, more accurate, and more reproducible execution of protocols, accelerating the discovery and testing of new materials, drugs, and more. However, setting up and running autonomous labs requires coordinating numerous instruments and robots, forcing scientists to write code, manage configuration files, and navigate complex software infrastructure. We present an AI agent architecture that integrates large language models with laboratory orchestration, enabling scientists to interactively create and monitor automated lab protocols using natural language. Integrated into the Experiment Orchestration System (EOS), the AI agent operates under an agentic loop with automated validation and error correction, and supports the complete experimental lifecycle: creating protocols, running and monitoring both protocols and closed-loop optimization campaigns, and analyzing results. A visual graph editor renders protocols as interactive node-based diagrams synchronized with the AI agent’s protocol representation, enabling seamless alternation between AI-assisted and manual protocol construction. Evaluated on three simulated automated labs spanning chemistry, biology, and materials science, the AI agent achieves a 97% first-attempt protocol generation success rate and an order of magnitude reduction in required interface actions.
[AI-238] Inventorship in AI-Assisted Inventions: Designing an Experiment to Shape Case Law
链接: https://arxiv.org/abs/2605.16528
作者: Yevhenii Shchetynin,Duygu Usta,Bryan Khan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The latest improvements in artificial intelligence (AI) raise new challenges for intellectual property laws, particularly concerning the inventorship issue in AI-assisted inventions - that is, those in which AI is used in the inventive process. While most jurisdictions allow only a natural person to be considered the inventor, the question of how to deal with AI-assisted inventions remains relevant. Namely, what is the nature and contribution of AI tools in an AI-assisted invention that would prevent a human from being recognized as its inventor? The main challenge in addressing this question is the lack of case law on the issue. It is reasonable to assume that with the development of AI and the growing interest in its use in the inventive process, new cases will naturally arise, which in turn will harmonize and address the inventorship issue in AI-assisted inventions to some extent. However, this process will take significant time and may not keep pace with the rapid development of AI, nor fully address the new problems that arise alongside AI advancements. This research proposes the conditions of an experiment to create relevant case law. This experiment could be initiated by society, involving stakeholders specializing in AI. The article also proposes a methodology for conducting the experiment and selecting cases that best reflect the current state of AI use in the inventive process. Conducting such an approach will help identify the most effective methods for measuring human contribution to AI-assisted inventions when determining inventorship.
[AI-239] Hypergraph Pattern Machine: Compositional Tokenization for Higher-Order Interactions
链接: https://arxiv.org/abs/2605.16527
作者: Kyrie Zhao,Zehong Wang,Tianyi Ma,Fang Wu,Xiangru Tang,Pietro Lio,Sheng Wang,Yanfang Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Hypergraphs model higher-order relations that drive real-world decisions, from drug prescriptions to recommendations. A central structural signal in such data, beyond what pairwise relations can express, is interaction compositionality: whether a higher-order relation is compositional, emergent, or inhibitory with respect to its observed or unobserved sets. In polypharmacy, the regime decides whether a drug should be dropped, kept, or excluded: a compositional drug triple can be safely simplified, an emergent triple requires all drugs jointly, and an inhibitory triple flags a drug that disrupts an existing interaction. However, existing hypergraph learning methods, which merely propagate messages over observed hyperedges, leave this compositional signal unmodeled, allowing dangerous drug combinations to slip through and be misclassified. To this end, we propose the Hypergraph Pattern Machine (HGPM), shifting the paradigm from message passing to learning the compositional pattern of subsets. It tokenizes compositional subsets, organizes them in an inclusion DAG, and trains an inclusion-aware Transformer under masked reconstruction. On ten hypergraph benchmarks, HGPM matches or exceeds state-of-the-art methods. Notably, in a real adverse-event prediction case, HGPM correctly identifies the drug addition that inhibits the side effect among feature-identical candidates, a discrimination existing methods cannot make. The code and data are in this https URL.
[AI-240] No Plan Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task
链接: https://arxiv.org/abs/2605.16514
作者: Michael Migacev,Vito Mengers,Antonia Köngeter,Oliver Brock
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding why some sequential planning problems are harder than others requires models that go beyond average performance. They should capture the specific pattern of which problems are hard, and ideally fail in the same way people do when planning capacity is reduced. We apply AICON, a reactive gradient-descent framework developed for robotic manipulation, to the Tower of London test, a cognitive test used to assess planning in Parkinson’s disease, mild cognitive impairment, and stroke. Without any lookahead planning or knowledge of human cognition, AICON reproduces the fine-grained human difficulty ordering across 24 problems better than structural task parameters and generalizes to held-out problems in a leave-two-out evaluation. Crucially, AICON outperforms a planning baseline for groups with reduced planning capacity while the planning baseline better captures healthy controls. This dissociation was predicted by the original AICON paper, which noted that the model’s failure modes resemble those of Parkinson’s patients who struggle with goal hierarchies but not move counts. This suggests that as planning capacity is reduced, human behavior shifts toward the reactive mode AICON models. The finding extends a broader pattern: AICON, originally built for robotics, now captures aspects of biological behavior across perception, eye movements, and sequential planning, suggesting its core abstraction reflects something real about how biological systems are organized.
[AI-241] Strategic Over-Parameterization for Generalizable Low-Rank Adaptation
链接: https://arxiv.org/abs/2605.16470
作者: Jing Gao,Zhong-Yi Lu,Pan Zhang,Ze-Feng Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Adapting large language models (LLMs) to downstream tasks via full fine-tuning is increasingly impractical due to its computational and memory demands. Parameter-efficient fine-tuning (PEFT) approaches such as Low-Rank Adaptation (LoRA) mitigate this by confining updates to a compact set of trainable parameters, but this aggressive reduction often sacrifices generalization, especially under transfer across heterogeneous tasks and domains. We revisit the tension between parameter efficiency and adaptation capacity, and ask whether the two are truly at odds. We answer in the negative by introducing LoRA-Over, a framework grounded in a simple principle: enrich the optimization landscape during training, then collapse the enrichment at inference. LoRA-Over injects auxiliary parameters into the low-rank adapters during training to broaden the effective hypothesis space, and through a decomposition-based reformulation folds them back into a standard low-rank structure with negligible reconstruction error, keeping inference cost identical to vanilla LoRA. Since not all weight matrices benefit equally from added capacity, we further propose two scheduling strategies, one statically predefined and one dynamically determined at runtime, that direct extra capacity where most needed. We evaluate LoRA-Over on language understanding (GLUE, T5-Base), dialogue (MT-Bench), arithmetic reasoning (GSM8K), and code generation (HumanEval), using LLaMA 2-7B and LLaMA 3.1-8B. Across all benchmarks and scales, LoRA-Over consistently outperforms vanilla LoRA, showing that principled over-parameterization designed to vanish at inference is an effective lever for improving PEFT generalization. Code will be released upon acceptance.
[AI-242] Asking Back: Interaction-Layer Antidistillation Watermarks
链接: https://arxiv.org/abs/2605.16462
作者: Guang Yang,Amir Ghasemian,Fengchen Liu,Zhong Wang,Ninareh Mehrabi,Homa Hosseinmardi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 34 pages, 17 figures
Abstract:Detecting unauthorized knowledge distillation from a deployed LLM API is hard because the defender controls neither the attacker’s training pipeline nor the next-token logits. Existing defenses operate on the teacher’s output tokens – biasing the next-token distribution (green-list watermarks, cryptographic schemes, antidistillation sampling) or rewriting outputs after generation. Recent work shows a paraphrasing attacker can strip these signals without losing the underlying knowledge. We propose interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher’s interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker – an explicit follow-up question, a low-frequency variant, or a declarative restatement. An oblivious distiller inherits the behavior, and the defender audits via black-box queries with a human-validated LLM-as-judge (Cohen’s kappa = 0.84/0.78 on strong/style rubrics). Across 63 LoRA-distilled students under a Llama-3.3-70B-Instruct teacher (35,343 judged samples), behavioral watermarks transfer at 88.9% (Gemma) / 80.9% (OLMo) / 45.2% (Qwen) relative fidelity (H1, H2). Under non-adaptive DIPPER paraphrasing, robustness decomposes into a teacher-self ceiling (about 66.4%) and student-relative retention of 21-112%, with OLMo preserving the watermark above the teacher itself (H3, F-Amp). Low-density (about 20%) explicit and implicit declarative variants transfer above per-family baseline (H4, F-Style). An N=20 in-lab study (pre-registered Latin-square) shows all marker variants within 0.22 Likert step of baseline; TOST, Friedman, and Bonferroni-Wilcoxon support H5. The interaction layer is a viable design locus for antidistillation watermarking, complementary to token-, model-, and reasoning-trace-layer defenses. Comments: 34 pages, 17 figures Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16462 [cs.CR] (or arXiv:2605.16462v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.16462 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-243] Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign
链接: https://arxiv.org/abs/2605.16452
作者: Jiahui Li,Yida Zhang,Zixuan Zeng,Jiayu Chen,Yingjian Song,Yin Xiao,Nishan Dong,Junjie Lu,Younghoon Kwon,Xiang Zhang,Jin Lu,Wenzhan Song,Fei Dou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a “peak-representation” technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model’s self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities-ECG, PPG, BCG, and BSG-spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.
[AI-244] Physics-Guided Geometric Diffusion for Macro Placement Generation IJCAI2026
链接: https://arxiv.org/abs/2605.16451
作者: Jongho Yoon,Jinsung Jeon,Seokhyeong Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IJCAI 2026. 9 pages, 5 figures
Abstract:Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data-driven placement methods have demonstrated significant potential, yet they often struggle to handle sequential dependencies and to balance topological connectivity with physical constraints. To bridge this gap, we propose MacroDiff+, a physics-guided geometric diffusion framework. Specifically, we design a dual-domain denoising architecture that couples topological connectivity encoded by heterogeneous GNNs with global geometric context modeled by a Transformer. Furthermore, we introduce Physics-Guided Sampling, an inference strategy that actively steers the generation using explicit gradients to ensure both statistical plausibility and physical validity. On the ISPD2005 MMS benchmarks, MacroDiff+ outperforms state-of-the-art baselines with a 6.1-6.2% reduction in wirelength. Notably, it exhibits superior stability and scalability on large-scale designs where prior methods fail to converge. The source code is available at this https URL.
[AI-245] PESD-TSF: A Period-Aware and Explicit Structured Decomposition Framework for Long-Term Time Series Forecasting
链接: https://arxiv.org/abs/2605.16449
作者: Hua Wang(1),Xianhao Jiao(1),Fan Zhang(2) ((1) School of Computer and Artificial Intelligence, Ludong University, Yantai, Shandong 264025, China, (2) School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong 264005, China)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 9 figures, 13 tables
Abstract:Deep forecasting models often suffer from attenuated periodic perception and entangled trend-noise representations as network depth increases. Moreover, the widely adopted channel-independent paradigm, while improving training stability, disrupts intrinsic dynamic coordination among variables, hindering the modeling of cross-variable consistency in multivariate time series. To address these issues, we propose PESD-TSF, a physics-inspired structured decomposition framework for long-term time series forecasting that jointly emphasizes interpretability and predictive accuracy. PESD-TSF introduces three key designs. First, a Multiplicative Periodic Gating mechanism incorporates continuous-time priors to dynamically modulate signal amplitudes, preserving periodic structures across deep layers. Second, a multi-scale structured encoder integrates detrended attention with hierarchical sampling to explicitly decouple long-term trends from high-frequency variations while retaining fine-grained temporal semantics. Third, to recover disrupted inter-variable dependencies, we propose Cross-Scale Collaborative Attention (CSCA) together with an RLC regularization scheme, which reconstructs global inter-variable topology in deep feature spaces and enforces physically consistent collaboration through orthogonality and consistency constraints. Extensive experiments on benchmark datasets from multiple domains demonstrate that PESD-TSF consistently achieves state-of-the-art performance, with particularly strong gains on multivariate forecasting tasks involving complex inter-variable coupling, highlighting its superior structural modeling capability and generalization.
[AI-246] Nested Spatio-Temporal Time Series Forecasting
链接: https://arxiv.org/abs/2605.16447
作者: Yinghao Ai,Yukai Zhou,Ruoxi Jiang,Junyi An,Chao Qu,Zhijian Zhou,Shiyu Wang,Fenglei Cao,Zenglin Xu,Furao Shen,Yuan Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Spatiotemporal forecasting is critical for real-world applications like traffic management, yet capturing reliable interactions remains challenging under noisy and non-stationary conditions. Existing methods primarily rely on historical spatial priors, often failing to account for evolving temporal correlations and suffering from systematic errors. In this work, we propose a nested forecasting framework that couples future macro-level regional trends with micro-level historical observations, enabling top-down guidance from abstract future representations for fine-grained forecasting. Specifically, we employ a spectral clustering-based approach to construct semantically coherent regions, providing both theoretical and empirical evidence that this representation effectively filters systematic noise while preserving essential trends. Building on this, we develop a progressive coarse-to-fine predictor to integrate these representative features into the inference process. This enables the model to leverage trend predictions to anticipate dynamic anomalies, such as periodic offsets, in advance. Furthermore, extensive experiments on multiple high-dimensional datasets demonstrate that our method consistently outperforms state-of-the-art baselines, validating the effectiveness of future macro-guided nested forecasting.
[AI-247] Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating
链接: https://arxiv.org/abs/2605.16446
作者: Hangchun Liang,Changchun Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) require statistical fairness guarantees. We identify a structural conflict in tabular fair SSL through a diagnostic stress test: under confidence-gated pseudo-labeling, moment-matching fairness regularizers can trigger two failure modes – Masking Collapse (fairness erodes confidence, starving pseudo-labels) and Trivial Saturation (drift to constant predictors). We propose Online Primal-Dual Allocation (OPDA), an online controller that schedules fairness and entropy-based stability penalties using violation, risk, and pseudo-label health signals, avoiding per-dataset selection of a fixed fairness weight within this diagnostic regime. On the evaluated tabular benchmarks (Adult, ACSIncome, COMPAS), OPDA mitigates the degenerate regimes observed under static weighting and simple single-signal adaptive baselines. On Adult and COMPAS, it yields non-degenerate operating points competitive with the empirical static- \lambda frontier; on ACSIncome, it preserves utility with a wider fairness-utility spread. Relative to OPDA-lite, the full controller mainly shifts the operating point toward higher utility on ACSIncome, while Adult highlights the fairness-utility trade-off between the two variants. These results position OPDA as a calibration-free controller for non-degenerate operating points in tabular fair SSL without per-dataset tuning.
[AI-248] Membership Inference Attacks on Discrete Diffusion Language Models
链接: https://arxiv.org/abs/2605.16445
作者: Shailesh Kasivelrajan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Masked Diffusion Language Models MDLMs replace autoregressive generation with iterative demasking and their privacy properties are largely unstudied. We study membership inference attacks MIA on fine tuned MDLMs and show they are significantly more vulnerable than current grey box baselines suggest. We extract a 46 dimensional feature vector from the models reconstruction loss at four masking ratios and train XGBoost and MLP classifiers on top. On the MIMIR benchmark across six text domains XGBoost achieves mean AUC 0.878 peaking at 0.930 on Pile CC and beats the SAMA grey box baseline by 0.062 AUC on average. A leave one signal out ablation shows that the ELBO trajectory alone drives most of this with a mean drop of 0.130 when removed while attention features add almost nothing below 0.003. We also design a shadow model transfer attack where K equals 3 surrogate MDLMs trained on data from unrelated domains generate classifier labels with no access to the target domain. This achieves 0.858 mean AUC within 0.020 of the white box oracle and establishes shadow model transfer as a practical and near equally effective attack path.
[AI-249] wo-Valued Symmetric Circulant Matrices: Applications in Deep Learning
链接: https://arxiv.org/abs/2605.16443
作者: Jayakrishna Amathi,Venkata Prasanth Yanambaka,Saraju P. Mohanty,Elias Kougianos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the success of deep neural networks in vision, medical diagnosis, and IoT scenarios, their deployment on resource-limited platforms poses serious challenges due to their high storage requirements, computational complexity, and large footprint. In particular, fully connected layers require a large number of weights, making it difficult for edge devices to accommodate them. To overcome these challenges associated with limited platforms, this paper proposes the Two-Valued Symmetric Circulant Matrix (TVSCM), a very sparse architecture that employs just two weights per layer to keep it circulant and symmetric. The extreme form of structured sparse architecture provides negligible storage costs compared to traditional full-weight storage. Instead of hardware and additional stages of other traditional sparse learning techniques, such as low-rank approximation and pruning approaches, this architecture provides an extreme form of sparsity, achieving very minimal storage requirements. The simulation study demonstrates more than 80 \times reduction in model parameters, reducing parameters from 623,290 to 7,852 on MNIST and from 24,709 to 942 on the MIT-BIH arrhythmia dataset, while maintaining comparable accuracy from 97.6% to 93.5% on MNIST and from 97.6% to 93.1% on MIT-BIH. Due to its minimal architectural requirements and very low power consumption, this architecture would be ideal for edge computing platforms, tiny-ML platforms, IoMT systems, and battery-powered systems.
[AI-250] Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction
链接: https://arxiv.org/abs/2605.16442
作者: Ganeshaaraj Gnanavel,Tharindu Fernando,Sridha Sridharan,Clinton Fookes
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Long-horizon vessel trajectory forecasting under real ocean conditions is critical for collision avoidance, traffic management, and route planning. However, achieving accurate predictions is challenging due to long-range temporal dependencies and dynamic environmental factors such as currents, wind, and waves. To address these issues, we propose a hierarchical two-stage framework that combines a coarse long-term predictor with a grid-aware short-term predictor through a hierarchical fusion mechanism. The short-term branch leverages a Spatio-Temporal Graph Transformer on discretized maritime cells to capture localized dynamics, while the long-term branch encodes overarching navigational intent. An integrated environmental module incorporates oceanographic parameters, including surface currents, wind vectors, and significant wave height, using cross-modal attention and feature-wise modulation for adaptive response to varying sea conditions. Additionally, a learnable Savitzky-Golay smoothing layer enhances temporal coherence in fused trajectories. We evaluate our approach on Australian Craft Tracking System (CTS) data from the North West region, aligned with Copernicus Marine Service products, using a 3-hour input and a 10-hour prediction horizon. Experimental results show that our framework outperforms the state-of-the-art by 25% in Average Displacement Error (ADE) and 17% in Final Displacement Error (FDE). Ablation studies further validate the contribution of each component.
[AI-251] DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition
链接: https://arxiv.org/abs/2605.16441
作者: Jiahui Li,Ruili Fang,Zishuai Liu,WenZhan Song,Jin Lu,Fei Dou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Beat-level Electrocardiography (ECG) arrhythmia detection aims to assign an arrhythmia class to each beat in a recording, yet many existing systems treat beats as isolated local instances. This is limiting because beat labels often depend on multi-beat rhythm context, including timing, compensatory pauses, and beat-to-beat morphological consistency. We present DeepArrhythmia, a tool-grounded multimodal framework for segment-contextualized beat-level ECG arrhythmia classification. Given a multi-beat ECG segment, DeepArrhythmia combines the raw ECG signal and a rendered waveform image, localizes R peaks to identify beat instances, and produces structured beat-level predictions. The framework decouples physiological measurement from evidence integration using specialized tools for beat localization, numerical rhythm–morphology extraction, and morphology-focused textual analysis. DeepArrhythmia uses segment-level confidence to route between minimal and rich evidence states, since richer physiological evidence is not uniformly useful. This agentic design integrates rhythm context, explicit physiological grounding, and selective evidence acquisition for decision making.
[AI-252] Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers
链接: https://arxiv.org/abs/2605.16438
作者: Andras Ferenczi(1),Sutapa Samanta(1),Dagen Wang(1),Jason Qizhe Qin(2) ((1) American Express Co., (2) Columbia University)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, 8 tables
Abstract:Federated Learning (FL) trains a global model across decentralized clients while preserving data privacy, but at scale it is vulnerable to malicious updates. Byzantine-resilient aggregation methods such as MultiKrum score gradients against their nearest neighbors and can miss malicious updates that preserve the statistical properties of honest ones. We propose a quantum annealing approach that reformulates client selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem, encoding pairwise distances into a cost function solved by quantum annealers (QA). Unlike MultiKrum’s greedy per-client scoring, the QUBO formulation jointly optimizes over all subsets to find the mutually closest group of m clients. At small scale (15 clients), QUBO outperforms MultiKrum on the most challenging Byzantine attacks: e.g., Advanced LIE is detected with 95.11% accuracy versus 81.33% on MNIST and 97.78% versus 75.56% on CIFAR-10. QUBO fares poorly on simpler attacks where MultiKrum excels, so the two methods are complementary. QUBO quality also degrades as the number of clients grows. To address this, we introduce a MultiSignal ensemble that uses a dual-feature routing gate based on Euclidean and cosine Krum score gaps to classify attacks into four regimes and routes evasion attacks to a suspicion-penalized QUBO with agreement voting. At 100 clients on MNIST, MultiSignal achieves 95.3% average detection accuracy versus 91.8% for classical MultiKrum, with the largest gains on Sparse Lie (72.0% to 95.2%, +23.2 points) and Advanced Lie (80.4% to 85.2%, +4.8 points). These results show that QUBO-based quantum annealing with MultiSignal is a principled and scalable defense against the most challenging Byzantine strategies in federated learning.
[AI-253] he End of Trust: How Agent ic AI Breaks Security Assumptions
链接: https://arxiv.org/abs/2605.16436
作者: Osama Zafar,Alexander Nemecek,Erman Ayday
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:For decades, the security of digital interaction has rested on an unacknowledged economic constraint. Attackers faced a tradeoff between the fidelity of a deception and the scale at which it could be deployed. Convincing impersonation required sustained human effort and was confined to a narrow set of high-value targets, while mass-market attacks sacrificed plausibility for reach. Detection systems, verification mechanisms, and user awareness training have all been implicitly calibrated to the artifacts of cheap deception that this tradeoff produced. Agentic AI collapses the tradeoff, allowing high-fidelity, individually tailored deception to be produced at mass-market scale. We argue that this shift exhausts a security paradigm rather than merely intensifying the threat landscape. We introduce the Infinite Impostor, an attack model in which an autonomous agent interposes itself between two parties who already trust each other, hijacking an existing relationship rather than building a new one from scratch. Detection-oriented defenses share an assumption that generative progress is eliminating, that synthetic outputs are distinguishable from authentic ones. We propose a suspect-by-default paradigm that shifts security from authenticating actors to evaluating actions, and examine the governance tensions that arise when platforms become the regulatory substrate of digital interaction.
[AI-254] GPU-Accelerated Deep Learning for Heatwave Prediction and Urban Heat Risk Assessment
链接: https://arxiv.org/abs/2605.16435
作者: Adis Alihodžić
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Heatwaves are an important problem in cities, and climate change makes this problem more difficult. In this paper, we present a GPU-based deep learning framework for next-day prediction of urban thermal conditions and for heat risk assessment. The study was carried out in Sarajevo by using MODIS land surface temperature data and Open-Meteo forecast data. We tested several models, including convolutional models and spatiotemporal models. Among them, ConvLSTM with a mixed loss function gave the best results. The obtained values were MAE = 0.2293, RMSE = 0.3089, and R2 = 0.8877. The experiments also showed that results can be improved by using longer temporal series and additional meteorological variables. Since the framework was implemented on a GPU and trained with mixed precision, the execution time was reduced. Based on the predicted temperature fields, it was also possible to combine hazard information with exposure and vulnerability data in order to generate city heat risk maps. The proposed framework can be used as a practical basis for city heat analysis.
[AI-255] Edge-AI-Driven Learning-to-Rank for Decentralized Task Allocation in Circular Smart Manufacturing
链接: https://arxiv.org/abs/2605.16433
作者: Mohammadhossein Ghahramani,Yan Qiao,Mengchu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Task allocation in smart manufacturing systems needs to operate under decentralized decision-making, dynamic workloads, and shared resource constraints. In circular manufacturing settings, these challenges are further intensified by the need to balance operational efficiency with resource and energy sustainability. While learning-based approaches have been explored, many focus on predicting absolute performance metrics that do not necessarily translate into improved allocation outcomes, since decentralized assignment is governed by the relative ordering of candidate machines. This work proposes an Edge-AI-driven decentralized task allocation framework based on ranking-aware negotiation, where lightweight decision intelligence is embedded at the machine level to enable low-latency coordination without centralized control. The framework is developed progressively: a resource-aware heuristic first establishes the decentralized bidding structure, an Edge-AI-based regression model then provides learned local bid approximation, and a ranking-aware formulation finally reshapes the learning objective to align with the ordering-based nature of winner selection. Each machine evaluates incoming tasks using local information, including processing capability, queue state, and resource contention. The framework is evaluated via discrete-event simulation under high-load and tight-deadline scenarios using delay, deadline violations, throughput, and energy consumption. Results show improved delay and deadline adherence under high load, and enhanced energy efficiency under tighter constraints, leading to more resource-efficient operation aligned with circular manufacturing objectives. These findings demonstrate that aligning learning objectives with decentralized decision structures is critical for effective negotiation-driven task allocation.
[AI-256] A Theory of Training Profit-Optimal LLM s
链接: https://arxiv.org/abs/2605.16430
作者: Sophie Hao,William Merrill
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency E (FLOPs/\ ) at a near-linear rate; total training cost then scales sub-quadratically in E . Data efficiency improvements incentivize larger models and training expenditure. When we are limited to D data, profit-optimal training expenditure scales as D^2/E , i.e, increase with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.
[AI-257] QuantFPFlow: Quantum Amplitude Estimation for Fokker–Planck Policy Optimisation in Continuous Reinforcement Learning
链接: https://arxiv.org/abs/2605.16429
作者: Abraham Itzhak Weinberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce \textbfQuantFPFlow, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker–Planck~(FP) formulation of stochastic policy optimisation. Classical continuous-space RL agents must estimate the FP partition function Z = \int e^-V(\mathbfx)/D,d\mathbfx at cost \calO(1/\varepsilon^2) ; QuantFPFlow replaces this with a Grover-amplified amplitude estimator achieving \calO(1/\varepsilon) – a provable quadratic speedup. While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the \calO(1/\varepsilon) algorithmic structure. The estimated stationary distribution \rhostar drives a theoretically grounded exploration bonus \Raug = \Renv + \alpha\log(1/\rhostar(s)) . This bonus steers the agent toward globally optimal regions of multimodal reward landscapes while simultaneously constraining policy variance through FP diffusion matching. On a continuous-control task specifically designed to expose local-optima failure, QuantFPFlow achieves mean reward 1,295.7 \pm 423.2 versus 1,284.0 \pm 474.0 for Soft Actor-Critic~(SAC), while discovering the global optimum \textbf10.4,% more frequently (33.9,% vs.\ 30.7,%). Policy entropy remains near H(\pi)\approx 6.5 ,nats throughout training, whereas SAC collapses to 1.5 ,nats, confirming that FP diffusion matching actively prevents premature convergence. Dimensionality experiments further show computational scaling of \calO(d^0.35) for QuantFPFlow versus \calO(d^0.76) for classical FP estimation. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16429 [cs.LG] (or arXiv:2605.16429v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16429 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-258] Orthologic for SAT Solving
链接: https://arxiv.org/abs/2605.16421
作者: Vladislas de Haldat,Simon Guilloud,Viktor Kunčak
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a new algorithm for deciding formula entailment in orthologic (a sound approximation of classical logic) that avoids the costly preprocessing phase of prior implementations while retaining the same \mathcalO(n^2(1+|A|)) worst-case complexity. We then introduce a family of synthetic SAT benchmarks based on the observation that, for any formula \phi , the equivalence \phi \leftrightarrow \mathrmNF_\mathrmOL(\phi) is a tautology whose Tseitin encoding yields unsatisfiable instances that are hard for state-of-the-art SAT solvers yet have short orthologic proofs. Applied to EPFL arithmetic circuits, our algorithm solves these instances efficiently while Kissat times out on a significant fraction. Finally, we show that using orthologic normalization as a preprocessing step can improve SAT solving time on some hard problems.
[AI-259] Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery
链接: https://arxiv.org/abs/2605.16398
作者: Marios Papamichalis,Regina Ruane
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Contact-rich robot dynamics are hybrid: a single observation can match several latent states and contact regimes (free, impact, stick–slip). A standard amortized filter that places no probability on a feasible contact transition will permanently lose the branch the robot actually follows. We introduce VHYDRO, a variational hybrid dynamics learner that prevents this branch loss. At each step, VHYDRO mixes the learned proposal with a feasible transition law before sampling and importance weighting, ensuring that every transition retained by the model-feasible carrier remains covered. VHYDRO jointly infers a continuous latent state and a discrete contact mode, and fits a sparse port-Hamiltonian law to each recovered regime. On top of this, three guarantees connect: support coverage stabilizes filtering, the stabilized filter concentrates the discrete contact posterior on coherent regimes, and mode-pure segments admit sparse port-Hamiltonian recovery. The recovery error separates cleanly into filtering, derivative, mode-impurity, and physics-residual parts. Three empirical findings track the same mechanism. Under heavy occlusion the support-safe filter stays usable while a non-defensive proposal collapses. On ManiSkill demonstrations and on four Sawyer/BridgeData task families the discrete state forms temporally coherent contact-regime segments that the discrete state yields a stronger joint profile across ARI, change-point F1, and segment purity than post-hoc and mode-free baselines. On hybrid systems with known equations the mode-conditioned sparse fit recovers the active physical terms; purely predictive baselines do not.
[AI-260] Haptic Rendering of Fractional-Order Viscoelasticity: Passivity and Rendering Fidelity
链接: https://arxiv.org/abs/2605.16389
作者: Gorkem Gemalmaz,Harun Tolasa,Volkan Patoglu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Under review for publication in IEEE Transactions on Robotics
Abstract:Haptic rendering of viscoelastic materials that exhibit creep and stress relaxation is crucial for many applications, such as medical training with realistic biological tissue models. Fractional-order viscoelastic models provide an effective means of describing intrinsically time-dependent dynamics with few parameters, as these models can naturally capture memory effects. In this study, we present analyses of passivity and rendering performance for fractional-order viscoelastic models under finite-memory discretization. We derive closed-form expressions to ensure the passivity of haptic rendering with a fractional-order (FO) standard linear solid (SLS) model based on Grunwald-Letnikov derivative under short-memory discretization. We also provide symbolic expressions for the effective stiffness and damping of such FO-SLS models. The resulting passivity conditions constitute a unified framework that generalizes previously reported results for integer-order Kelvin-Voigt, Maxwell, and SLS models, since these results are special cases of the newly derived condition. Furthermore, we provide experimental validations of the theoretical passivity bounds and human-subject evaluations of perceived realism of FO-SLS models. Overall, this study establishes a unified theoretical framework and experimental evaluations for FO viscoelastic rendering under short-memory discretization.
[AI-261] ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction
链接: https://arxiv.org/abs/2605.16380
作者: Jinwoong Kim,Sangjin Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages
Abstract:Clinical time-series data are difficult to model with methods designed for regular sequences because they exhibit irregular sampling, frequent missing values, and heterogeneous observation patterns across variables. Existing approaches commonly use observation masks and time-gap information, but they do not continuously capture the decaying reliability of past observations or consistently organize multi-resolution information within a coherent temporal context during aggregation. To address these limitations, we propose Reliability-aware Temporal Aggregation with Mamba (ReTAMamba), which reconstructs clinical time series as time-variable token sequences, estimates observation reliability from missingness and elapsed time, and augments interval summaries with statistical descriptors. Chronological Weaving is used to integrate short- and long-term temporal information within a coherent temporal context, and a budgeted token router is applied to constrain sequence length while preserving informative summaries. Experiments on MIMIC-IV, eICU, and PhysioNet 2012 show that ReTAMamba consistently improves AUPRC over strong baselines, with average relative gains of 7.51%, 7.80%, and 10.15%, respectively. Cohort-level and patient-level analyses on eICU further showed that the learned mean decay for more dynamic signals, such as heart rate and blood pressure, was 24.3% larger than that for relatively static signals, such as laboratory test variables. These findings suggest that effective prediction in irregular clinical time series requires modeling not only what was measured, but also when and how it was observed, including information freshness and observation timeliness.
[AI-262] An Information-Theoretic Criterion for Efficient Data Synthesis ICML2026
链接: https://arxiv.org/abs/2605.16379
作者: Hanyu Li,Zhengqi Sun,Xiaotie Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 12 pages. Camera-ready version for ICML 2026
Abstract:Synthetic data becomes crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model’s current distribution. When the loop is information-closed (relying on the model’s own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one, but causes reward hacking when a spurious pattern happens to be simpler.
[AI-263] Mixing Times of Glauber Dynamics on Masked Language Models
链接: https://arxiv.org/abs/2605.16378
作者: Suvadip Sana,Sami Wolf,Neer Mehta,Alina Shah,Aitzaz Shaikh,Janna Goodman,Lionel Levine
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures
Abstract:Masked language models (MLMs) define local conditional distributions over tokens but do not, in general, correspond to any consistent joint distribution over sequences. This raises a fundamental question: what global distributional behavior is induced when such conditionals are used iteratively for generation? We address this question by modeling iterative masked-token resampling as a Glauber dynamics Markov chain on the discrete space of token sequences. We first show that MLM conditionals are intrinsically incompatible: we introduce a rectangle test that certifies this incompatibility and empirically verify its prevalence across modern MLMs. We then provide a theoretical analysis of the induced Markov chain. Under bounded cross-token influence, we establish a high-temperature contraction result implying O(n\log n) mixing time where n is the sequence length. In contrast, we prove that under a uniform local margin condition, the chain exhibits metastability, with exponentially slow escape from semantic basins at low temperatures. Empirically, we demonstrate a phase transition in mixing behavior as a function of temperature and sequence length, consistent with the theoretical predictions. We further characterize the induced stationary behavior through semantic trajectories, identifying persistent structures such as long-lived traps and recurrent semantic basins, with political content serving as a measurable case study.
[AI-264] CheckSupport: A Local LLM -Powered Tool for Automated Manuscript Submission Checklist Selection and Completion
链接: https://arxiv.org/abs/2605.16377
作者: Satvik Tripathi,Don Enwerem,Kevin Song,Kristian Quevada,Jacinta Arnold,Tessa S. Cook
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Transparent and standardized reporting is essential for reproducible scientific research, yet adherence to reporting guidelines remains inconsistent because of the manual effort required to select and complete checklists. We present CheckSupport, an open-source, locally deployable system that uses large language models to automate the recommendation of reporting checklists and the evidence-grounded completion of checklists for scientific manuscripts. CheckSupport employs a staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis. All inference is performed locally using instruction-tuned models, preserving data privacy and enabling reproducible, auditable workflows. Evaluated on a corpus of peer-reviewed manuscripts, CheckSupport achieved 90% overall accuracy for checklist recommendations and 88% overall accuracy for item-level completion while operating on CPU-only hardware. On average, the wall-clock time per manuscript was 12.5 seconds, including the checklist recommendation and full checklist completion. These results demonstrate that large language models, when applied as structured inference components, can reduce reporting burden and support more transparent and reproducible scientific reporting across disciplines.
[AI-265] Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning
链接: https://arxiv.org/abs/2605.16374
作者: Katarzyna Filus,Kamil Faber,Roberto Corizzo,Christopher Kanan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual learning studies how models can adapt to new tasks while retaining previously acquired knowledge. Although a broad spectrum of methods has been proposed to mitigate catastrophic forgetting, the field remains predominantly performance-driven, with limited insight into what forgetting actually corresponds to within the vision model’s representation space. Prior work has primarily analyzed forgetting through task-level performance or coarse measures of representational drift, without disentangling output-level accessibility from changes in finer-grained internal structure. To this end, we propose a diagnostic framework that leverages Sparse Autoencoders (SAEs) to define a task-anchored latent feature space, enabling analysis of how task-specific information evolves at a finer granularity, where individual SAE latents are treated as concept proxies for recurring and relatively disentangled visual patterns in the model’s internal computations. Within this framework, we decompose forgetting into apparent concept deletion, recoverability, and decodability. We show that a large portion of seemingly lost concept-level information can often be recovered under linearity assumption, with concept decodability degrading as more tasks are introduced. Overall, our findings suggest that a significant part of concept-level forgetting can be attributed to changes in the representational accessibility rather than complete information erasure.
[AI-266] When Is Rank-1 Steering Cheap? Geometry Granularity and Budgeted Search
链接: https://arxiv.org/abs/2605.16362
作者: John T. Robertson,Jianing Zhu,Haris Vikalo,Zhangyang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce \emphconcept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson r=0.44 with trials-to-95%, r=-0.46 with best-found utility, both p0.001 ). We present \textitGRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from \textitwhen does rank-1 fail?'' to \textitwhen is rank-1 cheap and stable?‘’, turning activation geometry from a descriptive tool into an actionable prior for LLM control. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16362 [cs.LG] (or arXiv:2605.16362v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16362 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-267] ailedTS: Benchmark Dataset for Heavy-Tailed Time Series Prediction and Periodicity Quantification
链接: https://arxiv.org/abs/2605.16361
作者: Xinyu Chen,HanQin Cai,Lijun Ding,Jinhua Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We present TailedTS, a large-scale benchmark dataset derived from Wikipedia hourly page view observations throughout 2024, specifically designed to test time series forecasting models under heavy-tailed, zero-inflated, and non-Gaussian conditions. The dataset comprises approximately 24.69 billion data points spanning roughly 3 million unique Wikipedia pages per month, stored in high-efficiency Apache Parquet format. Wikipedia traffic follows a pronounced power-law distribution where roughly 5% of pages account for over 70% of total page views, creating a natural and rigorous testbed for model robustness against extreme volatility that are absent from or underrepresented in existing benchmarks such as M4, M5, and UCI electricity datasets. TailedTS enables several research tasks. First, we introduce a periodicity quantification framework based on sparse autoregression with sparsity and non-negativity constraints, revealing that frequently-viewed pages exhibit significantly weaker periodic structure than their less-viewed counterparts, showing direct implications for server allocation and traffic forecasting on large digital platforms. Second, we provide standardized prediction benchmarks evaluated under a suite of non-Gaussian loss functions, including \ell_1 -norm, Huber, quantile, and \ell_p -norm losses, demonstrating that standard Gaussian-based estimators degrade substantially on high-volume page categories, while robust alternatives provide consistent gains across all traffic scales. TailedTS is publicly available at this https URL.
[AI-268] ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference
链接: https://arxiv.org/abs/2605.16360
作者: Junjie Li,Jiong Lou,Jie Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost–accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families spanning targets from 7B up to 32B parameters on LongBench, SCBench, and RULER, ProxyKV matches KVZip on aggregate (recovering \sim 98.7% of its mean accuracy) while delivering up to a 3.21\times prefilling speedup on Llama-3.1-8B (dual-GPU; \sim 1.5\times shared single-GPU) and sustaining the speedup at contexts up to 170k tokens on Qwen-2.5-7B. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16360 [cs.LG] (or arXiv:2605.16360v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16360 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Junjie Li [view email] [v1] Sat, 9 May 2026 13:18:01 UTC (762 KB) Full-text links: Access Paper: View a PDF of the paper titled ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference, by Junjie Li and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-05 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-269] LEAF: A Living Benchmark for Event-Augmented Forecasting
链接: https://arxiv.org/abs/2605.16358
作者: Mingtian Tan,Mihir Parmar,Palash Goyal,Chun-Liang Li,Nanyun Peng,Thomas Hartvigsen,Jinsung Yoon,Tomas Pfister
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 tables, 6 figures, 39 pages
Abstract:Large Language Models (LLMs) are increasingly applied to forecasting. To evaluate this capability while mitigating pre-training data contamination, several living benchmarks have been proposed. However, existing benchmarks either lack the multidimensional events essential for accurate forecasting due to data scarcity, or focus on relatively closed environments. To assess the predictive capabilities of LLMs in complex, real-world scenarios, we propose LEAF, the first living benchmark for event-augmented forecasting tasks, including future event probabilities, trend and time series forecasting. LEAF utilizes a recursive retrieval agent system paired with dual-agent cross-validation to provide comprehensive and relevant auxiliary text for forecasting. Evaluating state-of-the-art proprietary and open-weight LLMs, we find that these models can leverage signals extracted from complex events to enhance predictive performance. In the stock domain, we find that LLMs achieve better performance on equities they confidently identify as more predictable. Furthermore, the events demonstrate a strong correlation with the target equities. To this end, LEAF provides a necessary, dynamically updating testbed to continuously track and drive progress in event-driven forecasting tasks.
[AI-270] PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift
链接: https://arxiv.org/abs/2605.16351
作者: Sangyoon Bae,Shinjae Yoo,Jiook Cha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures
Abstract:Scientific foundation models are expected to reuse representations under changes in dataset, acquisition protocol, and deployment domain, yet many sequence backbones treat scientific temporal structure as an unconstrained pattern to be fitted. We argue that this misses a central property of natural dynamical systems: neural and atmospheric time series are organized by interacting processes across multiple physical timescales, and failure to preserve this multiscale structure contributes to brittleness under distribution shift. We formalize this failure mode as temporal kernel mismatch, where a model fits in-distribution dynamics with an effective memory policy that is not anchored to the signal’s physical timescales, leading to representation drift and degraded transfer. We propose Physics-Informed Multi-Scale Mamba (PIMSM), a state-space architecture that maps spectrum-estimated transition points between frequency regimes (knee frequencies) to scale-specific discretization parameters and anchors them to acquisition time units. On Human Connectome Project fMRI, PIMSM improves robustness and representation stability under severe temporal-context truncation, extreme low-resource transfer, and resting-state-to-task-state generalization. Without modality-specific adaptation, the same architecture also attains the lowest variable-wise MAE across all reported horizons and variables on Weather-5K held-out-station spatial out-of-distribution forecasting. These results support temporal-scale alignment as a practical inductive bias for scientific foundation models that must preserve structure, not only fit correlations, under deployment shift.
[AI-271] Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation
链接: https://arxiv.org/abs/2605.16350
作者: Hong Chen,Pengcheng Wu,Yuanguo Lin,Peilin Zhao,Xiuze Zhou,Fan Lin,Han Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We rethink Federated Learning (FL) from a nested learning perspective, framing the core challenge as how to collaboratively learn optimization rules, not just static models, to tackle Non-IID client data. To address this, we propose Federated Nested Learning (FedNL), a novel framework that reformulates FL as a three-level nested optimization system. FedNL embeds Titans-based linear attention into FL, enabling clients to perform lightweight, zero-shot test-time adaptation by treating a delta rule as an online gradient step. Experiments on Non-IID MMLU and long-context benchmarks show that FedNL achieves competitive performance in short-context reasoning, enhances the performance of long-context retrieval and streaming Cross-Entropy, and maintains constant inference memory.
[AI-272] Flow-Direct: Feedback-Efficient and Reusable Guidance for Flow Models via Non-Parametric Guidance Field
链接: https://arxiv.org/abs/2605.16348
作者: Kim Yong Tan,Yueming Lyu,Ivor Tsang,Yew-Soon Ong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training-free guidance enables pre-trained diffusion and flow models to optimize application-specific objectives using feedback from external black-box reward functions. However, existing methods are feedback-inefficient because reward feedback is used only transiently to inform a localized gradient approximation or a discrete search decision, and is subsequently discarded. To address this limitation, we propose Flow-Direct, a framework that guides the generation process via a persistent guidance field. Theoretically, this guidance field is analytically derived from the log-density ratio between the base and reward-weighted target distributions; it transports the pre-trained distribution to the target distribution. In practice, the field is implemented as a non-parametric estimator constructed from all accumulated reward-evaluated samples. As more samples are collected during optimization, this empirical guidance field becomes increasingly accurate. This persistent formulation yields two major advantages. First, Flow-Direct is highly feedback-efficient: because every evaluated sample is used to refine the global guidance field, no reward information is wasted. Second, the framework is naturally reusable: once optimization is complete, the collected dataset defines a reusable guidance field for generating novel target samples without additional reward evaluations, and distinct guidance fields can be combined to generate samples that simultaneously satisfy multiple objectives.
[AI-273] PropGuard: Safeguarding LLM -MAS via Propagation-Aware Exploration and Remediation
链接: https://arxiv.org/abs/2605.16346
作者: Bingyu Yan,Xiaoming Zhang,Jinyu Hou,Chaozhuo Li,Ziyi Zhou,Xiaozhe Zhang,Litian Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:LLM-based multi-agent systems (LLM-MAS) have become a promising paradigm for solving complex tasks through role specialization, tool use, memory, and collaborative reasoning. However, these interactions create new security risks that malicious instructions injected through messages, tools, or memories can propagate across agents and rounds, causing system-level compromise. Existing defenses largely rely on local filtering or graph-based anomaly detection, but they often fail to trace fine-grained propagation paths or remediate contaminated states without disrupting benign collaboration. We propose PropGuard, a propagation-aware framework for safeguarding LLM-MAS. PropGuard constructs a dual-view spatio-temporal graph that combines response-centric risk estimation with full-state evidence preservation. Guided by these risk priors, a GE-GRPO trained inspector sequentially explores the full-state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph-aware diagnosis and applies source-guided remediation to correct upstream contamination and replay affected downstream interactions. Experiments across four communication architectures and five attack settings demonstrate that PropGuard consistently lowers attack success while maintaining high task-level defense success, achieving a favorable effectiveness–efficiency trade-off.
[AI-274] Goal-Conditioned Supervised Learning for LLM Fine-Tuning
链接: https://arxiv.org/abs/2605.16345
作者: Shijun Li,Kaiwen Dong,Xiang Gao,Joydeep Ghosh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment, can directly optimize outcome quality but typically rely on external reward models and iterative rollouts, making them costly and difficult to deploy in many cases. Offline methods are more efficient, but prevailing approaches such as supervised fine-tuning (SFT) and direct preference optimization (DPO) remain limited: SFT typically collapses graded feedback into binary supervision, while DPO depends on paired preference data that is often unavailable or expensive to construct. In this paper, we propose goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs. Our core idea is to treat feedback signals directly as an explicit goal and train the model, purely through supervised learning, to generate responses that achieve that goal. To better exploit graded feedback, we further introduce a novel goal formulation that defines learning as consistently pursuing outcomes above a target quality threshold, rather than imitating samples from a selected high-quality subset. This design mitigates the bounded-learning effect of SFT and classic GCSL by explicitly guiding the model to learn the directional progression of quality. We also propose natural-language goal representations to better leverage the semantic understanding and reasoning capabilities of LLMs. We evaluate our method on three tasks: non-toxic generation, code generation, and LLM for recommendation. Results show that our approach consistently outperforms standard offline fine-tuning baselines while retaining the efficiency, scalability, and simple data requirements of supervised learning. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16345 [cs.LG] (or arXiv:2605.16345v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16345 Focus to learn more arXiv-issued DOI via DataCite
[AI-275] LoopQ: Quantization for Recursive Transformers
链接: https://arxiv.org/abs/2605.16343
作者: Rui Fang,Hsi-Wen Chen,Ming-Syan Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Looped language models (LoopLMs) improve parameter efficiency by recursively reusing Transformer blocks, enabling deeper computation under a fixed model size. However, this reuse makes LoopLMs more fragile under post-training quantization (PTQ). We present the first systematic study of quantization in LoopLMs and identify three challenges: distribution shift across roles, state reuse across loop transitions, and recursive error accumulation. To address these challenges, we propose LoopQ, a loop-aware PTQ framework that preserves a shared quantized backbone while introducing lightweight adaptations. LoopQ combines activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization to reduce distributional mismatch within loops and error accumulation across loops. Experiments across seven benchmarks show that, under W4A4 quantization, LoopQ improves average downstream accuracy by 68.8% and reduces average perplexity by 87.7% compared with the strongest static PTQ baseline.
[AI-276] Detecting Verbatim LLM Copy-Paste in Homework
链接: https://arxiv.org/abs/2605.16336
作者: Aizierjiang Aiersilan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Large language models (LLMs) have made fluent essay writing, code drafting, and quiz answering instantly available to students at every level, from secondary school through graduate study. Many educators do not object to LLM use \emphper~se; what they need to detect is the case in which a student pastes the assignment prompt into a chatbot and submits the model’s reply verbatim, without engaging with the work. Existing post-hoc AI-text detectors remain unreliable and have been shown to penalise non-native English writers, while output-side watermarks require cooperation from the model provider. We propose an alternative that the educator controls directly: an input-side watermark in which an invisible instruction is embedded inside the visible assignment prompt itself. An LLM that ingests the prompt verbatim quietly reads the hidden instruction and writes a tell-tale signature into its reply, exposing the copy-and-paste pathway specifically. We describe SteganoPrompt, a single-page, zero-dependency web tool that encodes an arbitrary printable-ASCII payload into the deprecated Unicode Tags block (\textttU+E0000–\textttU+E007F). The encoded string is visually identical to the original, survives common copy-paste channels (Word, Google Docs, PDF, Markdown, Slack, e-mail, the major learning-management systems), and is reliably tokenized by frontier models. We evaluate compliance across seven LLM families and a representative set of educational content channels. The work is informed by my experience as a graduate teaching assistant for an undergraduate software engineering course at the George Washington University. The tool is released under the MIT licence at \urlthis https URL.
[AI-277] Differentiable Optimization Layered Safety-Critical Control for Risk-Aware Navigation via Conformal Prediction
链接: https://arxiv.org/abs/2605.16327
作者: Jinyang Dong,Shizhen Wu,Yongchun Fang
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Risk-aware navigation in unknown environments is a fundamental challenge for autonomous vehicles operating in complex urban systems. To address this issue, this paper presents a differentiable optimization layered safety-critical control method based on conformal prediction. First, to handle uncertainties arising from sensor noise, the conformal prediction method is employed to generate risk-aware obstacle ellipsoids around an elliptical-shaped robot. Second, two nested differentiable optimization layers are introduced to build the control barrier functions for obstacle avoidance and feasibility guarantee, respectively. Then, a quadratic program based safety-critical control law is proposed to integrate the above control barrier function constraints as well as input constraints. In the end, the effectiveness of the proposed framework is demonstrated through numerical simulations.
[AI-278] Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry
链接: https://arxiv.org/abs/2605.16325
作者: Truong Xuan Khanh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages, 2 figures
Abstract:Phase-transition phenomena in deep learning (grokking, emergent capabilities, and ontological reorganization under context shift) have been studied through several lenses, including representational compression, singular learning theory, and information-theoretic progress measures. Independently, non-equilibrium statistical physics has identified phase transitions in driven chemical reaction networks underlying prebiotic selection, with empirical signatures that are difficult to reproduce within single-field gradient accounts. We propose a perspective in which both classes of phenomena admit a common description as driven informational systems: stochastic processes governed by two gradient fields, an entropy production rate Sigma and an information quasi-potential Phi_I := -ln p*, where p* is the stationary density. Within this framework we introduce two candidate order parameters: an adversarial breakdown threshold alpha_dagger and a self-referential coupling threshold kappa_c. The joint scaling of (alpha_dagger, kappa_c) defines a candidate universality class with exponents (gamma_1, gamma_2). We outline the geometric structure of this framework, identify falsifiable predictions distinguishing it from single-field alternatives, and show consistency with recent empirical findings (2024–2026) on alignment transitions, adversarial breakdown scaling, and partial introspection in large language models. Comments: 29 pages, 2 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16325 [cs.LG] (or arXiv:2605.16325v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16325 Focus to learn more arXiv-issued DOI via DataCite
[AI-279] AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery KDD2027
链接: https://arxiv.org/abs/2605.16320
作者: Ahmed Elmahdi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures, 1 table. Full paper in preparation for KDD 2027
Abstract:We present AdaGraph, a graph-native clustering algorithm born from the Structure-Centric Machine Learning (SC-ML) paradigm – a new field of unsupervised learning that replaces geometry-centric (distance-based) computation with structure-centric (topology-based) computation, fundamentally dissolving the curse of dimensionality. AdaGraph operates entirely within the kNN graph topology, a representation that retains meaningful relational structure in arbitrarily high dimensions where Euclidean distance metrics become uninformative. AdaGraph requires no a priori specification of the number of clusters k, handles noise natively, and scales via the SLCD (Sample-Learn-Calibrate-Deploy) prototype-deployment framework. As its unsupervised tuning objective, AdaGraph pairs with Graph-SCOPE, the topology-based cluster validity index introduced as a separate SC-ML contribution. On 10 synthetic benchmarks spanning d=10 to d=5000, Graph-SCOPE achieves mean ARI=0.900 and correctly selects k on 9/10 datasets – outperforming Silhouette, Davies-Bouldin, and Calinski-Harabasz – while maintaining Kendall tau = 0.92 with ground-truth cluster quality across all dimensionalities (Silhouette: tau ~= 0.46). We validate AdaGraph across three scientific domains: (1) gene co-expression discovery in hepatocellular carcinoma (GSE14520, 10,000 genes, 488 patients, no dimensionality reduction), where AdaGraph identifies condition-specific gene modules that WGCNA, ICA, NMF, and Spectral Biclustering fail to resolve; (2) natural language text clustering, where AdaGraph achieves ARI=0.751 on 20NG-6cat versus HDBSCAN’s 0.464 (62% relative improvement); (3) materials science clustering of superconductors (145-dimensional Magpie features), perovskites, and JARVIS-DFT materials, where AdaGraph achieves the highest Graph-SCOPE on all three datasets.
[AI-280] A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning
链接: https://arxiv.org/abs/2605.16315
作者: Arahan Kujur
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures
Abstract:We show that a threshold in decision capacity determines whether self-play reinforcement learning agents collapse under asymmetric rule perturbations. Across poker variants, matrix games, a dice game, and multiple learning algorithms, eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor, a fixed point at near-maximal loss. Preserving even a single positive-reach contingent decision point prevents this collapse. A frozen baseline and fixed-opponent control confirm that the mechanism is co-adaptation under constraint, not the perturbation itself. The phenomenon is timing-invariant, fully reversible upon action restoration, and intensifies under function approximation. These results establish a sharp threshold at zero reach-weighted contingent action capacity, with severity scaling continuously via reach-weighted capacity in the tested domains.
[AI-281] When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning
链接: https://arxiv.org/abs/2605.16312
作者: Arahan Kujur
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures, 18 tables
Abstract:We study adversarial action masking in self-play reinforcement learning: an attacker selectively removes legal actions from a victim’s action set. Unlike observation or action perturbations, removal eliminates decision options before the agent acts. Across poker games scaling from 6 to 5,531 information states and two non-poker domains, learned masking causes substantially more damage than random masking and learned perturbation baselines. The attack persists across Q-learning, PPO, NFSP, neural NFSP, and DQN victims; transfers across agents; is amplified by self-play; and shows no recovery under extended masked training. Mechanistically, the adversary targets high-value decision points, captured by reach-weighted contingent action capacity (CAC _w ) and a value-weighted refinement CAC _v . These results identify action availability as a distinct robustness surface in self-play RL.
[AI-282] UVTran: Accurate Hole-Filling Parameterization with Transformers
链接: https://arxiv.org/abs/2605.16306
作者: JunFeng Zhang
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:
Abstract:In industrial design, N-sided hole filling is typically formulated as the construction of a single trimmed B-spline surface by minimizing a fairness energy subject to geometric boundary constraints. This formulation requires an accurate parameter-space representation of the trimming curve on the filling surface. Most existing methods project the hole boundary onto a nearby plane or polygon to establish correspondence; however, they often neglect boundary heterogeneity, which can yield biased mappings, degrade fairness, and even cause filling failures. We propose UVTran, a transformer-based framework that predicts an auxiliary projection surface better to capture the geometric characteristics of the hole boundary. Exploiting B-spline locality, we design a cross-attention mechanism that biases each surface control point toward the nearby hole boundary, preserving local geometric detail. We voxelize control-point coordinates and formulate the fitting problem as a classification task, which reduces the model’s sensitivity to small numerical perturbations and noise. We adopt a progressive-resolution training strategy that injects controlled discretization errors at coarse resolutions to mimic distribution shifts, thereby mitigating overfitting and improving generalization at high resolution. On our benchmark, UVTran outperforms both industrial and academic baselines: the tolerance-satisfaction rate improves by 12% , and it consistently produces fair filled surfaces even under complex hole boundary conditions. These results suggest that UVTran yields more faithful correspondences and fairer trimmed surfaces across a wide range of N-sided holes.
[AI-283] MANTA: Multi-turn Assessment for Nonhuman Thinking Alignment
链接: https://arxiv.org/abs/2605.16301
作者: Allen Lu,Isabella Luong,Joyee Chen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow-up conversational turns introduce economic, social, or authority-based arguments. We introduce MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi-turn evaluation framework built on the Inspect AI platform that stress-tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow-up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model’s actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale. We present preliminary results from evaluations of claude-sonnet-4-20250514 and openai/gpt-4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence-based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first-order practical scenarios. We additionally present STYLEJUDGE, a controlled four-judge study demonstrating systematic format bias in LLM-as-judge evaluation, with directly actionable implications for MANTA’s scorer design. Code, dataset, and evaluation logs are available at this https URL.
[AI-284] ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
链接: https://arxiv.org/abs/2605.16299
作者: Yixu Huang,Xinglei Yu,Zhongyu Wei
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) excel at code generation but remain heavily reliant on large-scale annotated solutions and verification-based supervision, which constrains scalability and hinders sustained self-improvement. Recent solver–verifier frameworks exploit program execution as an automatic supervision signal, but their effectiveness degrades as solvers become moderately strong: verifier-generated tests increasingly confirm semantic correctness rather than exposing the remaining failure modes. We propose \textbfACE, a self-evolving code generation framework based on a solver–adversary architecture that prioritizes active failure discovery through execution-centric supervision. A single LLM alternates between generating candidate programs and producing adversarial unit test inputs optimized to induce execution-level failures, such as runtime errors, exceptions, or non-termination. Supervision is derived solely from execution outcomes: robust programs are selected for supervised fine-tuning, while adversarial tests are optimized via Kahneman–Tversky Optimization using execution-derived preferences. Notably, the entire training loop requires no ground-truth code or external reward models. Experiments on CodeContests, MBPP, and LiveCodeBench demonstrate that ACE consistently outperforms strong solver–verifier baselines, achieving 3–7% absolute gains in pass@1, with larger improvements on out-of-distribution benchmarks, while maintaining competitive or improved inference efficiency.
[AI-285] Are Researchers Being Replaced by Artificial Intelligence?
链接: https://arxiv.org/abs/2605.16294
作者: Angelo A. Salatino,Ansgar Scherp,Christin Katharina Kreutz,Sahar Vahdati
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:A Nature survey from 2023 involving 1,600 researchers shows that scientists are ``concerned, as well as excited, by the increasing use of artificial-intelligence tools in research.‘’ This tension frames our central question: Are researchers being replaced by artificial intelligence? We argue that replacement is already underway-not as disappearance, but as a shift from researcher-as-creator to researcher-as-curator. As AI agents increasingly generate hypotheses, papers, and reviews, humans risk retaining responsibility while losing intellectual ownership. This article examines how AI is reshaping the scientific lifecycle and exposes the deeper danger: not that AI will fail to do science, but that humans may stop truly understanding it.
[AI-286] From Prediction to Intervention: The Evolution of AI in Biomedicine
链接: https://arxiv.org/abs/2605.16293
作者: Andrew Feinberg,Aleksandr Sarachakov,Viktor Svekolkin,Alexander Bagaev,Ferran Prat,Michael Feinberg
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 1 table
Abstract:Artificial intelligence has advanced rapidly in biomedicine through large-scale multimodal data integration, enabling increasingly accurate prediction of clinical outcomes and patient stratification. These systems, however, remain fundamentally observational: they learn statistical associations from historical data and operate within previously observed biological and clinical states, limiting their ability to generalize to novel therapies or unobserved interventions. We argue that AI in biomedicine is undergoing a structural transition. As biomedical decision-making increasingly depends on reasoning about intervention rather than extrapolation from past observations, predictive architectures become structurally insufficient. Systems that learn from historical data cannot, by construction, represent how biological systems evolve under perturbation, and therefore cannot reliably support decision-making in the presence of novel interventions. We introduce a conceptual framework distinguishing observational and interventional intelligence and define disease-level models as systems that explicitly represent the state, dynamics, and intervention response of biological processes. These models enable a shift from inference to simulation – reasoning about what will happen under intervention rather than what is likely based on the past. This transition also implies a shift in where value is created: from data processing and prediction toward systems that support and define decision-making under intervention. It follows directly from the structure of biomedical decision-making and defines the next stage of AI in medicine. Systems that cannot model intervention will be structurally excluded from decision-making. Comments: 10 pages, 3 figures, 1 table Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.16293 [cs.CY] (or arXiv:2605.16293v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2605.16293 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Aleksandr Sarachakov [view email] [v1] Tue, 14 Apr 2026 17:49:51 UTC (1,455 KB)
[AI-287] Evidence of a Cognitive Shift in AI Education: How Students Are Rethinking Human Intelligence? ICLR
链接: https://arxiv.org/abs/2605.16292
作者: Islem Rekik
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: ICLR HCAIR Workhop 2026 this https URL
Abstract:Perceptions of intelligence shape how learners evaluate and rely on artificial intelligence (AI) systems. Despite rapid advances in AI capabilities, the impact of sustained exposure to these tools on students’ valuation of human intelligence (HI) relative to AI remains underexplored. This paper presents a longitudinal analysis of classroom poll responses collected between 2020 and 2026 in AI-focused undergraduate and MSc courses in computer science. Data from 471 students across technical courses (such as Machine Learning and Deep Graph Learning) and design-oriented courses (such as Design Thinking for AI) reveal four recurring phases: hype, distrust, trust, and dependency. While early responses in 2020 slightly favored AI, a consistent shift toward HI emerged from 2024 onward across all MSc cohorts. By 2026, preference for HI reached 65 percent in a technical course (a 12 percentage-point increase from 2025) and 90 percent in a design-oriented course (a 36 percentage-point increase). These findings suggest a gradual reappraisal of human intelligence as AI becomes a routine tool, with implications for learner autonomy and epistemic agency. We conclude by reflecting on this cognitive shift from favoring artificial intelligence toward prioritizing human intelligence.
[AI-288] AI of the People by the People for the People: A Social Choice Approach to Collective Control of Artificial Intelligence
链接: https://arxiv.org/abs/2605.16291
作者: Paul Anton Bachmann,Niclas Boehmer,Lukas Daniel Klausner,Martin Lackner
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Accepted for publication in Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
Abstract:With the growing adoption of AI systems, reasoning about how society can exert control over AI becomes an increasingly urgent problem. Existing work on democratic control largely focuses on macro-level governance. In contrast, we propose a new approach grounded in social choice theory, which we term collective control of artificial intelligence. We argue that collective input can and should be incorporated at multiple points across the ML development pipeline, from data collection through objective design to alignment. We further demonstrate that social choice provides a well-suited modelling language for the treatment of collective input across all stages and that its axiomatic methodology yields principled criteria for evaluating various control mechanisms. Overall, our conceptual contribution provides a mathematically grounded framework to implement and analyse collective control of AI systems.
[AI-289] MCQ Difficulty Prediction via Modeling Learner Heterogeneity Using Data-Driven Cognitive Profiling
链接: https://arxiv.org/abs/2605.16290
作者: Dhriti Krishnan,Jaromir Savelka
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Predicting the difficulty of multiple-choice questions (MCQs) is important for effective assessment, yet current methods typically assume a unimodal student ability distribution, overlooking the heterogeneous nature of student misconceptions. We propose a persona-driven framework that replaces theoretical ability sampling with data-driven cognitive profiling. Using student interactions from the EEDI dataset, we identify behavioral personas via latent class analysis (LCA), then condition a large language model (LLM) to simulate response distributions for each persona. These signals are aggregated with topic context and fed into a Ridge Regression model to predict the item response theory (IRT) difficulty parameter. With five-fold cross-validation, our method improves over a recent baseline (MSE: 0.367 to 0.274; R2: 0.525 to 0.686). The discovered personas are interpretable and offer insights into why items are difficult, with potential applications to diagnostic assessment design.
[AI-290] Homoglyph-based Adversarial Perturbation of Introductory Computer Science Theory Problems
链接: https://arxiv.org/abs/2605.16286
作者: Aidan Alexander,Chitrangada Juneja,Napaluck Tontrasathien,Miro Vanek,Reyan Ahmed,Saumya Debray,Sazzadur Rahaman
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Different AI tools such as ChatGPT, Gemini, and Claude are becoming very popular. Although they are helpful for many day-to-day tasks, they can be used in unexpected ways. For example, the learning objectives of a course may not be achieved if students use these tools to solve their homework problems. This paper proposes a simple method to address this issue in the lazy student model. The method uses homoglyph-based adversarial perturbation to first modify the question without changing the semantic meaning of the question. Then a few characters are perturbed by their homoglyphs. Our experimental result shows the theoretical problems of introductory computer science courses can be effectively perturbed. We also propose an interactive tool to conveniently use our method.
[AI-291] Measuring Changes in Instructor Class Design and Student Learning After the Release of Large Language Models (LLM s)
链接: https://arxiv.org/abs/2605.16284
作者: Amanda Potasznik,Daniel Haehn
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Student use of Generative AI (GenAI) products in completing their classwork, with or without their professors’ knowledge and/or approval, has resulted in substantial shifts in higher education. While GenAI use is widespread, its impact on student study methods, faculty course development, grade reporting, and overall learning is not well documented. This is a mixed-methods, multi-course study using retrospective quantitative analysis, instructor surveys, and anonymous student surveys at a university in the New England region of the United States. This research seeks to identify and document patterns in student and faculty perceptions of, and experiences in, the use of LLMs as a learning tool inside and outside of the university classroom. Alongside quantitative and thematic analysis of both faculty and student survey responses, historical grade data as reported to the university registrar is used to triangulate the phenomenon of learning achievement in pre- and post-LLM eras. It is hoped that this research can serve as a pilot study for a broader set of institutions. Results from this study can inform GenAI policy for professors, universities, and other educational institutions that are trying to maximize student learning in the age of AI.
[AI-292] he Recovery Mechanism: Technology Education and What Happens When the Pattern Breaks
链接: https://arxiv.org/abs/2605.16283
作者: Aysa X. Fan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:For centuries, each new technology has automated some layer of cognitive work and been absorbed by education retreating upward to teach the skills machines could not yet reach. Generative AI may be the first technology to break that pattern, because it now operates at the top of the cognitive ladder, where education has always escaped to. The risk is not that AI replaces teachers but that it replaces the productive struggle through which understanding forms. Drawing on historical analysis, labor economics, and new large-scale data on how students and workers actually use AI, this essay surfaces a paradox: the same technology that augments today’s skilled workforce may be quietly eroding the developmental process that produces tomorrow’s. Current assessment tools cannot yet distinguish students who are building capacity from those who are losing it. The essay argues this is a measurement problem first and a design problem second, and proposes a research agenda focused on learning outcomes rather than usage patterns. Ultimately, it asks what education should become once AI can perform the cognitive work education was built to develop, and offers directions rather than a destination. Capacities like judgment, character, and epistemic identity have not been central to mainstream educational taxonomies, because earlier technologies did not require education to reach so high.
[AI-293] axonomy and Consistency Analysis of Safety Benchmarks for AI Agents
链接: https://arxiv.org/abs/2605.16282
作者: Miles Q. Li,Benjamin C. M. Fung,Boyang Li,Heba Ismail,Farkhund Iqbal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall’s W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.
[AI-294] From Reactive to Proactive: A Multi-Regulatory Empirical Analysis of 480 AI Incidents and a Data-Driven Governance Compliance Framework
链接: https://arxiv.org/abs/2605.16281
作者: Ummara Mumtaz,Summaya Mumtaz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence systems are increasingly deployed in high-stakes domains, yet it remains unclear whether existing governance frameworks ensure accountability after deployment. This study makes two contributions. First, it presents a cross-regulatory empirical analysis of 480 real-world AI incidents from the AI Incident Database (AIID), evaluating their alignment with post-deployment provisions in three major governance frameworks: the EU AI Act (Articles 72-73), the NIST AI Risk Management Framework (MANAGE and GOVERN functions), and the General Data Protection Regulation (GDPR Articles 22, 33-35). The results reveal substantial governance gaps across these frameworks, indicating persistent weaknesses in post-deployment accountability. Second, based on these findings, the study proposes the Proactive AI Governance Compliance Framework (PAGCF), a four-phase lifecycle methodology designed to shift governance from reactive incident response toward pre-deployment compliance assurance. The framework includes risk-stratified governance tiers, an implementation checklist linked to specific regulatory provisions, and a projected impact analysis that uses internal monitoring as a proxy for proactive governance capacity.
[AI-295] Beyond Imperfect Alternatives with Rulemapping: A Neuro-Symbolic Case Study on Online Hate Speech
链接: https://arxiv.org/abs/2605.16280
作者: Oskar von Cossel
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Extended version of a paper accepted at ICAIL 2026. 10 pages, 1 figure, 2 tables
Abstract:Automating legal reasoning forces a choice between imperfect alternatives: symbolic systems offer transparency but struggle with ambiguity, whereas neural systems handle natural language flexibly but lack verifiability. This paper investigates whether a hybrid, neuro-symbolic approach can reconcile this trade-off. We evaluate this architecture in the domain of online content moderation, which serves as a proxy for high-volume legal decision-making such as mass administrative proceedings. In these settings, operators must assess thousands of cases daily under strict legal standards. Specifically, we examine whether constraining large language models (LLMs) within deterministic symbolic scaffolds improves statute-grounded illegality assessment while preventing “scope drift” (where LLMs conflate moral offensiveness with legal illegality). We evaluate the neuro-symbolic variant of Rulemapping - a visual logic-tree method that operationalises the classic legal syllogism - on online hate-speech classification under §130(1) of the German Criminal Code. Across diverse LLMs, Rulemapping maintains high recall (0.82-0.89) while achieving precision of 0.80-0.86, compared to 0.34-0.49 for unconstrained prompting. Expert-authored symbolic scaffolds thus enable robust legal automation aligned with regulatory requirements for auditability and verifiable decision-making.
[AI-296] Generative AI and Two-Tiered Online Mental Health Communities
链接: https://arxiv.org/abs/2605.16279
作者: Manyang Zhang,Jinyang Zheng,Zhijun Yan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Online mental health communities (OMHCs) are tiered platforms that connect patients with licensed counselors through public QA forums and paid private consultations. Their two-tier structure creates a strategic dilemma for genAI integration. Conversational agents can provide scalable and timely responses to a broader set of patients, alleviating persistent supply shortages, but their large-scale presence may also reshape counselors’ participation in providing nuanced expertise, emotionally sensitive support, and paid consultations, which are central to platform revenue and long-run sustainability. Leveraging a quasi-natural experiment from the integration of a genAI-based conversational agent in a leading OMHC, we examine how AI entry affects counselor participation. Using multiple identification strategies, we find that posting intensity increases significantly after AI integration, while average response length remains unchanged and per-post social recognition declines. Mechanism analyses show that AI improves responsiveness and expands patient engagement, enlarging counselors’ opportunity sets, with activity partially reallocated from a nearby non-AI subforum. Counselors respond heterogeneously: intrinsically motivated counselors reduce participation, whereas economically motivated counselors intensify competitive effort. These dynamics generate cross-tier spillovers: inactive counselors experience declines in paid consultations, while those who increase public participation preserve or expand downstream demand. Overall, our findings show that in tiered professional platforms, demand expansion and competitive incentives can outweigh intrinsic crowding-out.
[AI-297] Generative AI in K-12 Classrooms: A Midyear Implementation Report
链接: https://arxiv.org/abs/2605.16277
作者: Lief Esbenshade,Alex Liu,Michael Xiao,Zewei Tian,Min Sun,Zachary Zhang,Thomas Han,Yulia Lapicus,Kevin He
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This mid-year report summarizes teacher use of Colleague AI across 12 Washington State school districts from September 1 to December 31, 2025. Produced jointly by Colleague AI and this http URL at the University of Washington, this report aggregates platform data and district-provided administrative records to provide an early look at how teachers engaged with AI during the first half of the 2025-26 school year. The districts vary in size from small districts with a few thousand students to large districts with up to thirty thousand students. The districts are rural, suburban, and urban. Only a subset of districts were able to provide mid-year administrative data, and findings that link teachers’ use of Colleague AI to student characteristics should be interpreted as preliminary signals.
[AI-298] Agent Wall: A Runtime Safety Layer for Local AI Agents
链接: https://arxiv.org/abs/2605.16265
作者: Ashwin Aravind
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 16 pages, 2 figures, open-source implementation at this https URL
Abstract:The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the consequences of unsafe or adversarially manipulated behavior become immediate and tangible. Existing AI safety work has focused primarily on model alignment and input filtering, but these approaches do not address what happens at the moment an agent’s intent becomes a real action on a real machine. This gap is especially acute in local environments, where developers run agents against their own filesystems, credentials, and infrastructure with little runtime control. This paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment, evaluates it against an explicit declarative policy, requires human approval for sensitive operations, and records a complete execution trail for audit and replay. It is implemented as a policy-enforcing MCP proxy and native OpenClaw plugin, working across Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw with a single install command. We present the design, architecture, threat model, and policy model of AgentWall, and demonstrate 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests. AgentWall is open-source at this https URL.
[AI-299] Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
链接: https://arxiv.org/abs/2605.16259
作者: Yoichi Ochiai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS at 512x512 resolution. The primary contribution of this work is the systematic demonstration that optimization insights established for CUDA are not necessarily effective on Apple Silicon’s unified memory architecture. We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs – including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models – and provide practical guidelines for diffusion model inference on Apple Silicon.
[AI-300] Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning
链接: https://arxiv.org/abs/2605.18656
作者: Arnab Auddy,Xiangni Peng,Subhadeep Paul
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:
Abstract:Federated Learning is a leading framework for training ML and AI models collaboratively across numerous user devices or databases. We study the trade-offs among estimation accuracy, privacy constraints, and communication cost for differentially private (DP) federated M estimation. The two standard methods in the literature are FedAvg, which may suffer from high federation bias, and FedSGD, which can incur high communication cost. Aimed at improving accuracy at a reduced communication cost, we propose FedHybrid, which uses FedSGD starting with an improved initialization by the FedAvg estimator. We propose FedNewton, which averages local Newton iterations to reduce bias in FedAvg, achieving an estimation accuracy comparable to FedSGD with much fewer communication rounds when the number of clients grows sufficiently slowly. We establish finite sample upper bounds on the mean-squared error rates of the DP versions of these estimators as functions of the number of clients, local sample sizes, privacy budget, and number of iterations. We further derive a minimax lower bound on the MSE of any iterative private federated procedure that provides a benchmark to assess the optimality gap of these methods. We numerically evaluate our methods for training a logistic regression and a neural network on the computer vision datasets MNIST and CIFAR-10.
[AI-301] Estimating Item Difficulty with Large Language Models as Experts
链接: https://arxiv.org/abs/2605.18562
作者: Diana Kolesnikova(1),Kirill Fedyanin(2),Abe D. Hofman(3 and 4),Matthieu J. S. Brinkhuis(5),Maria Bolsinova(1) ((1) Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands, (2) Smart Business Technologies, Belgrade, Serbia, (3) Department of Psychological Methods, University of Amsterdam, Amsterdam, Netherlands, (4) Prowise Learn, Amsterdam, Netherlands, (5) Department of Information and Computing Sciences, Utrecht University, Utrecht, Netherlands)
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注: 24 pages, 2 figures, 9 tables
Abstract:Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined 6 domains of primary-school mathematics, with empirical difficulty estimates treated as empirical reference. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Pairwise comparison consistently outperformed absolute judgement in the absence of additional refinements. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, the absolute judgement configuration likewise demonstrated moderate-to-high alignment. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.
[AI-302] AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers
链接: https://arxiv.org/abs/2605.18476
作者: Jungang Zou,Alex Ziyu Jiang,Qixuan Chen
机构: 未知
类目: Computation (stat.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Coding and computation remain major bottlenecks in Markov chain Monte Carlo (MCMC) workflows, especially as modern sampling algorithms have become increasingly complex and existing probabilistic programming systems remain limited in model support, extensibility, and composability. We introduce \textbfAI4BayesCode, an extensible LLM-driven system that translates natural-language Bayesian model descriptions into runnable, validated MCMC samplers. To improve reliability, AI4BayesCode adopts a modular design that decomposes models into modular sampling blocks and maps each block to a built-in sampling component, reducing the need to implement complex sampling algorithms from scratch. Reliability is further improved through pre-generation validation of model specifications and post-generation validation of generated sampler code. AI4BayesCode also introduces a novel recursively stateful coding paradigm for MCMC, allowing modular sampling components, potentially developed by different contributors, to be composed coherently within larger MCMC procedures. We develop a benchmark suite to evaluate AI4BayesCode for sampler-generation. Experiments show that AI4BayesCode can implement a wide range of Bayesian models from natural-language descriptions alone. As an open-ended system, its capability can continue to expand with improvements in the underlying AI agent and the addition of new built-in blocks.
[AI-303] Flowing with Confidence
链接: https://arxiv.org/abs/2605.18472
作者: Friso de Kruiff,Dario Coscia,Max Welling,Erik Bekkers
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Generative models can produce nonsensical text, unrealistic images, and unstable materials faster than simulation or human review can absorb; without per-sample confidence, trust erodes. Existing fixes run k ensembles or stochastic trajectories at k\times compute, measuring variability between models, not model confidence. We propose Flow Matching with Confidence (FMwC). FMwC injects input-dependent multiplicative noise at selected layers, propagates its variance through the network in closed form, and integrates it along the ODE trajectory, yielding a per-sample confidence score at standard sampling cost. The score supports multiple uses: filtering improves image quality and thermodynamic stability of crystals; editing rewinds trajectories to the points where the model commits and redirects them; and adaptive stepping concentrates ODE compute where the flow is ambiguous. We find that the confidence score correlates with the magnitude of the divergence of the learned velocity field, which gives us a window to understand the generative process, opening up surgical forms of guidance that target the moments that matter, new sampling algorithms and interpretability of generative models.
[AI-304] Qumus: Realization of An Embodied AI Quantum Material Experimentalist
链接: https://arxiv.org/abs/2605.18407
作者: Lihan Shi,Zhaoyi Joy Zheng,Xinzhe Juan,Yimin Wang,Ming Yin,Mayank Sengupta,Kristina Wolinski,Yanyu Jia,Jingzhi Shi,Derek Saucedo,Neill Saggi,Haosen Guan,Kenji Watanabe,Takashi Taniguchi,Ali Yazdani,Mengdi Wang,Sanfeng Wu
机构: 未知
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 29 Pages in total. Supplementary Demo Videos are available at this https URL
Abstract:While modern Large Language Models (LLMs) and agentic artificial intelligence (AI) have demonstrated transformative capabilities in digital domains, the realization of embodied AI capable of real-world scientific discovery remains a difficult frontier. The advancements are hindered by the inherent complexity of integrating high-level reasoning, multimodal information processing and real-time physical execution. Here we introduce Qumus, the first AI quantum materials experimentalist. Physically embodied within a robotic mini-laboratory, Qumus is an intelligent, multimodal, and multi-agent system designed for the creation and nano-processing of atomically thin two-dimensional (2D) materials and stacked van der Waals (vdW) structures. Qumus autonomously navigates the full scientific cycle, from hypothesis generation and protocol planning to multi-step experimental execution, result analysis and reporting, acting as an experimentalist. Markedly, the system has achieved, for the first time, the AI-creation of graphene, as well as the first AI-fabrication of complex nanodevices including atomically thin field-effect transistors via vdW stacking. Qumus excels at these tasks by demonstrating autonomous error correction and closed-loop experimentation. Our results establish a generalizable framework for self-improving embodied AI systems that learn directly from the quantum world, opening a pathway toward accelerated discovery in quantum materials, electronics and beyond.
[AI-305] Probing SMEFT Operators through tbarttbart Production with Hyper-Graph Neural Networks at the LHC
链接: https://arxiv.org/abs/2605.18382
作者: Amir Subba,Sanmay Ganguly
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
备注: 16 pages, 9 figures, 3 tables. Comments are welcome
Abstract:We present a phenomenological study of t\bartt\bart production in proton-proton collisions at \sqrts = 13 ~TeV, using a Hyper-Graph Neural Network (H-GNN) to discriminate multilepton signal events from the dominant SM backgrounds, namely t\bartW , t\bartZ , t\bartH , t\bartVV , single-top associated production, and diboson and triboson processes. In the H-GNN architecture each event is represented as a hypergraph whose nodes correspond to reconstructed jets and leptons and whose hyperedges encode higher-order correlations among arbitrary subsets of these objects, allowing the network to learn the many-body kinematic structures that characterize the t\bartt\bart final state. Combining same-sign di-lepton, tri-lepton, and four-lepton channels following a CMS-like event selection, the H-GNN attains an area under the ROC curve of 0.951 for the t\bartt\bart signal and yields a statistical significance of Z = 9.11 at an integrated luminosity of \mathcalL = 140~\mathrmfb^-1 , to be compared with Z = 8.62 for a SPANet baseline, Z = 7.37 for a Particle Transformer baseline, and Z = 5.13 obtained by the ATLAS analysis, evaluated under identical event selection. We exploit the improved signal extraction to derive one- and two-parameter 95% confidence level limits on the Wilson coefficients of the dimension-six operators \mathcalO_\Phi u , \mathcalO^(1)_tt , \mathcalO^(1)_qq , \mathcalO^(1)_qt , and \mathcalO^(8)_qt , and we project the expected sensitivity at the HL-LHC integrated luminosities of 1000~\mathrmfb^-1 and 3000~\mathrmfb^-1 with 50% uncertainty on the background estimation.
[AI-306] Symmetry-Compatible Principle for Optimizer Design: Embeddings LM Heads SwiGLU MLPs and MoE Routers
链接: https://arxiv.org/abs/2605.18106
作者: Tim Tsz-Kit Lau,Weijie Su
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
[AI-307] Parameterized 4-Qubit EWL Quantum Game Circuits with Dirac-Solow-Swan Hamiltonian Integration for Quadruple Helix Disruptive Innovation Recommender Systems
链接: https://arxiv.org/abs/2605.18080
作者: Agung Trisetyarso,Fithra Faisal Hastiadi,Kridanto Surendro
机构: 未知
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Theoretical Economics (econ.TH)
备注: Submitted to Quantum
Abstract:We present a novel parameterized 4-qubit Eisert-Wilkens-Lewenstein (EWL) quantum game circuit for recommender systems in quadruple helix innovation ecosystems (academia, industry, government, and civil society). The local strategy operators U_i = R_y(\theta_i) for each helix actor are directly tuned by normalized dominance weights extracted from real participant funding data (\texitecContribution) in the European Commission CORDIS Horizon Europe database (project COVend, ID 101045956). The circuit employs a multi-qubit EWL entangler followed by parameterized local rotations, inverse entangler, and full measurement, achieving only 22 gates and circuit depth 11 while scaling as O(n) for n -round helix communications. Measurement probabilities after the quantum game serve as recommender scores for disruptive versus sustaining innovation trends. These scores are subsequently mapped into the diagonal Dirac potential of a Dirac-Solow-Swan Hamiltonian, enabling time-evolution simulation of capital accumulation and bifurcation dynamics under disruptive innovation. Numerical experiments on real CORDIS quadruple-helix collaboration networks demonstrate the circuit’s NISQ compatibility and its ability to forecast disruptive capital trajectories with high fidelity. The proposed framework bridges quantum game theory, parameterized quantum circuits, and relativistic economic growth models, offering a computationally efficient tool for innovation policy and strategic decision-making in complex socio-economic ecosystems. Complexity analysis and reproducibility are provided through open Qiskit implementations. Comments: Submitted to Quantum Subjects: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Theoretical Economics (econ.TH) Cite as: arXiv:2605.18080 [quant-ph] (or arXiv:2605.18080v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2605.18080 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-308] Quantum Sidecar Architectures for Hybrid AI Training and Inference: Stateful Protected Registers Stateless Reset-and-Reprepare Circuits and Quantum Weight-State Outlook
链接: https://arxiv.org/abs/2605.18031
作者: Y.Mo,G.D.Su
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures. Architecture and small-scale simulation study; no hardware experiment or quantum-advantage claim
Abstract:We propose a quantum sidecar architecture family for future hybrid AI training and inference. The central idea is not to store an entire Transformer in a small quantum memory, nor to claim one-shot collapse into a fully trained model or an optimal answer. Instead, we identify two physically distinct operating modes for quantum co-processors attached to classical large-model pipelines. The first is a stateful protected-register mode, in which a protected register stores a reusable quantum resource while an ancilla or temporary register performs QND-style readout. The second is a stateless reset-and-reprepare mode, in which each query prepares a task-conditioned quantum circuit, evolves over bounded training or inference control variables, measures candidate signals, resets the qubits, and repeats. We simulate the stateful mode using 2/4/6/8 protected-qubit density-matrix QND-style parity readout with one ancilla and a Qiskit cross-check. For the stateless mode, we include both an abstract candidate-update sampler and a circuit-level QAOA-style statevector sampler over structured candidate landscapes, followed by reset-overhead sensitivity analysis. The resulting framework positions quantum sidecars as bounded signal generators for optimizer-side sampling, adapter or expert selection, retrieval, routing, and reasoning-path proposal. As a speculative outlook, we introduce quantum weight-state sidecars: restricted quantum representations over model-control variables, not direct encodings of complete classical weight tensors.
[AI-309] raining Infinitely Deep and Wide Transformers
链接: https://arxiv.org/abs/2605.17660
作者: Raphaël Barboni,Maarten V. de Hoop,Takashi Furuya,Gabriel Peyré
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.
[AI-310] Controlling False Discovery in Arbitrarily Structured Hypothesis Spaces via Reproducing Kernels
链接: https://arxiv.org/abs/2605.17559
作者: Binyamin Perets,Shie Mannor
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
备注: 9 pages
Abstract:Large-scale hypothesis testing is central to modern science, where controlling the False Discovery Rate (FDR) has become the standard approach to managing false positives across many simultaneous tests. Hypotheses rarely exist in isolation; they often exhibit structure through proximity, connectivity, or hierarchy. This structure represents both a challenge and an opportunity: while classical methods treat these dependencies as obstacles requiring conservative correction, leveraging them can substantially increase discovery power. Here, we reframe structured FDR control as a regularized learning problem. By optimizing within a suitable Reproducing Kernel Hilbert Space (RKHS), we introduce a framework that unifies continuous domains, graphs, and hierarchies under a single algorithm through kernel choice alone. This formulation enables smooth solutions in place of the piecewise-constant fits of prior methods, principled likelihood-based hyperparameter selection rather than heuristic tuning, and inference at unobserved locations which in turn supports sample-efficient experimental design. Building on this estimator, we provide two decision rules which we prove to control the FDR. We validate our method on two sources: spatial locations derived from high-dimensional real-world datasets, and a differential gene expression task utilizing protein-protein interaction graphs.
[AI-311] Deep Reinforcement Learning Framework for Diversified Portfolio Management Across Global Equity Markets
链接: https://arxiv.org/abs/2605.17307
作者: Kamil Kashif,Robert Ślepaczuk
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Trading and Market Microstructure (q-fin.TR)
备注: 67 pages, 11 figures, 16 tables
Abstract:This study develops and evaluates a deep reinforcement learning framework for dynamic portfolio allocation across global equity markets. The Soft Actor-Critic algorithm is used to learn continuous portfolio weights within a Markov Decision Process, incorporating transaction costs, turnover penalties, and diversification constraints into the reward function. Five model configurations are compared, varying in reward formulation, policy structure (flat versus hierarchical Dirichlet), portfolio constraints, and temporal encoder (LSTM versus Transformer), and evaluated via walk-forward optimization across sixteen out-of-sample folds spanning 2003-2026 on the Nasdaq-100, Nikkei 225, and Euro Stoxx 50. Results show that RL strategies achieve competitive risk-adjusted performance primarily in the Euro Stoxx 50, where statistically significant abnormal returns are observed, but the central hypothesis is only partially confirmed: no strategy achieves statistically significant excess returns relative to Buy and Hold under HAC-robust inference across all markets. Regime analysis reveals that RL adds the most value during periods of elevated uncertainty, while ensemble aggregation across markets improves risk-adjusted performance and confirms the benefits of geographic diversification.
[AI-312] Global Automation Atlas
链接: https://arxiv.org/abs/2605.17086
作者: Prashant Garg,Tommaso Crosta,Jasmin Baier
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP)
备注: 65 pages, 6 figures. Data and code: this https URL
Abstract:Automation affects the labour content of work differently across different contexts. Yet, most existing exposure measures assign fixed scores to tasks or occupations, limiting comparisons of automation exposure across countries. We develop a task-based and country-specific approach to classify automation exposure across the world to disentangle labor-substituting from labor-augmenting automation, the relevant technology channel, and the material role of AI. Our measure spans 124 countries, generating an atlas of 2.33 million task-country labels for economies covering 99% of world population and GDP. We present five descriptive results. First, exposure is highly uneven, ranging from 3.3% of tasks in South Sudan to 61.6% in China, and rises strongly with income, although substantial variation remains within income groups. Second, across countries, exposed tasks are skewed towards substitution rather than augmentation, but low-income countries are disproportionately exposed to substitution, whereas middle-income countries are more heterogeneous. Third, less technologically advanced forms of automation account for more than half of exposed tasks in low-income countries but about one quarter in high-income countries; while other more complex channels generally rise with income levels. Fourth, AI tends to be less prevalent in simpler channels of automation, but also more prevalent in labour-substituting margins in lower income settings and to augment labour in higher income settings. Fifth, we find that females seem to be disproportionately more exposed to labour-substituting automation than males. Our methodology provides a basis for comparing automation exposure across development stages, linking it with cross-country data and allowing us to treat exposure levels, labour margins, technological channels and AI involvement as separate dimensions.
[AI-313] Prediction-Intervention Games and Invariant Sets
链接: https://arxiv.org/abs/2605.16828
作者: Linus Kühne,Felix Schur,Jonas Peters
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:
Abstract:We consider the following two-player game: using observational data, the leader chooses a prediction function for a response variable Y from given covariates. The follower then reacts with an intervention on some covariates in the underlying structural causal model to maximize their own objective. The leader knows the intervention targets, but may have limited knowledge of the follower’s objective. We call this setup a prediction-intervention game, a special case of a Stackelberg game. Finding an optimal strategy for the leader is generally difficult. To avoid severe performance loss, the leader may base their prediction on the causal parents of Y , or more generally on an invariant subset of covariates. We prove, for two common classes of follower objectives, that predictors based on the stable blanket, a specific invariant subset, are always better or as good as those based on the causal parents. We further upper bound the leader’s post-intervention risk by a worst-case risk over allowed interventions and strengthen existing distribution generalization results to analyze this bound: we give sufficient conditions under which stable-blanket predictors are worst-case optimal, and show by examples that these conditions cannot in general be dropped. Finally, we discuss practical strategies for settings with known and unknown graph, and test them on simulated and real-world data.
[AI-314] Universal Dynamics of Punctuated Progress
链接: https://arxiv.org/abs/2605.16719
作者: Yian Yin,Dashun Wang
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Scientific and technological frontiers advance through punctuated dynamics, yet the principles governing these dynamics remain poorly understood. Here we collect and analyze datasets tracking the evolution of frontiers across 9 different domains, spanning materials discovery, structural biology, AI, computational biomedicine, data science, theoretical computer science, Formula-1 racing, and physical wheel building. Analyzing 6.8M solutions to 6.7K tasks, we uncover three universal patterns: (1) waiting times between new frontiers are heavy-tailed, with most attempts concentrated in long stasis; (2) frontier records accumulate at a sublinear rate, faster than logarithmic yet slower than linear growth; (3) record-breaking events are temporally correlated, generating short-term predictability yet long-term unpredictability. Despite the differences in the scale, scope, and definition of the settings, these patterns are remarkably consistent across all domains we study, and are not captured by models from complex systems, record statistics, economics of innovation, and cultural evolution. We trace the missing ingredient to the distinction between radical and incremental innovation, and develop a minimal, analytically solvable model incorporating both radical resets that restructure what is achievable and incremental refinements that exploit the current frontier. The simple model reproduces all three empirical regularities. Remarkably, the leading-order predictions are parameter-independent, identifying a new universality class governing punctuated progress and yielding testable predictions about how openness and access to frontier solutions shape the pace of advance. Overall, these results reveal universal dynamics governing punctuated progress and identify the interplay between radical resets and incremental refinements as the key driver of how scientific and technological frontiers advance.
[AI-315] Isotonic Survival Regression: Calibrated Survival Distributions from Deep Cox Models
链接: https://arxiv.org/abs/2605.16571
作者: Anchit Jain,Kevin Zhang,Stephen Bates
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Time-to-event data is widespread across the life sciences and engineering, but it is typically encountered together with censoring, which complicates the application of standard machine learning methods. Deep Cox models have emerged as a popular method for analyzing time-to-event data because they gracefully handle censoring and can be used with unstructured data such as clinical text reports, genomic sequences, and pathology images. However, their predicted survival probabilities are often poorly calibrated, thus limiting their practical utility. In this paper, we propose a novel post hoc calibration method for Deep Cox models that uses isotonic regression to refine predicted survival probabilities without affecting discriminative power. We establish favorable theoretical guarantees, including a double-robustness property and asymptotic calibration. Experiments on synthetic and real-world clinical data demonstrate the empirical effectiveness of our method.
[AI-316] MoleCode unlocks structural intelligence in large language models
链接: https://arxiv.org/abs/2605.16480
作者: Zhiyuan Yan,Chen Liu,Boxuan Zhao,Kaiqing Lin,Jixiang Zhao,Yimi Wang,Liuzhenghao Lv,Hao Li,Shanzhuo Zhang,Li Yuan,Fanyang Mo
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:
Abstract:Molecules are graphs, but large language models~(LLMs) are usually asked to reason about them through linear strings. The most popular molecular representation, SMILES, compresses atoms, bonds, branches and rings into a compact sequence in which topology is implicit, forcing LLMs to reconstruct molecular structure before performing the requested chemical operation. Here we introduce MoleCode, an LLM-native, training-free, graph-explicit molecular language in which all molecular components are represented as typed entities with persistent identifiers and explicit relations. MoleCode makes molecular topology directly readable, editable and auditable within the language context, allowing an LLM to operate on structure rather than recover it from syntax. Across molecular reasoning, editing, generation and analysis tasks, this representational shift improves frontier LLMs most strongly when structural access is limiting: unfamiliar molecules, topology-sensitive operations, larger structures and repetitive polymers. It also changes how inference is allocated, replacing long reasoning traces devoted to implicit structural reconstruction with shorter, more chemically directed reasoning over explicit atoms and bonds. In molecular optimization, this enables localized, property-aligned edits that preserve structural similarity to the starting compounds. The same Subgraph–Node–Edge grammar extends beyond small molecules to polymers, Markush structures, mechanism-style transformations and interleaved scientific documents, including research articles and patent disclosures in which chemical information is distributed across text and images. These results suggest that the interface between scientific objects and LLMs should not treat structure as something to be decoded from text. When the object of reasoning is relational, the structure itself should be part of the language.
[AI-317] Overcoming the Intrinsic Performance Limitations of MEMS IMU via Diffusion-Based Generative Learning
链接: https://arxiv.org/abs/2605.16391
作者: Jiarui Lv,Feng Zhu,Xiaohong Zhang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Inertial measurement units (IMUs) are fundamental sensing components in multi-source integrated navigation systems, and their performance directly determines the accuracy and reliability of solutions. However, the precision of low-cost IMUs is inherently constrained by hardware limitations. Recently, generative artificial intelligence has demonstrated remarkable capability in modeling complex data distributions and reconstructing high-fidelity signals. Motivated by this, we propose a diffusion-based generative learning framework for synthesizing high-fidelity virtual IMU data from low-cost IMU measurements. Specifically, a conditional diffusion model based on a U-Net architecture is constructed, where high-grade IMU measurements are utilized as ground-truth priors and low-cost IMU measurements are employed as conditional inputs. The virtual IMU data generated by the model is used for subsequent navigation and localization tasks. Experimental results demonstrate that the generated virtual IMU data significantly outperform the original low-cost IMU measurements in both positioning and attitude estimation. Furthermore, we transfer the model to airborne mapping experiments, where the proposed method produces thinner and more consistent point clouds. Overall, the proposed framework breaks the performance limits of low-cost IMU and demonstrates the potential of diffusion-based generative learning for virtual high-grade IMU data.
[AI-318] Retrieval and competition: how a protein foundation model starts a protein
链接: https://arxiv.org/abs/2605.16331
作者: Piotr Jedryszek,Oliver M. Crook
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:
Abstract:Protein language models are increasingly used to guide experimental and clinical decisions, yet it is often unclear whether a confident prediction reflects recognition of biological evidence or retrieval of a statistical default. We examine this distinction for a near-universal biological rule, that proteins begin with methionine, by tracing the computational pathway through which ESM2-8M produces this prediction. The model does not detect methionine at the masked position. Instead, it retrieves a methionine-favouring signal from a reference representation at the beginning-of-sequence token via a position-specific query assembled across layers, with the final output emerging through competition with context-dependent circuits. To understand how positional information reaches the readout, we introduce a norm-direction decomposition of attention scores within rotary frequency bands. Positional encoding operates through coupled changes in query norm and angular alignment distributed across these bands. On sequences whose true N-terminus is not methionine, where the biological question matters, the model predicts methionine anyway. This is not a correct prediction produced by an unexpected mechanism, but the output of a positional-prior retrieval circuit that matches the statistical average and fails where biology diverges from it. Distinguishing the two requires resolution at the level of individual circuits, frequency bands, and query composition, suggesting that mechanistic verification will be necessary, and challenging, for predictions where the biological stakes are higher. Even for the simplest biological rule, the model’s prediction is mediated by a distributed computational circuit rather than direct recognition, suggesting that increasing task complexity will further obscure the relationship between model confidence and underlying biological evidence.
[AI-319] A Machine Learning Framework for EEG-Based Prediction of Treatment Efficacy in Chronic Neck Pain
链接: https://arxiv.org/abs/2605.16326
作者: Xiru Wang,Aiden Li,Hongzhao Tan,Stevie Foglia,Aimee Nelson,Zhen Gao
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 15 pages, 7 figures
Abstract:Chronic neck pain is a leading cause of disability worldwide, and current treatment selection remains largely trial and error. We present a machine learning framework that uses electroencephalography to predict treatment efficacy in patients with chronic neck pain, with the goal of supporting individualized therapy and reducing the burden on healthcare systems. The framework centers on a rigorous data preprocessing stage tailored to the characteristics of each EEG recording type. For resting-state EEG, the preprocessing pipeline comprises baseline signal removal, bad channel identification and exclusion, re-referencing, bandpass and notch filtering, Independent Component Analysis, and power spectral density analysis. For motor execution and motor imagery recordings, the same initial steps are applied, after which signals are aligned to trigger events so that event-related desynchronization (ERD) and event-related synchronization (ERS) can be quantified. Synchronously recorded electromyography data are bandpass filtered and smoothed with a moving average, then correlated with the corresponding EEG channels to characterize the EEG EMG relationship during attempted movement. In parallel, we performed an extensive literature review of machine learning models applied to clinical EEG (763 records initially screened, 16 patient and 47 healthy-control studies retained), to inform the post-processing strategy. Through this combined preprocessing and review effort, we aim to develop a robust predictive model that can support personalized healthcare strategies in chronic pain management.
机器学习
[LG-0] A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
链接: https://arxiv.org/abs/2605.18750
作者: Ruitao Liu(1),Xinyang Tian(1),Shuo Chen(1),Tingrui Zhang(1),Guang Yang(1),Alan Zhao(2),Wei Xu(1) ((1) Tsinghua University, (2) Scitix AI)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 29 pages, including appendices
Abstract:Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77 \times speedup on language-only workloads and up to 2.77 \times on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84 \times while preserving training correctness. Comments: 29 pages, including appendices Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2605.18750 [cs.DC] (or arXiv:2605.18750v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2605.18750 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-1] Learning Normal Representations for Blood Biomarkers
链接: https://arxiv.org/abs/2605.18701
作者: Aashna P. Shah,Michelle M. Li,Yash Lal,Seffi Cohen,Liat F. Antwarg,Morgan Sanchez,James A. Diao,Chirag J. Patel,Ben Y. Reis,Ran D. Balicer,Noa Dagan,Arjun K. Manrai
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Blood-based biomarkers underpin clinical diagnosis and management, yet their interpretation relies largely on fixed population reference intervals that ignore stable, intra-patient variability. As such, population-based interpretation can mask meaningful deviation from an individual’s baseline, risking delayed disease detection. To remedy this, there have been increasing efforts to personalize blood biomarker interpretation using individual testing histories. However, these methods may overfit to sparse data, inflating false-positive rates and unnecessary follow-up, and can also unwittingly include unrecognized or subclinical disease. Here, we leverage nearly 2 billion longitudinal laboratory measurements from over 1.6 million individuals across North America, the Middle East, and East Asia, to show that while laboratory values are highly individual, purely personalized intervals routinely overfit, classifying up to 68% of measurements as abnormal, without corresponding associations with adverse clinical outcomes. We then introduce NORMA, a conditional transformer-based framework that generates reference intervals by conditioning on both a patient’s history and population-level data about “normal” variation. NORMA-derived intervals achieve higher precision for predicting outcomes, including mortality, acute kidney injury, and chronic disease. These findings caution against over-personalization in laboratory medicine and demonstrate that anchoring individual trajectories to population-level priors outperforms either approach alone. To promote transparency, we publicly release the model, code, and an interactive user interface for accessible, individualized laboratory interpretation.
[LG-2] A No-Defense Defense Against Gradient-Based Adversarial Attacks on ML-NIDS: Is Less More?
链接: https://arxiv.org/abs/2605.18666
作者: Mohamed elShehaby,Ashraf Matrawy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Gradient-based adversarial attacks subtly manipulate inputs of Machine Learning (ML) models to induce incorrect predictions. This paper investigates whether careful architectural choices alone can yield an inherently robust Deep Neural Network (DNN)-based Network Intrusion Detection Systems (NIDS), without any additional explicit defenses. Through thousands of experiments, around 2200, varying network depth, feature dimensionality, activation functions, and dropout across FGSM, PGD, and BIM attacks, we show that shallower networks, reduced feature sets, and ReLU activation consistently and jointly reduce adversarial vulnerability. Moreover, a simple model following this recipe outperforms deeper, fully-featured adversarially trained models, while maintaining near-perfect clean-traffic detection and lower training times. Nevertheless, while less is more, the selection of the right less is what truly matters.
[LG-3] Efficient and Noise-Tolerant PAC Learning of Multiclass Linear Classifiers
链接: https://arxiv.org/abs/2605.18662
作者: Rita Adhikari,Shiwei Zeng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Noise-tolerant PAC learning of linear models has been of central interests in machine learning community since the last century. In recent years, many computationally-efficient algorithms have been proposed for the problem of learning linear threshold functions under multiple noise models. Yet, when the problem is considered under multiclass learning settings, i.e. when the number of classes k is at least 3 , it is unknown whether there exist computationally-efficient PAC learning algorithms when the data sets are maliciously corrupted. In this paper, we consider that the marginal distribution is a mixture of bounded variance distributions and the data sets satisfy a margin condition at the same time. We show that there exists a computationally-efficient algorithm that PAC learns multiclass linear classifiers \h_w:x\mapsto \arg\max_y\in[k]w_y\cdot x, x\in \mathbbR^d, w\in\mathbbR^kd\ using at most O(k^2\cdot (d\log d+\log k)) samples even under a constant rate of nasty noise. Our algorithm consists of two main ingredients: a cluster-based pruning scheme and a standard multiclass hinge loss minimization program. Even in the special case of binary setting, i.e. k=2 , our result is strictly stronger than all prior works.
[LG-4] Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
链接: https://arxiv.org/abs/2605.18629
作者: Michał Brzozowski,Neo Christopher Chung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings where a large fraction of features are never activated and are unstable. Despite variants of SAEs that attempt to mitigate these issues, they require additional data, resampling, or training. We propose the \textbfaligned training, a parameter-free reparameterization of SAEs that simultaneously improves reconstruction quality, eliminates dead features, and significantly enhances stability across training seeds. Our approach is motivated by an overlooked observation that SAE feature quality, measured by the inner product between encoder and decoder directions (which we call the \textbfalignment score), follows a bimodal distribution across all modern architectures. The proposed aligned training enforces a geometric constraint between the encoder and decoder such that their inner product equals one for every feature, which removes a source of degeneracy in the SAE training without adding any hyperparameters. Across multiple models, dictionary sizes, and sparsity levels, the aligned training shows Pareto improvements on the SAEBench benchmarks. Beyond improving dead features, stability and reconstruction, our method readily integrates with techniques in mechanical interpretability such as Top/BatchTop-K architectures and p-Annealing. Overall, the aligned training substantially improves feature quality and stability of SAE without computational complexity or cost.
[LG-5] Learning to Look Benign: Targeted Evasion of Malware Detectors via API Import Injection
链接: https://arxiv.org/abs/2605.18624
作者: Juozas Dautartas,Olga Kurasova,Juozapas Rokas Čypas,Viktor Medvedev
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning-based malware detectors are widely deployed in antivirus and endpoint detection systems, yet their reliance on static features makes them vulnerable to adversarial manipulation. This paper investigates whether a malware sample can be intentionally misclassified as a specific benign software category, not merely as “not malware”, by adding a small number of Win32 API imports characteristic of that selected category, without removing any existing imports or retraining the detector. We propose a framework centered on a Conditional Variational Autoencoder (CVAE) whose decoder is strictly additive. It can introduce new API calls but never remove existing ones, preserving malware functionality by design. For each malware sample, the framework automatically identifies which benign category it most closely resembles and uses that as the evasion target. A knowledge-distilled differentiable proxy enables gradient-based training against the non-differentiable ensemble detector. Experiments on a six-class dataset of binary Win32 API import vectors extracted from 3,799 Windows executables (five benign categories, one malware class) show that, against a detector achieving 87.5% malware recall, adding just 20 API imports reduces recall to 30%. At k=20, among samples that evaded detection, 99% are classified as the intended target category. The CVAE outperforms both a frequency-based baseline and random selection at every tested injection size (k = 5 to 50). Validation on real PE files submitted to VirusTotal confirms that the attack transfers to commercial static detection engines, with an average 54.5% reduction in flagging engines. These findings expose a concrete vulnerability in API-based malware classifiers and demonstrate that targeted evasion into a chosen benign category is achievable with minimal, functionality-preserving modifications.
[LG-6] An Approximation Algorithm for Graph Label Selection ICML2026
链接: https://arxiv.org/abs/2605.18623
作者: Josia John,Maximilian Probst Gutenberg,Simon Meierhans
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted at ICML 2026. 9 pages, 7 figures
Abstract:In the graph label selection problem, one is given an n -vertex graph and a budget k , and seeks to select k vertices whose labels enable accurate prediction of the labels on the remaining vertices. This problem formalizes distilling a small representative set from the whole graph. We present the first \tildeO(\log^1.5 n) -approximation algorithm for graph label selection under the standard budget constraint. Prior work either relies on resource augmentation, allowing substantially more than k labeled vertices, or consists primarily of heuristics without provable guarantees. Finally, we demonstrate that practical heuristic variants of our algorithm scale to significantly larger graphs than previous methods, while essentially retaining their quality.
[LG-7] Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration
链接: https://arxiv.org/abs/2605.18609
作者: Sachin Garg,Michał Dereziński
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak’s heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal assumptions on the stochastic noise. In particular, we show that acceleration from classical momentum is directly proportional to the gradient mini-batch size (up to a natural saturation point), thereby enabling perfect parallelization of mini-batch computations. Our theory also provides a simple choice for the momentum parameter, which is shown to be effective empirically.
[LG-8] Physics-Aligned Canonical Equivariant Fourier Neural Operator under Symmetry-Induced Shifts
链接: https://arxiv.org/abs/2605.18606
作者: Jiaxiao Xu,Changhong Mou,Yeyu Zhang,Fengxiang He
类目: Machine Learning (cs.LG)
*备注: 36 pages, 14 figures, 10 tables
Abstract:Neural operators approximate PDE solution maps, but they need not respect the symmetries of the governing equation. In out-of-distribution (OOD) regimes, a standard neural operator must often learn coordinate alignment and physical evolution within a single map, which can hurt generalization. We use known continuous symmetries of evolution equations on periodic domains to separate these two roles. We propose the Physics-Aligned Canonical Equivariant Fourier Neural Operator (PACE-FNO), which estimates the input frame with a Lie-algebra coordinate estimator, maps the field to a reference frame, applies a standard Fourier Neural Operator (FNO), and restores the prediction to the target frame. We train alignment and operator prediction jointly using bounded symmetry perturbations, with an optional low-dimensional refinement step that updates the estimated frame at inference. Equivariance is enforced by the input and output transformations, while the FNO architecture remains unchanged. Across 1-D and 2-D Burgers, shallow-water, and Navier-Stokes equations on periodic domains, PACE-FNO matches the in-distribution (ID) accuracy of standard neural operators and reduces out-of-distribution (OOD) relative error by up to 12x over FNO with symmetry augmentation (FNO+Aug) under translations and Galilean shifts, with smaller gains for coupled rotation-translation shifts. Ablations show that aligning the input and restoring the output frame account for most OOD gains; inference-time refinement provides a smaller correction.
[LG-9] Pointwise Generalization in Deep Neural Networks
链接: https://arxiv.org/abs/2605.18598
作者: Shaojie Li,Yunbei Xu
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Functional Analysis (math.FA); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:We address the fundamental question of why deep neural networks generalize by establishing a pointwise generalization theory for fully connected networks. This framework resolves long-standing barriers to characterizing the rich nonlinear feature-learning regime and builds a new statistical foundation for representation learning. For each trained model, we characterize the hypothesis via a pointwise Riemannian Dimension, derived from the eigenvalues of the learned feature representations across layers. This establishes a principled framework for deriving hypothesis-dependent, representation-aware generalization bounds. These bounds offer a systematic upgrade over approaches based on model size, products of norms, and infinite-width linearizations, yielding guarantees that are orders of magnitude tighter in both theory and experiment. Analytically, we identify the structural properties and mathematical principles that explain the tractability of deep networks. Empirically, the pointwise Riemannian Dimension exhibits substantial feature compression, decreases with increased over-parameterization, and captures the implicit bias of optimizers. Taken together, our results indicate that deep networks are mathematically tractable in practical regimes and that their generalization is sharply explained by pointwise, feature-spectrum-aware complexity.
[LG-10] S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs
链接: https://arxiv.org/abs/2605.18579
作者: Yuhan Wang,Haopeng Zhang,Yibo Ding,Jiaqi Yu,Xinyu Zhao,Yuhang Liu,Ziwei Zhang,Xiao Wang,Ruijie Wang
类目: Machine Learning (cs.LG)
*备注: 19 pages
Abstract:Pre-training on text-attributed graphs (TAGs) is central to building transferable graph foundation models, where LLM-as-Aligner methods align graph and text representations through the semantic knowledge of large language models. However, these methods usually assume that node texts provide sufficient and reliable supervision, an assumption often violated in real-world sparse TAGs. When textual anchors are missing, noisy, or uneven across domains, graph structures must be aligned with weak semantic evidence, leading to unreliable structure-semantics correspondence and sparsity-induced transfer bias. This paper presents S2Aligner, a sparsity-aware and structure-enhanced LLM-as-Aligner framework for graph-text pre-training on sparse TAGs. The key idea is to decouple semantic alignment from structural modeling, allowing topology-aware signals to enhance alignment without contaminating the shared semantic space. Specifically, S2Aligner decomposes graph-text representations into semantic and structural components, uses structure-oriented reconstruction with consistency control to inject reliable topology cues into text representations, and suppresses inconsistent structural signals under textual sparsity. Moreover, S2Aligner introduces sparsity-aware cross-domain risk balancing, which calibrates domain risks through a global-domain density ratio and downweights unreliable sparse samples via graph reliability estimation. Theoretical analysis shows that this objective reduces cross-domain generalization gaps by controlling domain risk discrepancy. Extensive experiments across diverse graph domains, sparsity levels, and downstream tasks demonstrate that S2Aligner consistently outperforms existing baselines.
[LG-11] scHelix: Asymmetric Dual-Stream Integration via Explicit Gene-Level Disentanglement KDD26
链接: https://arxiv.org/abs/2605.18576
作者: Xichen Yan,Zelin Zang,Changxi Chi,Jingbo Zhou,Chang Yu,Jinlin Wu,Shenghui Cheng,Fuji Yang,Jiebo Luo,Zhen Lei,Stan Z. Li
类目: Machine Learning (cs.LG)
*备注: 17 pages, 8 figures, accepted by KDD 26
Abstract:A critical challenge in single-cell RNA sequencing (scRNA-seq) integration is resolving the tension between eliminating batch effects and maintaining biological fidelity. While recent evidence indicates that batch effects manifest heterogeneously across genes, most existing methods process the transcriptome uniformly, frequently resulting in over-correction and loss of subtle biological signals. To address this, we present scHelix, a dataset-adaptive framework that fundamentally changes how features are processed by explicitly partitioning genes into domain-invariant Anchors and domain-sensitive Variants at the input level. scHelix utilizes a dual-stream sparse diffusion encoder equipped with stop-gradient graph caching to efficiently learn multi-scale structural representations. The core of our approach is a novel asymmetric Align-Refine-Fuse protocol: the unstable Variant stream is first aligned to the robust topology of the Anchor stream, followed by a conservative refinement phase where the Anchor stream absorbs denoised details via bounded residual gating. This divide-and-conquer architecture prevents shortcut learning and ensures robust batch removal without compromising the integrity of biological clusters. Extensive benchmarking demonstrates that scHelix outperforms state-of-the-art methods.
[LG-12] Self-supervised local learning rules learn the hidden hierarchical structure of high-dimensional data
链接: https://arxiv.org/abs/2605.18557
作者: Ariane Delrocq,Wu S. Zihan,Guillaume Bellec,Wulfram Gerstner
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:The brain learns abstract representations of high-dimensional sensory input, but the plasticity rules that enable such learning are unknown. We study biologically plausible algorithms on the Random Hierarchy Model (RHM), an artificial dataset designed to investigate how deep neural networks learn the intrinsic hierarchical structure of high-dimensional data. We focus on two types of local learning rules that avoid both a long convergence time and the use of a symmetric error network. The first type uses direct feedback signals to approximate error propagation from the output layer. The second type uses layerwise self-supervised contrastive or non-contrastive loss functions that do not explicitly approximate errors at the output layer. We show that all rules of the first type fail to solve the tasks of the RHM and trace this failure back to input-specific nonlinearities (`masking’) that are implemented in full backpropagation and are essential for learning complex tasks. However, algorithms of the second type are able to learn the hierarchical hidden structure of the RHM tasks and are as data-efficient as supervised backpropagation training, while being compatible with known rules of synaptic plasticity in cortex.
[LG-13] Federated Martingale Posterior Samping
链接: https://arxiv.org/abs/2605.18554
作者: Boning Zhang,Matteo Zecchin,Mingzhao Guo,Dongzhu Liu,Osvaldo Simeone
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 5 pages
Abstract:Federated Bayesian neural networks require fixing a prior on the model parameters together with a likelihood. Eliciting meaningful priors on the weight space of modern overparameterized models is notoriously difficult, and misspecification of either component can severely degrade accuracy and calibration. Motivated by the rapid progress of predictive models such as large language models, the martingale posterior, also known as predictive Bayes, replaces the prior–likelihood pair with a predictive distribution and recovers parameter uncertainty by repeatedly drawing predictive samples and refitting the model. A direct federated implementation, however, would require clients to share the local data sets. This letter proposes federated martingale posterior (FMP) sampling, a one-shot embarrassingly parallel protocol in which each client uploads a small set of trainable data embeddings and the server runs the predictive sampler centrally. Experiments on MNIST, CIFAR-10, and CIFAR-100 show that FMP closely matches the centralized counterpart and significantly improves calibration over consensus-style baselines.
[LG-14] Protein Fold Classification at Scale: Benchmarking and Pretraining ICML2026
链接: https://arxiv.org/abs/2605.18552
作者: Dexiong Chen,Andrei Manolache,Mathias Niepert,Karsten Borgwardt
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
*备注: Accepted at ICML 2026 (spotlight)
Abstract:Classifying protein topology is essential for deciphering biological function, but progress is held back by the lack of large-scale benchmarks that avoid duplicates and by models that do not scale well. We introduce TEDBench, a large-scale, non-redundant benchmark for protein fold classification constructed from the Encyclopedia of Domains (TED) and Foldseek-clustered AlphaFold structures. We show that on TEDBench, current protein representation learning methods either require very large models or fail to deliver strong performance. To address this challenge, we propose Masked Invariant Autoencoders (MiAE), a self-supervised framework for protein structure representation learning. MiAE uses an extremely high masking ratio of up to 90% with an \mathrmSE(3) -invariant encoder and a lightweight decoder that reconstructs backbone coordinates from the latent representation and mask tokens. MiAE scales well and outperforms supervised counterparts and state-of-the-art baselines on TEDBench, establishing a strong recipe for protein fold classification. To test transfer beyond AlphaFold structures, we further benchmark on a curated dataset from experimental structures of CATH v4.4. TEDBench is available at this https URL.
[LG-15] XCTFormer: Leverag ing Cross-Channel and Cross-Time Dependencies for Enhanced Time-Series Analysis
链接: https://arxiv.org/abs/2605.18534
作者: Israel Zexer,Omri Azencot
类目: Machine Learning (cs.LG)
*备注: TMLR 2026
Abstract:Multivariate time-series analysis involves extracting informative representations from sequences of multiple interdependent variables, supporting tasks such as forecasting, imputation, and anomaly detection. In real-world scenarios, these variables are typically collected from a shared context or underlying phenomenon, suggesting the presence of latent dependencies across time and channels that can be leveraged to improve performance. However, recent findings show that channel-independent (CI) models, which assume no inter-variable dependencies, often outperform channel-dependent (CD) models that explicitly model such relationships. This surprising result indicates that current CD models may not fully exploit their potential due to limitations in how dependencies are captured. Recent studies have revisited channel dependence modeling with various approaches; however, these methods often employ indirect modeling strategies, which can lead to meaningful dependencies being overlooked. To address this issue, we introduce XCTFormer, a transformer-based channel-dependent (CD) model that explicitly captures cross-temporal and cross-channel dependencies via an enhanced attention mechanism. The model operates in a token-to-token fashion, modeling pairwise dependencies between every pair of tokens across time and channels. The architecture comprises (i) a data processing module, (ii) a novel Cross-Relational Attention Block (CRAB) that increases capacity and expressiveness, and (iii) an optional Dependency Compression Plugin (DeCoP) that improves scalability. Through extensive experiments on three time-series benchmarks, we show that XCTFormer achieves strong results compared to widely recognized baselines; in particular, it attains state-of-the-art performance on the imputation task, outperforming the second-best method by an average of 20.8% in MSE and 15.3% in MAE.
[LG-16] Offline Contextual Bandits in the Presence of New Actions
链接: https://arxiv.org/abs/2605.18509
作者: Ren Kishimoto,Tatsuhiro Shimizu,Kazuki Kawamura,Takanori Muroi,Yusuke Narita,Yuki Sasamoto,Kei Tateno,Takuma Udagawa,Yuta Saito
类目: Machine Learning (cs.LG)
*备注: 12pages, 7 figures
Abstract:Automated decision-making algorithms drive applications such as recommendation systems and search engines. These algorithms often rely on off-policy contextual bandits or off-policy learning (OPL). Conventionally, OPL selects actions that maximize the expected reward from an existing action set. However, in many real-world scenarios, actions, such as news articles or video content, change continuously, and the action space evolves over time after data collection. We define actions introduced after deploying the logging policy as new actions and focus on OPL with new actions. Existing OPL methods identify optimal actions from the existing set effectively but cannot learn and select new actions because no relevant data are logged. To address this limitation, we propose a new OPL method that leverages action features. We first introduce the Local Combination PseudoInverse (LCPI) estimator for the policy gradient, generalizing the PseudoInverse estimator initially proposed for off-policy evaluation of slate bandits. LCPI controls the trade-off between reward-modeling condition and the condition for data collection regarding the action features, capturing the interaction effects among different dimensions of action features. Furthermore, we propose a generalized algorithm called Policy Optimization for Effective New Actions (PONA), which integrates LCPI, a component specialized for new action selection, with Doubly Robust (DR), which excels at learning within existing actions. We define PONA as a weighted sum of the LCPI and DR estimators, optimizing both the selection of existing and new actions, and allowing the proportion of new action selections to be adjusted by the weight parameter. Through extensive experiments, we demonstrate that PONA efficiently selects new actions while maintaining the overall policy performance as opposed to most existing methods that cannot select new actions.
[LG-17] Adaptive Experimentation for Censored Survival Outcomes
链接: https://arxiv.org/abs/2605.18459
作者: Yuxin Wang,Dennis Frauen,Jonas Schweisthal,Maresa Schröder,Emil Javurek,Stefan Feuerriegel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Adaptive experimentation enables efficient estimation of causal effects, but existing methods are not designed for survival data with censoring, where event times are only partially observed (e.g., overall survival in cancer trials but with dropout). In this paper, we develop a novel framework for adaptive experimentation to estimate causal effects under right censoring. For this, we derive the semiparametric efficiency bound for the average survival effect curve as a function of the treatment allocation policy and thereby obtain a closed-form efficiency-optimal allocation policy. The policy generalizes classical Neyman allocation to survival settings by prioritizing patient strata where both event and censoring dynamics induce high uncertainty. Building on this, we propose the Adaptive Survival Estimator (ASE), an adaptive framework that learns the allocation policy and estimates the average survival effect curve sequentially. Our framework has three main benefits: (i) it accommodates arbitrary machine learning models for nuisance estimation; (ii) it is guided by a closed-form efficiency-optimal allocation policy; and (iii) it admits strong theoretical guarantees, including asymptotic normality via a martingale central limit theorem. We demonstrate our framework across various numerical experiments to show consistent efficiency gains over uniform randomization and censoring-agnostic baselines.
[LG-18] Heterogeneous Tasks Offloading in Vehicular Edge Computing: A Federated Meta Deep Reinforcement Learning Approach
链接: https://arxiv.org/abs/2605.18437
作者: Yaorong Huang,Jingtao Luo,Xuechao Wang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Vehicular edge computing (VEC) enables latency-sensitive vehicular applications by offloading computation-intensive tasks to nearby edge servers. However, real-world vehicular workloads are typically modeled as heterogeneous directed acyclic graph (DAG) tasks with complex dependency structures, making joint offloading and resource allocation highly challenging. Moreover, distributed MEC deployment raises privacy concerns when collaboratively training learning-based policies. In this paper, we propose a Federated Meta Deep Reinforcement Learning framework with GAT-Seq2Seq modeling (FedMAGS) for heterogeneous task offloading in VEC systems. The proposed approach leverages Graph Attention Networks to capture DAG dependencies, a Seq2Seq-based policy to generate structured offloading decisions, and federated meta-learning to enable fast adaptation across distributed MEC servers without sharing raw data. Extensive simulations demonstrate that FedMAGS achieves faster convergence, lower execution delay, and better scalability compared with state-of-the-art baselines. In addition, the federated design preserves data privacy while reducing communication overhead, making the framework well suited for dynamic and large-scale VEC environments. Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2605.18437 [cs.LG] (or arXiv:2605.18437v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.18437 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-19] xt2CAD-Bench: A Benchmark for LLM -based Text-to-Parametric CAD Generation
链接: https://arxiv.org/abs/2605.18430
作者: Liang Wang,Heng Meng,Zekai Xiang,Jin Liu,Pingyi Zhou,Litao Chen,Yongqiang Tang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Text-to-CAD generation aims to create parametric CAD models from natural language, enabling rapid prototyping and intuitive design workflows. However, existing benchmarks focus on basic primitives and simple sketch-extrude sequences, lacking advanced features essential for real-world applications and covering only traditional mechanical parts. We introduce Text2CAD-Bench, the first benchmark systematically evaluating text-to-CAD across geometric complexity and application diversity. Our benchmark comprises 600 human-curated examples spanning four levels: L1-L2 cover fundamental geometry with standard features, L3 introduces complex topology and freeform surfaces, and L4 extends to real-world domains beyond mechanical parts. Each example pairs dual-style prompts – geometric descriptions mimicking non-expert users, and procedural sequences aligned with expert-level conventions. Evaluating mainstream general LLMs and domain-specific models, we find that current models perform reasonably on basic geometry but degrade substantially on complex topology and advanced features. We release our benchmark to drive progress in text-to-CAD research.
[LG-20] Generative Adversarial Learning from Deterministic Processes
链接: https://arxiv.org/abs/2605.18425
作者: Joris C. Kühl,Hanno Gottschalk
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 37 pages, 3 figures
Abstract:Physical AI is being successfully applied to data which does not follow the traditional paradigm of independent and identically distributed (i.i.d.) samples. In fact, physical AI is often trained on data which is not random at all, and is instead derived from chaotic dynamical systems like turbulence. We aim to explain the empirical success of these methods using the example of generative adversarial networks (GANs), whose statistical learning theory under the i.i.d. assumption is generally well understood. We prove that it is possible, using an infinite-dimensional model of generative adversarial learning (GAL), to learn the invariant distribution of a sufficiently chaotic dynamical system from a single deterministically evolving time series of its states or measurements thereof, and give explicit rates for the convergence to the solution in terms of the Jensen-Shannon divergence. Comments: 37 pages, 3 figures Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST) MSC classes: 62G20, 68T05 (Primary) 37D25 (Secondary) Cite as: arXiv:2605.18425 [cs.LG] (or arXiv:2605.18425v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.18425 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-21] Spherical Harmonic Optimal Transport: Application to Climate Models Comparisons
链接: https://arxiv.org/abs/2605.18389
作者: Pierre Houédry,Iskander Legheraba,Léo Buecher,Nicolas Courty
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Optimal transport provides a powerful framework for comparing measures while respecting the geometry of their support, but comes with an expensive computational cost, hindering its potential application to real world use cases. On manifolds, convolutional algorithms based on the heat kernel have been proposed to alleviate this cost, but their theoretical properties remain largely unexplored. We establish that the heat kernel cost converges to the optimal transport cost as time vanishes in the balanced and unbalanced cases. In the specific case of the 2-sphere \mathbbS^2 , we ensure that the associated Sinkhorn divergences retains the desirable geometric and analytic properties of classical optimal transport discrepancies. Moreover, we leverage the harmonic structure of the sphere to derive a fast Sinkhorn algorithm, requiring only \mathcalO(n) memory and \mathcalO(n^3/2) time per iteration, with fully dense GPU-friendly operations. We validate its computational efficiency on synthetic data, and discuss its potential use in the evaluation of global climate models, providing both spatial and seasonal insights into models performances.
[LG-22] abH2O: A Unified Foundation Model for Tabular Prediction
链接: https://arxiv.org/abs/2605.18383
作者: Pascal Pfeiffer,Dmitry Gordeev,Mathias Müller,Laura Fink,Joan Salvà Soler,Mark Landry,Branden Murray,Marcos V. Conde,Sri Satish Ambati
类目: Machine Learning (cs.LG)
*备注: Technical Report - this https URL
Abstract:We present TabH2O, a foundation model for tabular data that performs classification and regression in a single forward pass via in-context learning. TabH2O builds on the TabICL architecture with several key modifications: (1) unified training, a single model handles both classification and regression via a dual-head architecture, eliminating the need for separate models and reducing total pretraining cost; (2) single-stage pretraining, training stability improvements (bounded scalable softmax, inter-stage normalization, learnable residual scaling, logit soft-capping) eliminate the need for multi-stage curriculum learning, enabling training with full-length sequences from the start; and (3) noise-aware pretraining, synthetic datasets include explicit noise dimensions to teach the model robustness to irrelevant features. We evaluate TabH2O v1 (29.2M parameters) on the TALENT benchmark (300 datasets), where it achieves an average rank of 2.55 out of 6 evaluated methods, outperforming tuned CatBoost (4.07), H2O AutoML (4.18), and LightGBM (5.08), competitive with TabPFN v2.6 (2.74), and behind TabICL v2 (2.12), while placing in the top-3 on 81% of the testing datasets across classification and regression tasks.
[LG-23] Generating Physically Consistent Molecules with Energy-Based Models
链接: https://arxiv.org/abs/2605.18381
作者: Christoph Griesbacher,Lea Bogensperger,Andreas Habring,Thomas Pock
类目: Machine Learning (cs.LG)
*备注:
Abstract:Molecules in equilibrium follow a Boltzmann distribution, making the underlying energy landscape a physically grounded modeling objective. However, such landscapes are difficult to learn from data and, once learned, hard to sample from. Diffusion and flow-matching models sidestep these difficulties by learning a time-conditional score or transport field between noise and data, losing the energy inductive bias in exchange for a more tractable training objective. We introduce EBMol, an energy-based model (EBM) that restores this inductive bias by learning an atom-additive scalar potential without explicit simulation during training. Our method employs a flow-inspired Restoring Field Matching objective to approximate the energy landscape. We adopt the Mirror-Langevin algorithm for sampling, enabling unified updates of atomic positions and types, and incorporate parallel tempering for inference-time compute scaling. EBMol is the first EBM for 3D molecular generation to achieve state-of-the-art performance on QM9 and GEOM-Drugs. Moreover, we show that the learned energy landscape serves as a principled quality metric for ranking and filtering configurations, and demonstrate controllable generation without retraining through shape-steered sampling via potential composition and zero-shot linker design.
[LG-24] Beyond Square Roots: Explicit Memory-Efficient Factorization for Multi-Epoch Private Learning
链接: https://arxiv.org/abs/2605.18379
作者: Nikita P. Kalinin,Aki Rehn,Joel Daniel Andersson,Antti Honkela,Christoph H. Lampert
类目: Machine Learning (cs.LG)
*备注:
Abstract:Correlated-noise mechanisms are among the most promising approaches for improving the utility of differentially private model training, but rigorous guarantees require explicit, analyzable factorizations, and practical deployment requires memory efficiency. Recent works have developed banded inverse factorizations, which address both requirements by exploiting a banded structure in the correlation matrix. The bandwidth controls the size of the noise buffer used to correlate noise across iterations, and thus governs the tradeoff between utility and memory cost. Existing factorizations highlight this tradeoff: DP- \lambda CGD achieves high memory efficiency by using only a one-step noise buffer, but this limits its utility gains, while the banded inverse square root (BISR) factorization exploits larger correlation windows and is asymptotically optimal for large bandwidths but performs poorly at low bandwidths. We propose \gamma -BIFR, a unified generalization of both factorizations. In the low-memory, low-bandwidth regime, \gamma -BIFR significantly improves RMSE, amplified RMSE, and private training performance, while yielding tighter theoretical guarantees for multi-participation error in multi-epoch training.
[LG-25] Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control ICRA
链接: https://arxiv.org/abs/2605.18373
作者: Edoardo Caldarelli,Franco Coltraro,Adrià Colomé,Lorenzo Rosasco,Carme Torras
类目: Robotics (cs.RO); Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: Accepted for presentation at the 2026 IEEE International Conference on Robotics and Automation (ICRA)
Abstract:Robotic cloth folding is a challenging task, particularly when considering dynamic folding tasks, which aim at folding cloth by fast motions that leverage its dynamics. When subject to such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, resulting in a difficult simulation-to-reality transfer when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. Both in simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.
[LG-26] Proximal basin hopping: global optimization with guarantees
链接: https://arxiv.org/abs/2605.18364
作者: Guillaume Lauga(LJAD),Cesare Molinari(MaLGA),Samuel Vaiter(CNRS,LJAD)
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Global optimization is a challenging problem, with plenty of algorithms displaying empirical success, but scarce theoretical backing. In this work, we propose a new theoretical framework called Proximal Basin Hopping (PBH), carefully tailored to combine proximal optimization and local minimization. We use it to construct a practical algorithm that converges to the global minimizer with high probability, when using a finite amount of samples. Proximal Basin Hopping outperforms well known algorithms with theoretical backing on standard synthetic hard functions, and real problems such as fitting scaling laws for deep learning. Furthermore, the higher the dimension, the better the performance gap.
[LG-27] Decoupled Conformal Optimisation: Efficient Prediction Sets via Independent Tuning and Calibration ICML2026
链接: https://arxiv.org/abs/2605.18354
作者: Fanyi Wu,Lihua Niu,Samuel Kaski,Michele Caprio
类目: Machine Learning (cs.LG)
*备注: 33 pages, 6 figures, accepted by ICML 2026 Workshop: Epistemic Intelligence in Machine Learning
Abstract:Bayesian conformal optimisation methods often use the same held-out data both to search for efficient prediction sets and to certify coverage or risk. This coupling is natural for high-probability risk-control guarantees, but it is not necessary when the target is standard finite-sample marginal conformal coverage. We propose Decoupled Conformal Optimisation (DCO), a train-tune-calibrate design principle that uses an independent tuning split for efficiency-oriented structural selection and a fresh calibration split for the final conformal quantile. Conditional on the tuned structure, standard split-conformal exchangeability yields finite-sample marginal coverage for any candidate class, without a confidence parameter or multiple-testing correction. DCO therefore targets a different finite-sample guarantee from PAC-style methods: marginal conformal coverage rather than high-probability risk control. Under consistency assumptions on the coupled risk bound, the two approaches nevertheless converge to the same population threshold. Across classification and regression benchmarks, including ImageNet-A, CIFAR-100, Diabetes, California Housing, and Concrete, DCO tracks the nominal coverage level closely while often reducing average prediction-set size or interval width relative to PAC-style calibration. On ImageNet-A, for example, the average set size decreases from 26.52 to 25.26 and the 95th-percentile set size from 58.95 to 53.73 ; on Diabetes, the average interval width decreases from 2.098 to 1.914 .
[LG-28] Prune Update and Trim: Robust Structured Pruning for Large Language Models
链接: https://arxiv.org/abs/2605.18331
作者: Diego Coello de Portugal Mecke,Tom Hanika,Lars Schmidth-Thieme
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) have experienced significant growth and development in recent years. However, performing inference on LLMs remains costly, especially for long-context inference or in resource-constrained devices. This motivates the development of new post-training pruning (PTP) methods. These methods reduce LLMs’ requirements by removing a substantial part of the model’s parameters. The discarded weights are selected depending on their impact on the models performance. Current PTP methods prune the models by removing the less informative hidden nodes from the FFN layers, and the least important attention layers. We propose Putri, a PTP method that introduces three changes to the State- of-the-art. First, we update the un-pruned weights of the FFN to compensate for the introduced pruning error. Second, the FFN layers are pruned sequentially, taking into account the updates done to the previous layers. Third, instead of removing full attention layers, we remove individual attention-heads. We extend this method such that it can also address Grouped-Query Attention. In summary, Putri is a structure pruning method which remains simple while showing SOTA performance. Pruning experiments on multiple models with a wide variety of sparsity ranges and on different datasets, validate the generality of Putri. Notably, we demonstrate that, unlike previous methods, Putri can prune LLMs on extreme sparsity ratios. The code is available at: this https URL.
[LG-29] he Symmetries of Three-Layer ReLU Networks
链接: https://arxiv.org/abs/2605.18319
作者: Johanna Marie Gegenfurtner,Moritz Grillo,Guido Montúfar
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Algebraic Geometry (math.AG); Combinatorics (math.CO)
*备注:
Abstract:We develop a framework for analyzing parameter symmetries in deep ReLU networks and obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. Our approach provides explicit semi-algebraic descriptions of these fibers and yields a polynomial time algorithm for deciding functional equivalence of two parameters. The symmetries include discrete and continuous transformations arising from layer composition, and depend on whether deeper layers hide or preserve geometric structure from preceding layers. Finally, we show that some of these symmetries induce local conservation laws along gradient flow, while others do not.
[LG-30] Dynamic Elliptical Graph Factor Models via Riemannian Optimization with Geodesic Temporal Regularization
链接: https://arxiv.org/abs/2605.18316
作者: Chuansen Peng,Xiaojing Shen
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:
Abstract:Inferring time-varying graph structures from high-dimensional nodal observations is a fundamental problem arising in neuroscience, finance, climatology, and beyond. Two intrinsic challenges govern this problem: maintaining the \emphtemporal coherence of the latent graph across successive observation windows, and respecting the \emphintrinsic Riemannian geometry of the symmetric positive definite manifold on which precision matrices naturally reside, a curved space whose geodesic structure departs fundamentally from that of the ambient Euclidean space. In this paper we propose dynamic estimation on the Grassmann manifold with a factor model (\textscDegfm), a novel algorithm that jointly addresses both challenges. We model the time-varying precision matrix sequence as a low-rank-plus-diagonal structure governed by a latent elliptical graph factor model, which drastically reduces the effective parameter count and enables reliable estimation in the challenging small-sample regime. Temporal coherence is enforced through a Riemannian geodesic penalty defined on the Grassmann manifold, ensuring that the estimated graph trajectory is smooth with respect to the intrinsic geometry rather than the ambient Euclidean space. To solve the resulting non-convex optimization problem over Grassmann-manifold-valued sequences subject to the LRaD constraint, we derive an efficient Riemannian gradient descent algorithm that respects the manifold structure at every iterate and rigorously establish its convergence to a stationary point. Extensive experiments on both synthetic benchmarks and real-world datasets demonstrate that \textscDegfm consistently outperforms state-of-the-art baselines across all evaluation metrics, confirming the practical effectiveness of the proposed framework.
[LG-31] mporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling
链接: https://arxiv.org/abs/2605.18281
作者: Afiq Abdillah Effiezal Aswadi,Oliver Britton,Ross Baker,Matthew Farrugia-Roberts
类目: Machine Learning (cs.LG)
*备注: Presented at Technical AI Safety Conference (TAIS), Oxford, May 2026. Code available at this https URL
Abstract:Modern deep learning science often assumes that neural networks learn from a fixed data distribution. However, many practically important learning problems involve data distributions that change throughout training. How does such non-stationarity impact the inductive biases of deep learning towards models with different structural, generalisation, and safety properties? A fruitful testbed for studying inductive bias is in-context linear regression sequence modelling, where small transformers display strikingly different generalisation patterns depending on the diversity of the (fixed) training task distribution. In this paper, we explore the effect of diversifying the task distribution across training time, finding that such temporal diversity leads to an increased bias towards generalisation over memorisation.
[LG-32] Attacking the First-Principle: A Black-Box Query-Free Targeted Mimicry Attack on Binary Function Classifiers
链接: https://arxiv.org/abs/2605.18231
作者: Gabriel Sauger(UL,CNRS,LORIA,Inria),Jean-Yves Marion(UL,CNRS,LORIA,Inria),Sazzadur Rahaman,Victor Matrat(CNRS,UL,LORIA,Inria),Vincent Tourneur(UL,CNRS,LORIA,Inria),Muaz Ali
类目: Machine Learning (cs.LG)
*备注:
Abstract:Binary function classifiers play a crucial role in maintaining the security and integrity of software systems by detecting malicious code and unauthorized modifications. However, machine learning-based classifiers are vulnerable to adversarial attacks that can evade detection. In this study, we present Kelpie, a novel framework for executing mimicry attacks, a stronger type of targeted evasion attacks, on binary function classifiers in a black-box, zero-query setting. Unlike previous approaches that rely on querying the target classifier to refine untargeted evasion attacks, Kelpie leverages code transformations that preserve the functionality of malicious payloads while causing them to be misclassified as we want. Through extensive experimentation, we demonstrate that Kelpie can successfully execute mimicry attacks against six state-of-the-art binary function classifiers representing different model architectures without requiring direct interaction with them. We further validate our approach with a practical demonstration, involving a keylogger and a wiper concealed within benign-looking functions embedded in an application. This work, to our best knowledge, is the first to demonstrate such a mimicry attack in a black-box, zero-query context, raising important questions about the reliability and security of existing machine learning-based binary function classifiers.
[LG-33] UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction
链接: https://arxiv.org/abs/2605.18188
作者: Robson W. S. Pessoa,Julien Amblard,Alessandra Russo,Idelfonso B.R. Nogueira
类目: Machine Learning (cs.LG)
*备注:
Abstract:Anomaly detection in batch processes is hindered by transient dynamics, scarce fault labels, and reliance on single-modality sensor data. This work introduces UTOPYA (Unified Temporal Observation for Physics-Informed Anomaly Detection and Time-Series Prediction), a 15.2M-parameter multimodal framework that jointly addresses anomaly detection, time-series prediction, and phase classification in batch distillation by fusing eight data modalities through Feature-wise Linear Modulation (FiLM) conditioned cross-modal attention and gated fusion. A physics-informed regularisation scheme introduced in this work enforces temporal smoothness and thermodynamic monotonicity, while curriculum learning introduces training samples in order of physical difficulty. On the 119-experiment multimodal batch distillation dataset of Arweiler et al. (2026), UTOPYA achieves a window-level test AUROC of 0.832 and 0.874 under multi-signal experiment-level scoring, substantially outperforming four external baselines (PCA, autoencoder, Isolation Forest, and LSTM autoencoder) evaluated under identical conditions (+0.147 window-level AUROC over the best baseline). A multimodal ablation over 15~architectural configurations shows that static context via FiLM conditioning is the key enabler, lifting experiment-level multi-signal AUROC by +0.145 over the unimodal baseline (0.729 to 0.874). Separately, a training ablation across 14 design choices reveals that several widely-adopted techniques, including instance normalisation, Mixup, ensembling, test-time augmentation, and stochastic weight averaging, fail to improve or actively degrade generalisation in this data-scarce setting. These negative results expose a fundamental tension between smoothing-based regularisation and anomaly detection, providing practical guidance for multimodal process monitoring deployment.
[LG-34] Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
链接: https://arxiv.org/abs/2605.18174
作者: Abdurakhmon Sadiev,Artavazd Maranjyan,Ivan Ilin,Peter Richtárik
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Muon has recently emerged as a strong alternative to AdamW for training neural networks, with encouraging large-scale pretraining results and growing evidence that matrix-structured updates can be faster in practice. Yet Muon, and more generally Linear Minimization Oracle (LMO) based methods, are typically used synchronously. This is problematic in heterogeneous distributed systems, where workers complete gradient computations at different speeds and synchronous training must repeatedly wait for slower workers. In this work, we introduce Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. Our method builds on the delay-thresholding idea of Ringmaster ASGD. For SGD-type methods, Ringmaster ASGD achieves optimal time complexity by discarding overly stale gradients. Ringmaster LMO extends this mechanism to general LMO-based updates. We establish convergence guarantees under generalized (L_0, L_1) -smoothness and further develop a parameter-agnostic variant with decreasing stepsizes and adaptive delay thresholds. Finally, we translate our iteration guarantees into time complexity bounds under heterogeneous worker computation times. In the classical Euclidean smooth setting, these bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratic problems and NanoChat language-model pretraining show that the advantages of Ringmaster LMO grow with system heterogeneity and that the method outperforms strong synchronous and asynchronous baselines.
[LG-35] Elastic-dLLM : Position Preserving Context Compression and Augmentation of Diffusion LLM s
链接: https://arxiv.org/abs/2605.18165
作者: Junyi Wu,Tianchen Zhao,Shaoqiu Zhang,Linfeng Zhang,Guohao Dai,Yu Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Unlike autoregressive models, which generate one token at a time, dLLMs denoise a chunk of [MASK] tokens jointly and sample one or more tokens per step; despite enabling parallel decoding, this process incurs substantial computational cost due to the large chunk size of masked tokens. We observe that much of this cost is spent on repeatedly processing the preceding context and many [MASK] tokens with the same feature representations, indicating considerable computational redundancy. In this work, we revisit dLLM’s redundancy from the perspective of [MASK] tokens. Through systematic analysis, we verify the redundancy of [MASK] tokens while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. By compressing redundant [MASK] computation, this approach accelerates decoding and further provides a natural extension toward context-folding-like long-context scaling under limited input-length constraints for full-sequence dLLMs such as LLaDA-8B-Instruct and LLaDA-1.5. Moreover, for block dLLMs such as LLaDA2.0-mini, it augments the context with a protected terminal [MASK] token to enhance generation quality with negligible overhead.
[LG-36] Foundation Models for Credit Risk Prediction: A Game Changer?
链接: https://arxiv.org/abs/2605.18147
作者: Bart Baesens,Andreas Goethals,Stefan Lessmann,Simon De Vos,Cristián Bravo,David Martens,Victor Medina-Olivares,Christophe Mues,Maria Oskarsdóttir,Seppe vanden Broucke,Tim Verdonck,Wouter Verbeke
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predictive models play a pivotal role in credit risk management, guiding critical decisions through accurate estimation of default probabilities and losses. Extensive research has introduced new modeling techniques, complemented by large-scale benchmarking studies consolidating the state-of-the-art. Today, quasi-standards such as gradient-boosting models paired with SHAP explainers have emerged, yet continuous improvement of risk models remains a top priority. Concurrently, rapid advancements in AI, most notably large language models, have disrupted predictive modeling paradigms. Foundation models, pretrained on extensive datasets from diverse domains, have demonstrated remarkable performance by leveraging prior knowledge. While prevalent in natural language processing and computer vision, foundation models for tabular data have only recently emerged. We conjecture that pretraining on out-of-domain data is particularly beneficial in small-data settings, such as SME lending or specialized corporate portfolios, and may help address longstanding challenges including low default portfolios and class imbalance. This paper benchmarks recently proposed tabular foundation models against a broad set of competitors, including established and advanced machine learning techniques, across two core tasks: PD and LGD modeling. Our evaluation encompasses various datasets, performance indicators, and experimental conditions. We find that tabular foundation models generally perform best across datasets and tasks. Moreover, they offer significant improvement in predictive performance as dataset size shrinks. These results are remarkable given that the models are tested out-of-the-box, without hyperparameter tuning, ensuring ease of use and mitigating computational costs.
[LG-37] pyforce-1.0.0: Python Framework for data-driven model Order Reduction of multi-physiCs problEms
链接: https://arxiv.org/abs/2605.18082
作者: Stefano Riva,Yantao Luo,Carolina Introini,Antonio Cammi
类目: Machine Learning (cs.LG)
*备注: Github Repo: this https URL
Abstract:pyforce is a Python package implementing Data-Driven Reduced Order Modelling techniques for applications to multi-physics problems, mainly set in the Nuclear Engineering world. The package is part of the ROSE (Reduced Order modelling with data-driven techniques for multi-phySics problEms): mathematical algorithms aimed at reducing the complexity of multi-physics models (for nuclear reactors applications), at searching for optimal sensor positions and at integrating real measures to improve the knowledge on the physical systems. With respect to the previous original implementation based on dolfinx package (v0.6.0), version 1.0.0 of pyforce has been completely re-written using pyvista as backend for mesh importing, computing integrals, and visualisation of results; in addition, functions are stored as numpy arrays, improving the ease of use of the package. This choice allows to use pyforce with any software solver able to export results in VTK format.
[LG-38] Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry
链接: https://arxiv.org/abs/2605.18078
作者: Yevhen Shcherbinin,Arina Redina,Maxim Kalpin,Vlad Kochetov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-agent policy-gradient methods have been shown to converge locally near stable Nash equilibria. Local convergence, however, does not determine which equilibrium is reached. We study this question through basin-entry probability with respect to a target set of equilibria selected by an external criterion, such as payoff dominance. For finite-unroll Meta-MAPG, we show that the update decomposes into ordinary policy gradient plus own-learning and peer-learning corrections, with controlled sampling noise and finite-unroll bias. We identify the peer-learning correction as the main equilibrium-selection mechanism: under a local alignment condition, the probability of entering the certified attraction region of the target stable-Nash set increases, relative to ordinary policy gradient. Because persistent correction may shift zero-update points of the original game, annealing the correction after entering the basin recovers ordinary policy-gradient dynamics and inherits local stable-Nash convergence guarantees. Experiments in Stag Hunt, iterated Prisoner’s Dilemma, and preliminary neural-policy coordination environments support this basin-entry view, showing increased entry into cooperative basins under peer-aware updates.
[LG-39] Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation ICIP
链接: https://arxiv.org/abs/2605.18020
作者: M Yashwanth,Arunabh Singh,Ashok Nayak,Sai Kiran Bulusu,Anirban Chakraborty
类目: Machine Learning (cs.LG)
*备注: Federated Learning, Rational Clients, Endogenous Participation, and Aggregation
Abstract:Federated Learning (FL) algorithms implicitly assume that clients passively comply with server-side orchestration by sharing local model updates upon server request. However, this overlooks an important aspect in real-world cross-silo environments: clients are often rational agents who may prioritize their utilities such as local model performance over that of the global model. In settings with significant statistical heterogeneity, rational clients may opt out of the federation if the perceived benefits of collaboration fail to meet their local utility thresholds. Such attrition degrades the global model performance and can lead to the collapse of the federated training process. In this work, we introduce FedUCA, (Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation), a framework that formalizes the server’s role as an optimizer seeking to maximize global model performance by sustaining client participation. We substantiate our framework through extensive experiments on standard datasets demonstrating that by prioritizing participation feasibility, FedUCA achieves significantly higher client retention and, consequently, a superior global model performance.
[LG-40] LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems
链接: https://arxiv.org/abs/2605.18015
作者: Mert Coskuner,Merve Zeybel,Melik Mert Dolan
类目: Machine Learning (cs.LG); Databases (cs.DB); Software Engineering (cs.SE)
*备注:
Abstract:Production log analytics in self-hosted, resource-constrained environments requires natural-language access to massive log streams without the cost of routing every query through a large language model. We present LogRouter, an end-to-end log question-answering system deployed on TUBITAK BILGEM’s national big data platform that combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. A two-level cost-aware router dispatches each query along one of four execution paths: direct response, Druid keyword search, template lookup with SQL generation, and pgvector semantic retrieval, while a Level-2 router selects either a 14B-class or 32B-class generator for the semantic path. A dedicated coder LLM handles text-to-SQL generation. We evaluate the system on four LogHub datasets (Linux, Apache, Windows, and Mac; 70 questions in total) under both an online full-pipeline configuration and an offline configuration that isolates the generator. The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux, while the full pipeline attains a mean ROUGE-1 of 0.373, BERTScore of 0.879, RAGAS Faithfulness of 0.779, and an end-to-end latency of 18.6 s. In an apples-to-apples offline comparison, the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while preserving Answer Correctness within 5.8 points and exceeding a Fixed-14B baseline on RAGAS Faithfulness across every dataset. Cost-aware dispatching is therefore a practical mechanism for production log QA: routing recovers most of the quality of an always-32B configuration at less than half the latency, and the L1 keyword vocabulary makes that routing decision with high precision without a learned classifier.
[LG-41] Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography
链接: https://arxiv.org/abs/2605.18008
作者: Mohammad Moulaeifard,Ciaran Bench,Philip J. Aston,Nils Strodthoff
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 2 figures
Abstract:Uncertainty quantification (UQ) is critical for safety-critical domains like healthcare, yet it is rarely evaluated under realistic out-of-distribution (OOD) conditions. Here, we assessed predictive performance and uncertainty reliability for deep learning-based blood pressure (BP) estimation from photoplethysmography (PPG) signals under both in-distribution (ID) and OOD settings. Using an XResNet1D-50 trained on PulseDB and tested on four external datasets, we compared deep ensembles (DE) and Monte Carlo dropout (MCD) with Gaussian negative log-likelihood (GNLL) and mean squared error (MSE) losses, optionally followed by post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), and isotonic regression (IR). The key findings of our study are as follows: (1) DE provides stronger predictive robustness under domain shift than MCD, an advantage that becomes clear primarily under external shift. (2) Recalibrated GNLL-based methods yield the best uncertainty calibration (e.g., GNLL+DE+CP for systolic blood pressure (SBP), GNLL+DE+TS for diastolic blood pressure (DBP)), while MSE-based uncertainty requires recalibration to become practically useful. (3) Across settings, CP and TS offer the most consistent gains, with IR remaining competitive in several cases. Overall, our results identify DE-based methods as most robust for predictive performance under domain shift, GNLL as strongest for native UQ, and recalibration as essential for making MSE-based uncertainty practical. These findings highlight the need to jointly assess predictive accuracy and calibration on external data for trustworthy cuffless BP estimation Comments: 23 pages, 2 figures Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2605.18008 [cs.LG] (or arXiv:2605.18008v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.18008 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-42] Scalable Decision-Focused Learning through Cost-Sensitive Regression
链接: https://arxiv.org/abs/2605.18005
作者: Noah Schutte,Senne Berden,Tias Guns,Krzysztof Postek,Neil Yorke-Smith
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages, 7 figures
Abstract:Many real-world combinatorial problems involve uncertain parameters, which can be predicted given contextual features and historical data. These predict-then-optimize' or contextual optimization’ problems have gained significant attention: end-to-end training methods can now minimize the downstream task cost rather than the predictive error. However, despite their effectiveness, these decision-focused learning (DFL) approaches often rely on repeated solving of the underlying combinatorial optimization problem during training, making them computationally expensive and difficult to scale. We reframe the learning problem as a cost-sensitive multi-output regression problem: multi-output due to the combinatorial problem having multiple uncertain parameters, and cost-sensitive due to the downstream task cost being the real target. Our technical contribution is the formalization of multiple loss function components that follow from this reframing: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs that mimic the true downstream task-based loss locally. These components require zero or one solve per training data instance, while requiring no further solves during training. Experiments show that the combination of loss components achieves comparable downstream task quality to the state of the art, while being significantly more efficient, enabling scaling to problem sizes that have not been tackled before with DFL.
[LG-43] RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search ICML2026
链接: https://arxiv.org/abs/2605.18004
作者: Jinglong Xiong,Xiaotian Liu,Ruoxin Wang,Zihang Liu,Yefan Zhou,Yujun Yan,Yaoqing Yang
类目: Machine Learning (cs.LG)
*备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 9 pages main text; 21 pages total
Abstract:Randomized linear algebra (RLA) algorithms are a modern class of numerical linear algebra techniques that play an essential role in scientific computing and machine learning, with broad and growing adoption. However, their discovery remains mostly a manual process that requires deep expert knowledge and inspiration. While Reinforcement Learning (RL) offers a pathway to automation, standard approaches struggle with sparse reward landscapes and vast search spaces inherent to high-performing RLA algorithms. In this paper, we present RL4RLA, a general RL framework that automates the discovery of interpretable, symbolic RLA algorithms. Unlike black-box approaches, our method builds explicit algorithms from basic linear algebra primitives, ensuring verifiable and implementable representations. To enable efficient discovery, we introduce: (1) a numerical curriculum that progressively increments problem difficulty to encode inductive bias specific to the RLA domain; (2) Monte Carlo Graph Search, which optimizes exploration by identifying and merging equivalent partial algorithms. We demonstrate that RL4RLA rediscovers state-of-the-art methods, including sketch-and-precondition solvers, Randomized Kaczmarz, and Newton Sketch, and can be targeted to produce algorithms optimized for specific trade-offs between accuracy, speed, and stability. Code is available at this https URL.
[LG-44] Function graph transformers universally approximate operators between function spaces
链接: https://arxiv.org/abs/2605.17968
作者: Takashi Furuya,David Mis,Ivan Dokmanić,Maarten V. de Hoop,Matti Lassas
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the approximation of nonlinear operators between function spaces by transformers. Our approach is to lift functions to measures supported on their graphs and leverage a recently introduced measure-theoretic view of transformers. A function h is represented by its graph measure \gamma_h , with finite tokens (x_j,h(x_j))_j=1^N being its empirical approximations. We show that this framework elegantly models discretization refinement via convergence of measures and provides a natural setting for operator learning. Within this framework, we introduce function graph transformers, a graph-preserving subclass of measure-theoretic transformers that maps graph measures to graph measures, which is to say that outputs remain single-valued functions. Crucially, this additional structure does not reduce generality: we prove that the resulting graph-preserving maps can be approximated by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation results for broad classes of nonlinear operators. Unlike existing theoretical approaches to operator learning with transformers, the measure-theoretic framework also accommodates regularized negative-order Sobolev inputs for which discretization invariance is particularly challenging, as well as query points on different output domains. Overall, function graph transformers provide a continuum viewpoint and mathematical toolkit for transformer-based operator learning, clarifying the roles of positional encodings, graph structure, regularization, and ensuring consistency across discretizations.
[LG-45] Enhancing the Code Reasoning Capabilities of LLM s via Consistency-based Reinforcement Learning
链接: https://arxiv.org/abs/2605.17958
作者: Zhanyue Qin,Jia Feng,Yibo Lyu,Yun Peng,Dianbo Sui,Cuiyun Gao,Qing Liao
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: Under review
Abstract:Code reasoning refers to the task of predicting the output of a program given its source code and specific inputs. It can measure the reasoning capability of large language models (LLMs) and also benefit downstream tasks such as code generation and mathematical reasoning. Existing work has verified the effectiveness of reinforcement learning on the task. However, these methods design rewards solely based on final outputs or coarse-grained signals, and neglect the inherent consistency of the stepwise reasoning process in the task. Therefore, these methods often result in sparse reward or reward hacking, which limits the full play of enhanced learning capabilities. To alleviate these issues, we propose CodeThinker, a consistency-driven reinforcement learning framework for code reasoning. Specifically, CodeThinker has three key components: (1) a stepwise reasoning-aware model training module, which utilizes a consistency tracing paradigm as a template to synthesize training data that captures the stepwise reasoning process; (2) a dynamic beam sampling strategy, which aims to improve the quality of sampled outputs under a fixed sampling budget; and (3) a consistency reward mechanism that can effectively alleviate reward hacking. Experiments on three popular benchmarks show that CodeThinker achieves state-of-the-art performance across multiple LLMs. For instance, it outperforms the strongest baseline by 4.3% in accuracy when deployed on Qwen2.5-Coder-7B-Instruct. We also validate the effectiveness of CodeThinker on downstream tasks. Results show that, without additional training, CodeThinker obtains average accuracy gains of 5.33 and 3.11 percentage points on mathematical reasoning and code reasoning tasks covering 17 programming languages, respectively.
[LG-46] InfoFlow: A Framework for Multi-Layer Transformer Analysis
链接: https://arxiv.org/abs/2605.17930
作者: Penghao Yu,Haotian Jiang,Zeyu Bao,Qianxiao Li
类目: Machine Learning (cs.LG)
*备注: 36 pages
Abstract:While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires least \Omega (\varepsilon^-k) parameters to achieve precision \varepsilon , where k grows linearly with sequence length T , whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most O (\varepsilon^-1) parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for k -th largest retrieval with k \geq 2 . Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.
[LG-47] ransfer Learning for Customized Car Racing Environments
链接: https://arxiv.org/abs/2605.17928
作者: Benedict Florance Arockiaraj,Richard Chang,Wesley Yee
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Transfer Learning, a technique where a model/agent can use the knowledge/expertise that it gained from one task and exploit that to solve another closely-related task, is often used in tackling problems in deep learning. Through this project, we explore transfer learning in the purview of deep reinforcement learning. Specifically, we want to use transfer learning to achieve the fast lap times in OpenAI’s Car racing environment by training the agent on one circuit, and racing it on other customized target environments by zero-shot transfer or by additional fine-tuning. In addition, we compare the performance of model-based and model-free approaches, and observe that model-based approaches dominate in performance and converge faster than model-free approaches in this environment. We observe that transfer learning in most setups not only boosts the performance on the target domain, but also shows high performance ability during learning.
[LG-48] Lightweight Gaussian Process Inference in C on Metal and CUDA
链接: https://arxiv.org/abs/2605.17898
作者: Yu-Hsueh Fang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Gaussian process (GP) inference in Python is dominated by libraries such as GPyTorch and GPflow, which are built on deep-learning frameworks and inherit their dispatch overhead and dependency footprint. We present LightGP, a dependency-free C++17 library for GP regression with Python bindings, supporting Apple Metal and NVIDIA CUDA backends alongside tuned CPU paths via Apple Accelerate and OpenBLAS. LightGP provides four inference paths – exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT – covering problems from N=100 to N=500,000 . On an Apple M4, LightGP CPU is 2.6–8.7 \times faster than GPyTorch CPU for exact GP and \sim1.5\times faster for sparse GP at every scale tested. On an NVIDIA RTX~3060, LightGP CUDA is 2.3–6.7 \times faster than GPyTorch CUDA for exact GP up to N=2,048 , with GPyTorch closing the gap at N=4,096 . A fused matrix-free kernel-vector product on Metal achieves 32 \times over the explicit path at N=20,000 with O(N) memory, and an FFT-accelerated SKI matvec via Accelerate vDSP runs in sub-millisecond time at N=200,000 . LightGP compiles as a single static library with zero external dependencies and is installable via \textttpip install lightgp
[LG-49] CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution
链接: https://arxiv.org/abs/2605.17889
作者: Mu-Young Son,Yi Chen,Seungjae Yoo,Soongyu Choi Joo-Young Kim
类目: Machine Learning (cs.LG)
*备注: 7 pages, 8 figures, accepted to DAC '26
Abstract:The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to a significant parameter size and intermediate data. Prior works attempt to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)-enabled CPU-GPU collaborative system that comprehensively optimizes MoE inference by combining coalesced expert execution with strategic workload orchestration for higher throughput. CoX-MoE introduces (i) a coalescing-aware orchestration policy to jointly optimize resource allocation by adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading, and (ii) a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU, mitigating PCIe transfer overhead and balancing workload for the CPU and GPU during inference. Compared to state-of-the-art frameworks, CoX-MoE delivers significant gains, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively. Comments: 7 pages, 8 figures, accepted to DAC '26 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.17889 [cs.LG] (or arXiv:2605.17889v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17889 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3770743.3804296 Focus to learn more DOI(s) linking to related resources
[LG-50] DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data
链接: https://arxiv.org/abs/2605.17866
作者: Masahiro Suzuki,Bohui Xia,Hiroto Yamamoto,Masanori Miyahara
类目: Machine Learning (cs.LG)
*备注:
Abstract:Small-scale data is a critical problem in time-series forecasting tasks. Data augmentation is an effective strategy for this task, but it has a limitation in generating meaningful data. To address this limitation, we propose DAD4TS, a diffusion-model-based data augmentation method with reinforcement learning, designed for time-series forecasting with small-scale data. In DAD4TS, a data generator is simultaneously trained with a time-series model and controlled by a reinforcement learning model to efficiently generate samples that improve the forecast accuracy of the time-series model. To support small-scale data, we use mathematical methods instead of conventional VAE methods to train the diffusion model by projecting the time-series data into the geometric space. We validated the effectiveness of DAD4TS with seven comparative methods through qualitative and quantitative experiments on six real-world datasets and eight time-series models. As a result, DAD4TS was validated on five datasets.
[LG-51] Learning over Positive and Negative Edges with Contrastive Message Passing
链接: https://arxiv.org/abs/2605.17854
作者: Peter Pao-Huang,Charilaos I. Kanatsoulis,Michael Bereket,Jure Leskovec
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conventional approaches to learning on graphs involve message passing along existing (i.e., positive) edges to update node features. However, these approaches often disregard the potentially valuable information contained in the absence (i.e., negative) of edges. Here, we theoretically analyze the value of negative edges in graph representations and prove that in settings of low label rates, high homophily, and high edge density, access to negative edges provides significant information gain over using only positive edges. Motivated by this insight, we introduce Contrastive Message Passing (CMP), a general message passing architecture that enable graph neural network layers to reason over positive and negative edges. By imposing soft positive semidefinite constraints on the learnable weights, our approach differentially applies similarity-preserving transformations to positively connected nodes and dissimilarity-inducing transformations to negatively connected nodes. Over simulated and real datasets in varying data regimes, CMP consistently outperforms baselines in low-label settings when negative edges are informative.
[LG-52] SNLP: Layer-Parallel Inference via Structured Newton Corrections
链接: https://arxiv.org/abs/2605.17842
作者: Ligong Han,Kai Xu,Hao Wang,Akash Srivastava
类目: Machine Learning (cs.LG)
*备注: Project webpage: this https URL
Abstract:Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians with cheap architecture-induced surrogate dynamics. In residual Transformers, this yields Identity Newton (IDN), where the correction reduces to a prefix-sum-like update; in mHC-style architectures, HC Newton (HCN) uses the model’s residual mixing matrix. We further introduce SNLP-aware regularization, which trains models to make one or a few structured Newton iterations accurately approximate the sequential forward. Experiments on nanochat-scale Transformers show that SNLP regularization improves layer-parallel compatibility and can also improve standard sequential perplexity, reducing baseline PPL by 4.7%-23.4%. At inference time, SNLP combined with layer fusion and chunkwise decomposition achieves practical wall-clock speedups: on a 0.5B Nanochat model, it reaches 2.3x speedup while still improving PPL by 6.1%. These results suggest that layer-parallel inference is not merely a numerical approximation to sequential execution, but can act as a useful solver-induced inference bias. We also characterize limitations: off-the-shelf pretrained models are less amenable to this procedure, and exact convergence recovers the sequential computation rather than providing monotonic inference-time scaling.
[LG-53] Agent ic Cost-Aware Query Planning with Knowledge Distillation for Big Data Analytics
链接: https://arxiv.org/abs/2605.17831
作者: Mahdi Naser-Moghadasi
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 8 pages, preprint, code at this https URL
Abstract:Query optimization in big data analytics remains computationally expensive, particularly for resource-constrained environments where traditional optimizers fail to satisfy memory and latency constraints. We present an agentic query planning system that combines a rule-based teacher planner, UCB1 bandit exploration, cost-aware prediction, and knowledge distillation to a lightweight student planner. Our teacher planner generates SQL plans using six key optimization strategies, while UCB1 bandit search efficiently explores the plan space under explicit resource constraints. A Random Forest cost model predicts query latency from plan features, enabling cost-aware decisions. A distilled student planner (Logistic Regression or Gradient Boosting) learns to mimic teacher-bandit decisions for fast inference. Evaluation on NYC Taxi and IMDB datasets demonstrates 23% latency reduction compared to default planners while maintaining 94% constraint satisfaction. The student planner achieves 89% accuracy in replicating optimal plans with 15x faster inference time. Our single-file implementation enables reproducible big-data analytics on resource-limited machines and is publicly available at this https URL.
[LG-54] A Unified Framework for Data-Free One-Step Sampling via Wasserstein Gradient Flows
链接: https://arxiv.org/abs/2605.17808
作者: Chenguang Wang,Tianshu Yu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We develop a unified theoretical framework for data-free one-step sampling from unnormalized target distributions based on Wasserstein gradient flows. For a broad class of standard f-divergence objectives, we show that the induced velocity field admits the universal form \mathbfV(x)=w(r(x)),\beta(x) , where \beta(x)=\nabla \log (p(x)/q(x)) is shared across objectives and w is determined solely by the choice of divergence. This decomposition shows that standard f-divergence drifts share the same asymptotic target distribution p and differ primarily in how they redistribute transient repair effort across under-covered regions. To formalize this distinction, we derive a one-step regional-response theory for a soft under-coverage functional and obtain a compression–elasticity identity that links divergence choice to the geometry of mass transport into under-covered regions. We further extend the framework beyond the f-divergence family to the Log-Variance (LV) divergence, analyze how the reference distribution alters the resulting drift structure, and motivate a practical LV-inspired surrogate for data-free training. Based on this theory, we instantiate the framework with a KDE-based implementation and describe a complementary normalizing-flow route, enabling one-step inference after training. Experiments on multimodal Gaussian-mixture benchmarks are consistent with the theoretical predictions and demonstrate effective one-step sampling on these targets.
[LG-55] AMO: Adaptive Muon Orthogonalization
链接: https://arxiv.org/abs/2605.17806
作者: Xinlin Zhuang,Panyi Ouyang,Yichen Li,Jiangming Shi,Yizhang Chen,Shuman Liu,Ying Qian,Weiyang Liu,Haibo Zhang,Imran Razzak
类目: Machine Learning (cs.LG)
*备注: preprint, under-review
Abstract:Muon has recently emerged as a competitive alternative to AdamW for large-scale pre-training, with orthogonalization via Newton-Schulz (NS) iterations as its core operation. Existing Muon variants apply a uniform NS schedule to all parameter matrices, overlooking possible differences in orthogonalization difficulty and its impact on performance. Through a systematic empirical study, we show that this per-matrix heterogeneity is pervasive and largely determined by matrix geometry, which evolves dynamically across operator types, training stages, and network depths. As a result, uniform NS schedules can lead to uneven orthogonalization quality across the model. Motivated by these findings, we propose Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for the remainder of training. AMO delivers consistent improvements over uniform-schedule Muon across standard, prolonged, and continual pre-training, surpassing the strongest baseline by +0.76 on Llama3.1-1.4B and +0.51 on Qwen3-1.7B in average downstream performance of 12 evaluation tasks.
[LG-56] GenTS: A Comprehensive Benchmark Library for Generative Time Series Models
链接: https://arxiv.org/abs/2605.17804
作者: Chenxi Wang,Xiaorong Wang,Peiyang Li,Yi Wang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Generative models have demonstrated remarkable potential in time series analysis tasks, like synthesis, forecasting, imputation, etc. However, offering limited coverage for generative models, existing time series libraries are mainly engineered for discriminative models, with standardized workflows for specific tasks, such as optimizing Mean Squared Errors for time series forecasting. This rigid structure is fundamentally incompatible with the distinct and often complex paradigms of generative models (e.g., adversarial training, diffusion processes), which learn the underlying data distribution rather than a direct input-output mapping. To this end, we proposed GenTS, a comprehensive and extensible benchmark library designed for systematic assessment on generative time series models. GenTS features a unified data preprocessing pipeline, a collection of versatile models, and panoramic evaluation metrics. Its modular design also enables the researchers to flexibly customize beyond our built-in datasets and models. Based on GenTS, we conducted benchmarking experiments under diverse tasks, accordingly offering suggestions for model selection and identifying potential directions for future research. Our codes are open-source at this https URL. The official tutorials and document are available at this https URL.
[LG-57] HydroAgent : Closing the Gap Between Frontier LLM s and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL
链接: https://arxiv.org/abs/2605.17792
作者: Zhi Li,Songkun Yan,Jie Cao,Mofan Zhang,Anjiang Wei,Jinwoong Yoo,Yang Hong
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
Abstract:Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management - streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents - Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash - on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329-40,792 km2 ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HYDROAGENT, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations - reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data - remote sensing, in-situ time series, and forecaster narrative - makes domain agents a leveraged direction for AI in physical science.
[LG-58] Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates
链接: https://arxiv.org/abs/2605.17787
作者: Athanasios Glentis,Dawei Li,Chung-Yiu Yau,Mingyi Hong
类目: Machine Learning (cs.LG)
*备注:
Abstract:It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD’s inability to sustain learning rates comparable to Adam’s much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects severely restrict the admissible learning rate of SGD. Guided by this understanding, we show that simple clipping mechanisms that stabilize SGD at large learning rates enable it to recover most of Adam’s performance. In our large-scale experiments, the validation loss gap between large-learning-rate SGD and Adam shrinks from more than 50% to only about 3.5% when pre-training a 1B-parameter LLaMA model with a 1M-token batch size.
[LG-59] Learning Variable-Length Tokenization for Generative Recommendation
链接: https://arxiv.org/abs/2605.17779
作者: Minhao Wang,Bowen Wu,Wei Zhang
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures
Abstract:Generative recommendation reformulates recommendation as next-token prediction over discrete semantic identifiers (IDs). A fundamental yet unexplored design choice is that existing methods employ fixed-length tokenization for all items, implicitly assuming uniform encoding capacity regardless of item characteristics. Through systematic experiments across four datasets, we discover the Popularity-Length Paradox: popular items achieve optimal performance with short IDs, while tail items require substantially longer codes to capture discriminative semantics. This reveals a critical mismatch where popular items benefit from abundant collaborative signals and require minimal semantic detail, whereas tail items must rely on fine-grained content features due to sparse interaction data. To address this, we propose VarLenRec, a framework for learning variable-length tokenization. We develop Popularity-Weighted Information Budget Allocation (PIBA), an information-theoretic framework proving that optimal ID length should scale as a negative power of popularity. Directly implementing variable-length allocation faces two technical challenges: standard Euclidean residual quantization lacks geometric capacity to support diverse code lengths without distortion, and discrete length decisions are non-differentiable. We address these through Hyperbolic Residual Quantization, which leverages the exponential volume growth of the Poincaré ball to naturally stratify encoding capacity, and a Soft Length Controller, which enables differentiable length prediction via continuous layer retention probabilities regularized by PIBA-derived priors. Extensive experiments demonstrate that VarLenRec achieves significant improvements over state-of-the-art methods in recommendation accuracy and training/inference efficiency, revealing the importance of adaptive encoding capacity in generative recommendation. Comments: 13 pages, 5 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.17779 [cs.LG] (or arXiv:2605.17779v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17779 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-60] AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models
链接: https://arxiv.org/abs/2605.17765
作者: Yuanyun Zhang,Shi Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent healthcare foundation models have achieved strong predictive performance through large scale self supervised learning, yet their latent representations frequently entangle physiologic severity, intervention intensity, observational structure, and institutional workflow into shared embedding directions. While effective for downstream prediction, such representations remain semantically opaque and unstable under contextual shift. We introduce AURORA, Adaptive Uncertainty aware Representations through Orthogonalized Relational Alignment, a new framework for healthcare representation learning based on contextual latent geometry. Rather than optimizing a single unified embedding manifold, AURORA decomposes representations into orthogonal semantic subspaces corresponding to distinct contextual factors and learns relational consistency objectives within each subspace. This induces latent spaces that are both semantically disentangled and geometrically interpretable. Across multiple clinical prediction and retrieval tasks, AURORA consistently outperforms reconstruction, contrastive, and self distillation baselines while substantially improving contextual disentanglement, neighborhood purity, and robustness under institutional distribution shift. Our results suggest that latent geometry itself constitutes an important axis of healthcare foundation model design and that explicitly structuring representation space according to contextual semantics provides a complementary direction beyond conventional predictive compression objectives.
[LG-61] MV-Gate: Insider Threat Detection via Multi-View Behavioral Statistics and Semantic Modeling
链接: https://arxiv.org/abs/2605.17761
作者: Kaichuan Kong,Dongjie Liu,Xiaobo Jin,Guanggang Geng
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: Accepted by The 29th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2026)
Abstract:Insider threats often reveal early anomalies through disruptions in behavioral statistics-such as altered recurrence patterns or short-versus long-term frequency shifts-rather than changes in event semantics. Yet, as the field has shifted from statistical modeling to log tokenization and deep sequential encoders, these statistical cues are weakened or lost, leaving current models insensitive to gradual and low-visibility insider this http URL propose MV-Gate, a multi-view behavior modeling framework that explicitly integrates statistical regularities with sequence semantics. MV-Gate constructs three aligned behavioral sequences: activity tokens, multi-scale status signals capturing recurrence patterns, and frequency-deviation signals describing short- vs long-term intensity differences. An anomaly-aware gating mechanism injects these statistical views into the attention computation, guiding the encoder to emphasize statistically irregular events. Experiments on CERT r4.2, CERT r5.2, and ADFA-LD show that MV-Gate achieves notable gains over classical, deep-learning, and domain-specific baselines, particularly for progressive, weak-signal threats. These results highlight the necessity of jointly modeling statistical and sequential evidence for robust insider-threat detection.
[LG-62] Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets
链接: https://arxiv.org/abs/2605.17758
作者: Nitish Nagesh,Mahdi Bagheri,Arshia Harish Puthran,Pengbao Zhou,Muhjaazee Love,Aadi Sharma,Ian Harris,Amir M. Rahmani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Synthetic data is widely used in healthcare to create datasets that are similar to original data but without the privacy concerns. Generating and evaluating synthetic data across privacy, utility and fairness is crucial for facilitating high quality data availability for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by leveraging existing synthetic data tools, the power of large language models and state-of-the-art evaluation metrics. Our tool creates a unified workflow for data generation, validation and evaluation. Users have control over the training size, training epochs and the number of synthetic rows to sample. Instead of knobs to tune synthetic data, the interactive agent allows users to specify their synthetic data generation goals and the tool will orchestrate the workflow by leveraging existing tools while performing the requisite evaluation. For the demo, we use an open source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE and GaussianCopula have comparable performance across fairness and utility metrics. The workflow allows users flexibility and control over the data generation and evaluation process.
[LG-63] stable and Actionable Calibration for Full Swap Regret
链接: https://arxiv.org/abs/2605.17749
作者: Konstantina Bairaktari,Lunjia Hu,Huy L. Nguyen,Jonathan Ullman
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:AI generated predictions increasingly inform decision making in critical tasks, and therefore must be trustworthy. One widely used measure of trustworthiness is calibration, which requires that the predictions match the true frequencies and can be treated like real probabilities of a given outcome. However, defining calibration is subtle, and designing good measures of calibration error has been an active topic of recent research. The first goal is to find calibration measures that are actionable, meaning they can inform decision makers about their utility loss when predictions are treated as true probabilities, which is known as swap regret. The second goal is to find calibration measures that are testable, meaning that calibration error can be measured from a small sample of predictions and outcomes. Although these are very basic requirements, there is no existing calibration measure that fully satisfies both properties, and all existing measures relax actionability by bounding a weaker notion of swap regret, or relax testability by having suboptimal estimation error. We introduce a new calibration measure, Soft-Binned Calibration Decision Loss (SCDL), which we prove is fully actionable without weakening either requirement, and testable with nearly optimal error rate. In addition, SCDL satisfies other desired properties such as continuity and consistency. We also provide a set of experiments confirming that the theoretical advantages of SCDL compared to other measures lead to better performance in practice.
[LG-64] oy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space
链接: https://arxiv.org/abs/2605.17704
作者: Alon Bebchuk,Nir Shavit
类目: Machine Learning (cs.LG)
*备注:
Abstract:The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, ``winning tickets,‘’ that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that are already near, at initialization, to the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. We validate this account with lightweight probes based on feature-space distance and motion; in our setting, these probes frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery. Although these findings are grounded in a toy setting, they suggest that the lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.
[LG-65] Exact Convex Reformulations of Linear Neural Networks via Completely Positive Lifting
链接: https://arxiv.org/abs/2605.17692
作者: Karthik Prakhya,Alp Yurtsever
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We show that the training problem of a deep linear neural network under the squared loss admits an exact convex reformulation in a lifted space over a generalized completely positive cone. The reformulation has the same optimal value as the original nonconvex problem and is linear in the lifted variables, with all nonconvexity encoded in the cone constraint. Its ambient lifted dimension depends only on the input and output dimensions, independent of the network depth and the number of data points, and the bottleneck width enters only through scalar constraints. The construction proceeds by reducing the multilayer parameterization to a bilinear factorization, lifting it to a rank-constrained semidefinite program, expressing the rank constraint via a complementarity condition, and applying a completely positive lifting. While the resulting formulation is computationally intractable in general, it gives an exact conic representation of the nonconvexity induced by linear factorization and connects linear neural network training with copositive programming.
[LG-66] Bug or Feature2: Weight Drift Activation Sparsity and Spikes
链接: https://arxiv.org/abs/2605.17659
作者: Egor Shvetsov,Aleksandr Serkov,Shokorov Viacheslav,Redko Dmitry,Vladislav Goloshchapov,Evgeny Burnaev
类目: Machine Learning (cs.LG)
*备注:
Abstract:The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above \sim 70% activation sparsity. While ReLU ^2 achieves a good sparsity–accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU ^2 outperforms its unclipped version, and GELU ^2 achieves the lowest validation loss on GPT-nano. Code is available at this https URL.
[LG-67] When a Zero-Shooter Cheats: Improving Age Estimation via Activation Steering
链接: https://arxiv.org/abs/2605.17658
作者: Erik Imgrund,Pia Hanfeld,Klim Kireev,Konrad Rieck
类目: Machine Learning (cs.LG)
*备注:
Abstract:Different age-related regulations have been proposed to protect minors from harmful content and interactions online. Automated age estimation is central to enforcing such regulations, and vision-language models (VLMs) achieve state-of-the-art performance on this task. However, we find that the zero-shot nature of VLM-based age estimation produces an unexpected side effect we call the identity shortcut: Instead of estimating age from visual features, VLMs tend to identify the depicted person and infer their age from memorized knowledge. This phenomenon leads to substantially incorrect predictions when non-celebrities are misidentified as celebrities. It also produces deceptively high robustness to noise and adversarial perturbations on celebrity images, which dominate popular benchmarks. To mitigate this, we propose an activation steering method that suppresses the shortcut by intervening on the hidden states of the VLM. This method improves age estimation accuracy for both memorized and unseen identities, reducing mean absolute error by up to 25% across popular benchmarks.
[LG-68] Counterfactual Explanations Under Concept Drift
链接: https://arxiv.org/abs/2605.17651
作者: Marcin Kostrzewa,Jerzy Stefanowski,Maciej Zięba
类目: Machine Learning (cs.LG)
*备注:
Abstract:Counterfactual explanations (CFEs) provide actionable recourse, but most methods assume a static framework with fixed data and a trained classifier. This assumption breaks in evolving data environments, such as data streams, where online models are repeatedly updated under concept drift. We identify CFE maintenance in this setting as a previously overlooked problem: explanations that are valid when generated may silently become invalid as the model evolves, including robust CFEs, which are not designed for continuous drift. We propose a lightweight, model-agnostic update scheme that repairs existing CFEs using local sampling to estimate validity and plausibility directions while preserving proximity to the original instance. Experiments on synthetic drifting streams show that initially created CFEs rapidly lose validity, whereas maintained CFEs preserve validity and local plausibility at a lower cost than repeated regeneration.
[LG-69] abKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates
链接: https://arxiv.org/abs/2605.17642
作者: Meysam Alishahi,Yan Zheng,Junpeng Wang,Chin-Chia Michael Yeh,Jeff M. Phillips
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tabular data generation considers a large table with multiple columns – each column comprised of numerical, categorical, or sometimes ordinal values. The goal is to produce new rows for the table that replicate the distribution of rows from the original data – without just copying those initial rows. The last 4 years have seen enormous progress on this problem, mostly using computational expensive methods that employ one-hot encoding, VAEs, and diffusion. This paper describes a new approach to the problem of tabular data generation. By employing copula transformations and modeling the distribution as a kernel density estimate we can nearly match the accuracy and leakage-avoidance achievements of the previous methods, but with almost no training time. Our method is very scalable, and can be run on data sets orders of magnitude larger than prior state-of-the-art on a simple laptop. Moreover, because we employ kernel density estimates, we can store the model as a coreset of the original data – we believe the first for generative modeling – and as a result, require significantly less space as well. Our code is available here: \urlthis https URL Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.17642 [cs.LG] (or arXiv:2605.17642v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17642 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-70] Verifier-Guided Code Translation via Meta-Step Decoding
链接: https://arxiv.org/abs/2605.17626
作者: Tianyang Zhou,Somesh Jha,Mihai Christodorescu,Kirill Levchenko,Varun Chandrasekaran
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 31 pages, 8 figures
Abstract:Test-time scaling is an important mechanism for improving large language models, especially on tasks with deterministic verifiers. Code translation is a canonical example: the source program constrains valid outputs, while compilers, type check- ers, and behavioral checks provide exact pass/fail feedback. Existing approaches typically apply these verifiers only after generation, which is inefficient because early errors corrupt the autoregressive context and are rarely corrected later. We introduce Decoding Time Verification (DTV), a framework that treats structural boundaries as meta steps for verifier-guided decoding. DTV interleaves generation with verifier calls under a state-machine controller that enforces valid prefixes, using structural-boundary checks and structure-aware rollback to prevent error propagation while reducing wasted tokens. We evaluate DTV on C-to-Rust and JavaScript-to-TypeScript translation. Using Qwen3-4B as the primary generator under matched token budgets, DTV improves pass rates from 72.3% to 82.0% on C-to-Rust and from 33.3% to 46.0% on JavaScript-to-TypeScript relative to matched self-refinement baselines, while using fewer tokens per case; the same trend largely transfers to Gemma-4-E4B. In the evaluated cost-matched grid, DTV achieves a more favorable pass-rate-cost tradeoff than post-hoc verification or sampling-based scaling. These results show that verifier-guided decoding is an effective use of inference-time compute for code translation.
[LG-71] VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
链接: https://arxiv.org/abs/2605.17613
作者: Jiayi Yao,Samuel Shen,Kuntai Du,Shaoting Feng,Dongjoo Seo,Rui Zhang,Yuyang Huang,Yuhan Liu,Shan Lu,Junchen Jiang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy-despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding but largely preserves the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache requires addressing a key system challenge to work-keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap. VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4X higher throughput than full-KV inference while producing identical outputs. Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG) Cite as: arXiv:2605.17613 [cs.AR] (or arXiv:2605.17613v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2605.17613 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-72] A Feature-Driven Framework for Software Fault Prediction
链接: https://arxiv.org/abs/2605.17611
作者: Ahmad Nauman Ghazi,Nagajyothi Devarapalli,Ashir Javeed,Sadi Alawadi,Fahed Alkhabbas,Khalid AlKharabsheh
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Pages 1-9, Preprint, Accepted for publication in FLICS2026
Abstract:Software fault prediction (SFP) is a critical task in software engineering, enabling early identification of faults in modules to improve software quality and reduce maintenance costs. This research investigates the combined effects of feature selection and parameter tuning on the performance of machine learning (ML) models for SFP. This study evaluates the interaction between feature selection methods, including correlation-based feature selection (CFS), recursive feature elimination (RFE), mutual information (MI), and L1 regularization, where hyperparameter tuning techniques such as grid search, randomized search, and genetic algorithm (GA) are used for optimization of ML algorithms, including random forest (RF), logistic regression (LR), and support vector machines (SVM) for optimized fault prediction performance. The combined application of CFS and GA yielded the highest accuracy, achieving 88.40% with RF, representing an improvement of 18% over baseline models without feature selection or tuning. Feature selection reduced dimensionality and identified critical attributes such as weighted methods per Class (WMC) and coupling between objects (CBO), while iterative parameter tuning optimized model alignment to these feature sets. Notably, the proposed methods demonstrated robustness, with minimal cross-validation variability (±1.0%), and efficiency, reducing training times in univariate methods such as L1 regularization.
[LG-73] Adaptive Generate-Rank-Verify: Inference-Time Search with Costly Verification
链接: https://arxiv.org/abs/2605.17609
作者: Shaddin Dughmi,Mahdi Haghifam,Yusuf Hakan Kalayci
类目: Machine Learning (cs.LG)
*备注: 33 Pages, 6 Figures, 4 Tables
Abstract:Many inference-time language-model pipelines combine a cheap reward signal with an expensive verifier, such as exact answer checking in mathematical reasoning or hidden-test execution in code generation. We formalize this setting using a learning-theoretic lens as generative active search: a cost-sensitive first-positive search problem in which a policy adaptively samples candidates from an unknown distribution, observes cheap scores, and pays for verifier labels until it finds a positive example. For a fixed prompt, the generator and reward model induce two unknown objects: a distribution over reward scores and a score-conditioned success function. When these quantities are known, we characterize the distribution-aware optimal policy using a dynamic programming approach. In the realistic and practical setting where both the score distribution and success function are unknown, we propose ADAP, a shellwise adaptive generate-rank-verify algorithm that progressively increases the number of sampled responses and top-ranked verifications. Under the monotonicity assumption that higher reward scores are no less likely to pass verification, we show that ADAP achieves expected cost within a constant factor of the distribution-aware optimum. We complement this result with learning-theoretic lower bounds, based on a centered star number, showing that structural assumptions on the score–label relationship are necessary. Experiments on mathematical reasoning and competitive programming validate the predicted advantage over both fixed non-adaptive policies and difficulty-adaptive baselines. Comments: 33 Pages, 6 Figures, 4 Tables Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.17609 [cs.LG] (or arXiv:2605.17609v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17609 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-74] he Neural Tangent Kernel for Classification
链接: https://arxiv.org/abs/2605.17606
作者: Jonathan Plenk,Sergio Calvo-Ordonez,Alvaro Cartea,Yarin Gal,Mark van der Wilk,Kamil Ciosek
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:In wide neural networks, the Neural Tangent Kernel (NTK) remains approximately constant during training, providing a powerful theoretical tool for studying training dynamics, generalization, and connections to kernel methods. However, this theory is largely restricted to regression losses. It was previously thought that training on a classification loss, or more generally losses involving nonlinear output transformations, breaks this property, leading to divergent logits and a breakdown of the linearization. In this paper, we extend NTK theory to classification by identifying conditions under which wide neural networks remain in the lazy training regime. We show that parameter-space regularization ensures a constant NTK during training for cross-entropy loss, while in the absence of regularization the regime is recovered when targets are non-degenerate, i.e. when all classes have strictly positive probability. Under these conditions, training is well-approximated by the linearized model, yielding an explicit characterization of the solution in terms of the NTK. We further analyze the distribution of trained predictors induced by random initialization and relate this notion of model uncertainty to Bayesian methods.
[LG-75] Venom: A PyTorch Generative Modeling Toolkit
链接: https://arxiv.org/abs/2605.17605
作者: Liang Yan
类目: Machine Learning (cs.LG)
*备注: Preprints
Abstract:Modern generative modeling has grown into a broad collection of related but often separately implemented paradigms, including denoising diffusion models, score-based stochastic differential equations, flow matching, variational autoencoders, normalizing flows, adversarial models, and energy-based models. For newcomers, this fragmentation makes it difficult to compare training objectives, inference procedures, sampling algorithms, and conditioning mechanisms within a single coherent codebase. We introduce V ENOM, an educational PyTorch toolkit that implements representative generative modeling families under a unified, MNIST-first interface. V ENOM emphasizes breadth, readability, reproducible entry points, and consistent training and sampling APIs rather than large-scale performance engineering. The package currently includes diffusion and score-based models, flow matching and one-step generators, variational autoencoders, normalizing flows, generative adversarial networks, and energy-based models. It provides separate training and sampling scripts, classifier and classifier-free guidance examples, bilingual tutorial notebooks, and a model-family organization that supports teaching, prototyping, and lightweight benchmarking.
[LG-76] Form and Function: Machine Unlearning as a Problem of Misaligned States
链接: https://arxiv.org/abs/2605.17590
作者: Kennon Stewart
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We formulate machine unlearning for online L-BFGS as a counterfactual state-alignment problem. Given an actual event stream and a deletion-edited counterfactual stream, the target of unlearning is the optimizer state that would have arisen had the deleted samples never been processed. We introduce state-aware metrics that separately measure parameter error, memory-operator error, combined state error, and update-direction error. The memory metric compares the inverse-Hessian actions induced by the o-L-BFGS memory, rather than treating curvature pairs as of finite influence. Under convexity assumptions, we derive a recursive bound on counterfactual state deviation. We then evaluate a state-aware benchmark of deletion interventions, including memory-only and parameter-only corrections, against an counterfactual oracle model. These results show that unlearning for online L-BFGS is not merely a parameter-correction problem: it requires alignment with a realizable counterfactual optimizer state.
[LG-77] Scale-Equivariant Generative Forecasting: Weight-Tied Dilated Convolutions Wavelet Scattering Inputs and Spectral-Consistency Training for Self-Similar Time Series
链接: https://arxiv.org/abs/2605.17582
作者: Andrea Morandi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Many natural and engineered time series – equity returns, climate anomalies, turbulent velocities, neural recordings, packet-level network traffic – are approximately self-similar: their horizon- T distribution is tied to the horizon- 1 distribution by one scaling exponent H . Standard deep generative sequence models (transformers, dilated TCNs, the WaveNet family) ignore this. Their receptive fields are wide, but kernel parameters live independently at every dilation level, yielding a multi-scale architecture, not a scale-equivariant one. We make three contributions. First, we give a precise definition of discrete scale equivariance for 1D causal networks and prove that dyadic dilation commutes (up to boundary effects) with any dilated-convolution stack whose kernel weights are shared across levels. Tying the kernel shrinks the convolutional parameter budget by an L -fold factor (where L is depth) and hard-wires self-similarity in as an inductive bias. Second, we wrap this Scale-Equivariant WaveNet (SE-WaveNet) backbone in three components that carry the same prior: a one-level Daubechies-4 wavelet input, a Hurst-FiLM block exposing the local scaling exponent, and a spectral-consistency training term targeting the |f|^-(2H+1) power-law spectrum. The head is a conditional normalising flow, chosen to preserve equivariance. Third, on 30 years of SP 500 daily log-returns, SE-WaveNet samples reproduce the empirical scaling-collapse diagnostic on the Allan-Variance top-25 universe (median \mathcalC^\star = 0.020 ), while a vanilla WaveNet at matched capacity does not ( \geq 0.06 ). NLL, KS-calibration, and tail energy distance tie or beat the baseline, with L\times fewer convolutional parameters.
[LG-78] Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling
链接: https://arxiv.org/abs/2605.17568
作者: Zhitong Xu,Qiwei Yuan,Yinghao Chen,Shandian Zhe,Bin Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-class event streams arise in numerous real-world applications, where uncovering structured, interpretable inter-event relationships, together with accurate prediction, remains a central challenge. Existing neural point process models are highly expressive but encode event interactions in a black-box manner, preventing explicit discovery of structured dependencies. In this paper, we propose a structured neural marked point process (SNMPP) that achieves high modeling flexibility while enabling explicit event-wise and class-wise relationship discovery from data. Our model constructs a product-form neural influence kernel composed of a signed interaction network over event types and a delay-aware monotonic temporal network. This design enables explicit characterization of inter-class influence topology – including excitation, inhibition, and neutrality – while flexibly capturing diverse temporal decay patterns and potential influence delays. For efficient learning, we develop a stratified Monte Carlo estimator for stochastic training. Extensive experiments on synthetic and real-world benchmark datasets validate the ability of our approach to uncover structured relationships and deliver strong predictive performance.
[LG-79] Q-LocalAdam: Memory-Efficient Client-Side Adaptive Optimization for Edge Federated Learning
链接: https://arxiv.org/abs/2605.17552
作者: Vedant Waykole,Haroon R. Lone
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning on edge devices must cope with non-IID client data and tight memory budgets. Adaptive optimizers like Adam stabilize training under data heterogeneity but require storing full-precision momentum and variance states, often tripling client memory overhead. This limits deployable model sizes and concurrent federated jobs on resource-constrained devices. We empirically observe that momentum and variance in federated Adam exhibit fundamentally different statistical properties: momentum values are symmetric and bounded, while variance spans eight orders of magnitude with log-normal structure. Motivated by this asymmetry, we propose \textbfQ-LocalAdam, which applies distribution-aware 8-bit quantization block-wise linear encoding for momentum and log-space encoding for variance while keeping model parameters in full precision. Across CIFAR-10 and CIFAR-100 under varying data heterogeneity ( \alpha \in \0.1, 0.5, 1.0, \textIID\ ), Q-LocalAdam achieves 3.37\times optimizer memory reduction with no accuracy loss under moderate heterogeneity and significant improvements under extreme heterogeneity (e.g., +5.74pp on CIFAR-100, \alpha=0.1 ). Multi-seed validation confirms statistical significance ( p0.01 ). In contrast, naive uniform quantization degrades to random performance, demonstrating that distribution-aware design is essential. Q-LocalAdam enables larger models and more concurrent workloads on memory-constrained edge devices without modifying the federated protocol. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.17552 [cs.LG] (or arXiv:2605.17552v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17552 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-80] Coordinate Heterogeneity Governs Binary Quantization: From InfoNCE to Recall
链接: https://arxiv.org/abs/2605.17524
作者: Wenxuan Xiao
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 17 pages, 1 figure, 15 tables (5 in main text, 10 in appendix)
Abstract:Binary quantization (BQ) compresses high-dimensional embeddings into one or two bits per coordinate, enabling nearest neighbor search at extreme speed. Yet a striking puzzle persists: BQ achieves competitive recall on contrastive embeddings but fails on others – and two leading systems adopt diametrically opposite strategies (random rotation vs. preserving coordinate axes) without a common theory explaining when each is appropriate. We resolve this puzzle by connecting the Gaussian structure recently established for InfoNCE-trained representations to a complete analytical framework for BQ quality. The key insight is that coordinate heterogeneity – the non-uniformity of per-coordinate variances – governs the key aspects of BQ performance. We derive closed-form expressions for ranking fidelity, prove that the magnitude bit carries information proportional to heterogeneity, and show that random rotation destroys precisely the signal that one paradigm exploits while creating the isotropy that the other requires. A two-parameter scaling law predicts fidelity across models and dimensions. Experiments on 13 datasets and 6 embedding families validate all predictions and provide the first principled design guide for binary quantization systems.
[LG-81] -gems: text-guided exit modules for decreasing clip image encoder ICASSP2026
链接: https://arxiv.org/abs/2605.17499
作者: Alberto Presta,Grzegorz Stefanski,Michal Byra,Krzysztof Arendt
类目: Machine Learning (cs.LG)
*备注: Accepted at ICASSP 2026
Abstract:Multimodal deep neural networks enhance deep comprehension by integrating diverse data modalities. Data from different modalities are typically projected into a shared latent space for similarity computation, but this process is resource intensive due to large image encoders and equal processing of test data during prediction. Early exit methods reduce computational load by utilizing intermediate layers, saving time and memory. However, developing such methods is challenging for multimodal data like image-text pairs. This study investigates the semantic content distributions present in intermediate layers of encoders such as CLIP, which can be derived from textual descriptions. We introduce Text-Guided Exit Modules (T-GEMs) and a rate-based regularizer to control encoder usage costs while maintaining cross-modal understanding performance.
[LG-82] Self-Supervised On-Policy Distillation for Reasoning Language Models
链接: https://arxiv.org/abs/2605.17497
作者: Zhiquan Tan,Yinrong Hong
类目: Machine Learning (cs.LG)
*备注:
Abstract:GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce \emphSelf-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct–wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model-benchmark settings. On Qwen3-8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points. Code will be released at this https URL.
[LG-83] DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
链接: https://arxiv.org/abs/2605.17486
作者: Sixu Lin,Yunpeng Qing,Litao Liu,Ming Zhou,Ruixing Jin,Xiaoyi Fan,Guiliang Liu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on LIBERO, RoboTwin2 benchmarks, and further validate it on real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.
[LG-84] WinQ: Accelerating Quantization-Aware Training of Language Models Around Saddle Points ICML2026
链接: https://arxiv.org/abs/2605.17471
作者: Dongyue Li,Zechun Liu,Kai Yi,Zhenshuo Zhang,Changsheng Zhao,Raghuraman Krishnamoorthi,Harshit Khaitan,Hongyang R. Zhang,Steven Li
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 23 pages; To appear in ICML 2026
Abstract:Quantization-aware training (QAT) is widely adopted to quantize language models by training full-precision weights using gradients from the quantized model. The main bottleneck is its slow convergence and early performance plateau, particularly below 4-bit-widths. While this problem has been observed in prior work, its precise cause remains unclear. In this paper, we analyze the convergence of QAT by estimating the spectrum of the loss-surface Hessians. We find that the weights converge to flat regions around saddle points, where a large fraction of the Hessian eigenvalues are both positive and negative. During training, an increasing fraction of Hessian eigenvalues concentrates around zero, whose magnitude decreases. At lower bit-widths, the magnitude of eigenvalues in the Hessian spectrum is significantly smaller. To mitigate these issues, we propose an algorithm called WinQ to accelerate QAT, which involves: (1) periodically resetting weights to the linear interpolation of full-precision and quantized weights, reducing the distance to the quantization grid and increasing eigenvalue magnitude, and (2) computing gradients of noise-injected weights to regularize the Hessian. Extensive experiments show that WinQ accelerates QAT by up to 4 times across various quantization methods and models. Under the same training cost, WinQ improves state-of-the-art sub-4-bit quantization by up to 8.8%. These results are consistent across 16 settings with different language models, quantization methods, and bit widths.
[LG-85] riOpt: A Scalable Algorithm for Linear Causal Discovery
链接: https://arxiv.org/abs/2605.17465
作者: Rafat Ashraf Joy,Elena Zheleva
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning causal relations from observational data is challenging because the graph search space grows super-exponentially with the number of variables. Ordering-based methods reduce this space by first identifying the topological ordering, whereas continuous optimization methods explore most likely regions of the space by casting DAG learning as a differentiable objective with an acyclicity constraint. Despite their conceptual appeal, both paradigms face significant scalability limitations in high-dimensional settings, restricting their practical applicability. In this work, we introduce a new formulation for linear causal discovery that tightly integrates these two paradigms to achieve substantial gains in scalability without sacrificing accuracy. Our approach, TriOpt, decomposes the problem into two efficient stages. First, it recovers the topological ordering by exploiting the Sherman-Morrison rank-1 downdate together with the additive structure of linear kernels, enabling fast and scalable ordering estimation. Second, given this ordering, we reformulate structure learning as a convex continuous optimization problem that entirely avoids the need for enforcing costly acyclicity constraints. We theoretically show that, under the true ordering, TriOpt exactly recovers the underlying linear DAG. Empirically, across synthetic, semi-synthetic, and real-world datasets, TriOpt achieves orders-of-magnitude speedups over state-of-the-art linear causal discovery methods in high-dimensional regimes, while maintaining comparable or superior accuracy.
[LG-86] ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks
链接: https://arxiv.org/abs/2605.17458
作者: Tianxiang Xu,Xiaoyan Zhu,Xin Lai,Jiayin Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Text classification models are typically trained via supervised fine-tuning (SFT). However, SFT essentially performs behavior cloning from instance-wise labels and thus fails to adequately capture relative preference relations among samples, which limits the model’s ability to shape decision boundaries and calibrate predictive confidence. In this paper, we propose ClaHF, a human feedback-inspired reinforcement learning (RL) framework for text classification that integrates preference modeling and RL optimization into the classification pipeline without requiring additional human annotations. Unlike prior work that relies solely on instance-wise supervision, ClaHF constructs multiple candidate predictions together with their relative ranking relations, and jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model (RM). This design converts conventional label supervision into preference signals that are directly applicable to policy optimization. We conduct systematic evaluations on eight classification tasks spanning three categories of scenarios. Results demonstrate that ClaHF consistently improves both classification performance and confidence calibration across diverse language models (LMs). The data and code are available at this https URL.
[LG-87] A semantic mutation metric for metamorphic relation adequacy in scientific computing programs
链接: https://arxiv.org/abs/2605.17437
作者: Meng Li(1,2,3),Xiaohua Yang(1,2,3),Jie Liu(1,2,3),Shiyu Yan(1,2,3) ((1) School of Computing, University of South China, Hengyang, 421001, China (2) Hunan Engineering Research Center of Software Evaluation and Testing for Intellectual Equipment, Hengyang, 421001, China (3) CNNC Key Laboratory on High Trusted Computing, Hengyang, 421001, China)
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Submitted to Information and Software Technology (IST), Elsevier. Manuscript: 93 pages in elsarticle review mode (12pt double-spaced, ~28-35 pp typeset). Supplementary code and 12-PUT pool at this https URL
Abstract:Context. Metamorphic Testing addresses the test-oracle problem in scientific computing, but classical Mutation Score operates on syntactic AST mutations and misses domain semantics. Objective. We propose the Semantic Mutation Score (SMS), built on five domain-semantic operators (Conservation Erosion, Operator Substitution, Hyperparameter, Trajectory Flip, Structural Injection). SMS degenerates almost everywhere to MS in a characterised limit, so any SMS-based conclusion remains consistent with prior mutation-testing literature in the classical regime. Method. A 12-PUT x 5-MP design over four single-output float-to-float classes (numeric, probabilistic, surrogate, machine-learning) is paired with a three-layer attribution classifier separating true semantic faults from tolerance, OOD, statistical, and artefact categories. A same-source / cross-source ablation under an identical prompt isolates the LLM-source-diversity contribution. LLM-generated mutants are compared against a default-configuration cosmic-ray syntactic pool at the AST-normalised level. Results. The pre-registered large-effect threshold for Cliff’s delta is not met under the point-estimate criterion; the observed effect lies in the medium-effect range. Cross-source pooling under an identical prompt does not appreciably shift delta, indicating that LLM identity is not the lever within this design. AST-level overlap between LLM-generated and default cosmic-ray syntactic mutants is small; the Hyperparameter, Structural Injection, and Trajectory Flip classes are unreachable under default first-order syntactic configurations. Conclusion. SMS is a backward-compatible adequacy metric for domain-semantic metamorphic-relation sets in scientific computing. The first-order unreachability evidence is independent of the effect-size question. Comments: Submitted to Information and Software Technology (IST), Elsevier. Manuscript: 93 pages in elsarticle review mode (12pt double-spaced, ~28-35 pp typeset). Supplementary code and 12-PUT pool at this https URL Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG) ACMclasses: D.2.5; D.2.8 Cite as: arXiv:2605.17437 [cs.SE] (or arXiv:2605.17437v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.17437 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Meng Li [view email] [v1] Sun, 17 May 2026 13:16:13 UTC (642 KB) Full-text links: Access Paper: View a PDF of the paper titled A semantic mutation metric for metamorphic relation adequacy in scientific computing programs, by Meng Li (1 and 21 other authorsView PDFTeX Source view license Current browse context: cs.SE prev | next new | recent | 2026-05 Change to browse by: cs cs.LG References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[LG-88] DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models
链接: https://arxiv.org/abs/2605.17432
作者: Haichao Sha,Zihao Wang,Yuncheng Wu,Hong Chen,Wei Dong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Large language models (LLMs) are commonly adapted to downstream tasks through fine-tuning, but fine-tuning data often contains sensitive information that may be leaked by the resulting model. Differential privacy (DP) offers formal protection against such leakage, yet DP fine-tuning of LLMs still suffers from substantial utility degradation due to gradient clipping and noise injection. Existing work improves this trade-off by combining DP with parameter-efficient fine-tuning methods such as LoRA, which constrain the form of updates. In this work, we study a complementary direction: selective fine-tuning, which constrains where updates are applied. We propose DP-SelFT, a framework for differentially private selective fine-tuning of LLMs. DP-SelFT addresses three DP-specific challenges in parameter selection: avoiding repeated privacy cost, improving stability under noisy estimates, and selecting parameters that remain useful under clipped and noisy updates. It first constructs a lightweight DP synthetic dataset and performs selection only on this synthetic data, so the selection stage incurs no additional privacy cost. It then conducts layer-level selection by temporarily training candidate layer subsets on a synthetic training split and evaluating them on a synthetic validation split. Crucially, this temporary training is performed under a perturbation regime matched to downstream DP fine-tuning, with worst-case perturbations of the same scale as DP noise. This favors layer subsets that are not only learnable but also robust to noisy private updates. Experiments on benchmark tasks show that DP-SelFT consistently improves the privacy–utility trade-off over existing DP fine-tuning baselines under the same privacy guarantees. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2605.17432 [cs.LG] (or arXiv:2605.17432v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17432 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-89] Self-Supervised Learning for Sparse Matrix Reordering DASFAA2026
链接: https://arxiv.org/abs/2605.17403
作者: Ziwei Li,Tao Yuan,Fangfang Liu,Shuzi Niu,Huiyuan Li,Wenjia Wu
类目: Machine Learning (cs.LG)
*备注: Accepted by DASFAA 2026
Abstract:Rearranging the rows or columns of a sparse matrix using an appropriate ordering can significantly reduce fill-ins, i.e., new nonzeros introduced during matrix factorization, decreasing memory usage and runtime. However, finding an ordering that minimizes fill-ins is NP-complete. Existing approaches, including graph-theoretic and deep learning methods, rely on surrogate objectives without theoretical guarantees. The Fill-Path Theorem reveals a direct and intrinsic relationship between fill-in generation and the sparse structure of the matrix as path triplet inequalities. Here we first employ a multigrid graph network to capture structural information for each vertex. We then derive a triplet sampling strategy based on inequalities. Finally, we introduce an end-max chain loss function to reduce the number of triplets whose predicted scores satisfy these inequalities. Experimental evaluations on the publicly available SuiteSparse matrix collection demonstrate the superiority of the proposed method in terms of both fill-in reduction and speedup in LU factorization time.
[LG-90] NOETHER: A Constructive Framework for Metamorphic Pattern Discovery from Operator Algebras
链接: https://arxiv.org/abs/2605.17390
作者: Meng Li(1,2,3),Xiaohua Yang(1,2,3),Jie Liu(1,2,3),Shiyu Yan(1,2,3) ((1) School of Computing, University of South China, Hengyang, 421001, China (2) Hunan Engineering Research Center of Software Evaluation and Testing for Intellectual Equipment, Hengyang, 421001, China (3) CNNC Key Laboratory on High Trusted Computing, Hengyang, 421001, China)
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 71 pages, 18 tables, 1 figure. Under review at ACM Transactions on Software Engineering and Methodology. Supplementary materials (algorithm reference implementation, 84-MR PWR corpus, SE(3) case study harness, three-tier METRIC+ replication) at this https URL
Abstract:Context. Metamorphic Testing is recognised in IEEE/ISO software-testing standards and increasingly recommended for AI systems, but its progress is bottlenecked by metamorphic relation (MR) identification: existing approaches (structured frameworks, mining and evolutionary pipelines, LLM-assisted methods, MetaPattern catalogues) share an inductive grounding that leaves three foundational questions open: origin, closure, and transferability. Objective. We propose a framework whose downstream step from program-induced operator algebra to MetaPattern set is mechanical and provable, while the upstream curation of the algebra is a stated empirical hypothesis with explicit scope precondition. Method. NOETHER is a two-layer framework. The upstream layer is an eight-block decomposition over recurrent mathematical structures (symmetry, order, self-adjoint, time-reversal, limit, qualitative-dynamics, method-comparison, relational equivalence). The downstream CONSTRUCT-MP algorithm produces a MetaPattern set with algebraic-closure (Theorem 1) and polynomial-time decidability (Theorem 2) guarantees. We test the framework on three operator-algebraic domains. Results. On Boltzmann reactor physics NOETHER systematises a prior inductive catalogue; on equivariant ML it derives executable MRs for rotation invariance, adjoint duality, and training-trajectory reversibility; on relational query optimisers it exercises the relational-equivalence block. The central falsifiable prediction (L*-blindness on homogeneity-preserving mutators) holds on the in-scope substrate. The absolute-completeness conjecture (Theorem 1’) is falsified on PWR core diffusion via two pairwise-independent counterexamples that identify five Translate-extension dimensions. Conclusion. Induction is relocated from per-program MR sampling to a per-domain algebraic layer; the downstream step is deductive and mechanical. Comments: 71 pages, 18 tables, 1 figure. Under review at ACM Transactions on Software Engineering and Methodology. Supplementary materials (algorithm reference implementation, 84-MR PWR corpus, SE(3) case study harness, three-tier METRIC+ replication) at this https URL Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Logic in Computer Science (cs.LO) MSC classes: 68N30 ACMclasses: D.2.4; D.2.5 Cite as: arXiv:2605.17390 [cs.SE] (or arXiv:2605.17390v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.17390 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Meng Li [view email] [v1] Sun, 17 May 2026 11:18:35 UTC (125 KB) Full-text links: Access Paper: View a PDF of the paper titled NOETHER: A Constructive Framework for Metamorphic Pattern Discovery from Operator Algebras, by Meng Li (1 and 21 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.SE prev | next new | recent | 2026-05 Change to browse by: cs cs.LG cs.LO References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[LG-91] Learning Fill-in Reduction Ordering via Graph Policy Optimization for Sparse Matrices ICASSP2026
链接: https://arxiv.org/abs/2605.17362
作者: Ziwei Li,Shuzi Niu,Huiyuan Li,Tao Yuan,Wenjia Wu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICASSP 2026
Abstract:Matrix reordering in large sparse solvers seeks a permutation that minimizes factorization fill-in to reduce memory and computation. Because the minimum fill-in ordering problem is NP-complete and fill-in is implicit in the sparsity pattern, graph-theoretic heuristics are used. Existing reinforcement learning methods either ignore sparsity patterns–missing the global fill-in–or lack local exact fill-in feedback. We propose a graph policy optimization method, modeling fill-ins from global and local views: both the policy and value networks use a multi-hop graph neural backbone to embed global fill-in; the policy further interacts with symbolic factorization over graphs to extract local, step-level fill-ins, and the resulting feedback is aligned with the value network via an adaptive saturation function to improve convergence. On the SuiteSparse Matrix Collection, our method achieves mean reductions of 29.3 in fill-ins and 31.3 in peak memory usage over state-of-the-art baselines.
[LG-92] Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density ICML2026
链接: https://arxiv.org/abs/2605.17340
作者: Jingru Fei,Kun Yi,Alex Xing Wang,Qingsong Wen,Xiangxiang Zhu,Wei Fan
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026
Abstract:Time series foundation models rely on large-scale pretraining over diverse datasets across domains, yet their heterogeneity in temporal patterns could hinder the effectiveness of training and learning transferable time series representations. Inspired a fundamental concept, normalized power spectral density (PSD) in signal processing, we assume harmonizing datasets via PSDs in the spectral domain could reduce mismatches and enhance pretraining. We then go beyond the direct intractable minimization optimization and innovatively reformulate it as a principled harmonization approach. Specifically, we propose Harmonizer, a module that reshapes spectral structures and implicitly harmonizing PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Our theoretical analysis further reveals token interactions with Harmonizer can be efficiently mediated by a compact set of resonators, motivating a HarmonicAttention design that performs self-attention in a low-dimensional interaction space. Then, we propose Olivia, a novel time series foundation model built upon these harmonization mechanisms. Extensive experiments on two large-scale benchmarks (TSLib and GIFT-Eval) and extra 6 datasets from GluonTS, demonstrate Olivia consistently achieves state-of-the-art performance under zero-shot, few-shot, and full-shot forecasting scenarios. Our code is available at \urlthis https URL.
[LG-93] Bridging the Gap between Sparse Matrix Reordering and Factorization: A Deep Learning Framework for Fill-in Reduction DASFAA2025
链接: https://arxiv.org/abs/2605.17339
作者: Ziwei Li,Tao Yuan,Shuzi Niu,Huiyuan Li
类目: Machine Learning (cs.LG)
*备注: Accepted by DASFAA 2025
Abstract:Sparse matrix reordering can significantly reduce the fill-in during matrix factorization, thereby decreasing the computational and storage requirements in sparse matrix computations. Finding a minimal fill-in ordering is known to be an NP-hard problem. Moreover, there is a paradox: matrix reordering is applied before matrix factorization, but fill-ins that matrix reordering methods aim at are generated from matrix factorization. To bridge the gap between reordering and factorization, we propose a deep learning framework to minimize a fill-in surrogate function based on spectral embedding. First, we employ a multi-grid-like GNN architecture to learn to approximate the smallest eigenvectors of its graph Laplacian matrix, i.e. spectral embedding, and capture the global structural information of the matrix. Then, another multi-grid-like GNN architecture is used to minimize the potential space where fill-in can occur based on the rank distribution. Experimental results indicate that our approach achieves competitive performance compared with traditional graph-theoretic algorithms and deep learning methods.
[LG-94] Leverag ing Error Diversity in Group Rollouts for Reinforcement Learning
链接: https://arxiv.org/abs/2605.17333
作者: Wenpu Liu,Yuqi Xu,Weichu Xie,Yongfu Zhu,Shuai Dong,Ziyue Wang,Wenqi Shao,Xiaoying Zhang,Tong Yang,Nan Duan,Jiaqi Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) typically samples multiple responses per prompt and assigns binary rewards based on individual correctness, yet the collective structure of the group output, specifically the distribution of errors, is largely discarded. We identify this as a missed opportunity: empirical analysis reveals that error diversity within a group is a strong predictor of training success, with problems eliciting diverse wrong answers benefiting substantially more from RLVR than those producing homogeneous failures. Motivated by this observation, we propose Error Diversity Advantage Shaping (EDAS), a lightweight, algorithm-agnostic technique that modulates the advantage signal for incorrect rollouts based on intra-group error diversity. EDAS amplifies penalties for dominant, repeated errors and attenuates penalties for rare, exploratory ones, thereby encouraging the model to maintain diverse reasoning paths and discouraging error perseveration. Crucially, EDAS operates as a simple post-hoc adjustment that can be seamlessly integrated into any RLVR algorithm. We validate EDAS on top of several mainstream RLVR methods across a series of models and seven challenging math benchmarks, demonstrating consistent improvements. Notably, EDAS yields an average improvement of 6.29 points over DAPO on Qwen3-8B across seven benchmarks, confirming that exploiting the latent information in group rollouts is a broadly effective strategy for strengthening RLVR.
[LG-95] Step-wise Rubric Rewards for LLM Reasoning
链接: https://arxiv.org/abs/2605.17291
作者: Weichu Xie,Haozhe Zhao,Wenpu Liu,Yongfu Zhu,Liang Chen,Minghao Ye,Zirong Chen,Yuqi Xu,Shuai Dong,Ziyue Wang,Xinbo Xu,Kean Shi,Ruoyu Wu,Xiaoying Zhang,Wenqi Shao,Baobao Chang,Nan Duan,Jiaqi Wang
类目: Machine Learning (cs.LG)
*备注: Code available at this https URL
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.
[LG-96] Calibeating for general proper losses: A Bregman divergence approach
链接: https://arxiv.org/abs/2605.17269
作者: Maximilian Fichtl,Cristóbal Guzmán,Nishant A. Mehta
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages
Abstract:This work introduces a general framework for calibeating based on regret minimization. As compared to Foster and Hart’s seminal calibeating work which had specialized treatments of Brier score (squared loss) and log loss, we consider a large family of proper losses that includes \alpha -Tsallis losses (for \alpha \in [1, 2] ) and Lipschitz losses. Our results for Tsallis losses also hold for an unscaled version of Tsallis loss that recovers log loss. Our analysis is oriented around the Bregman divergence view of a proper loss. Technically, our results for the family of Tsallis losses that we consider are U-calibration results, simultaneously obtaining logarithmic regret for all losses in this family while having a weaker dependence on the dimension compared to previous results. Of potential independent interest, we also show a new regret equality for the regret of Be The Regularized Leader. This regret equality holds for general proper losses and itself is based on two results related to online updating formulas for the generalized variance, the latter being a previously introduced generalization of variance based on Bregman divergences.
[LG-97] When Molecular Similarity Works: Property Cliffs Reveal Hidden Errors
链接: https://arxiv.org/abs/2605.17265
作者: Di Hu,Kun Li,Haojie Rao,Longtao Hu,Jiameng Chen,Wenbin Hu,Yizhen Zheng,Jiajun Yu,Duanhua Cao
类目: Machine Learning (cs.LG)
*备注: Preprint, 22 pages, 10 figures, 11 tables. Di Hu and Kun Li contributed equally
Abstract:Accurate prediction of molecular properties underpins drug discovery and material design, yet even state-of-the-art models remain vulnerable to localized failure modes that aggregate metrics cannot detect. The places where molecular similarity should be most helpful are also places where standard evaluation can be most misleading. Property cliffs expose this gap: structurally similar molecules can still differ sharply in target property, so models with competitive overall performance may fail in high-risk local neighborhoods. To expose and mitigate this failure mode, CliffSplit, a cliff-aware evaluation protocol that constructs locally supported, cliff-exposed test cases, and CliffLoss, a model-agnostic train-only mitigation mechanism for cliff-sensitive errors, are introduced. Experiments on three QM9 targets and three MoleculeNet tasks across five backbones show that CliffSplit reveals at least 15% higher error in cliff-heavy QM9 regions, while CliffLoss reduces the cliff-to-smooth error gap by up to 30% on Lipophilicity and improves overall MAE by 9.7%. Together, these results turn molecular similarity failure from a descriptive anomaly into a benchmarked evaluation problem for molecular machine learning. The code is available at this https URL.
[LG-98] Iterative Chow Filtering for Learning with Distribution Shift
链接: https://arxiv.org/abs/2605.17251
作者: Gautam Chandrasekaran,Georgios Gkrinias,Adam R. Klivans,Konstantinos Stavropoulos,Arsen Vasilyan
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 30 pages
Abstract:Recent work due to Goel et al. gave the first efficient algorithms for learning with distribution shift in the challenging PQ framework. In this setting, a learner receives labeled training examples, unlabeled test examples, and must make correct predictions on the test set but is allowed to abstain from predicting on out-of-distribution points. Their results rely on \cal L_2 sandwiching approximations, a strong requirement that leads to poor bounds for several basic function classes such as DNF formulas. Here, we show that the weaker notion of \cal L_1 sandwiching suffices for efficient PQ learning. As a consequence, we obtain the first quasipolynomial-time PQ learning algorithm for DNFs under the uniform distribution and essentially match the guarantees known for ordinary PAC learning. More broadly, our bounds provide exponential improvements for several classes including constant depth circuits and constant degree polynomial threshold functions. Our main technical ingredient is Iterative Chow Filtering, a new procedure that uses low-degree Chow parameters to identify and remove test points incompatible with the training distribution. Comments: 30 pages Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2605.17251 [cs.DS] (or arXiv:2605.17251v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2605.17251 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-99] owards Principled Test-Time Adaptation for Time Series Forecasting
链接: https://arxiv.org/abs/2605.17250
作者: Haochun Wang,Ruichen Xu,Georgios Kementzidis,Karen Cho,Sebastian Ramirez Villarreal,Yuefan Deng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Test-time adaptation (TTA) has recently emerged as a promising approach for improving time series forecasting (TSF) under distribution shift. Existing TSF-TTA methods differ in how they utilize revealed targets, yet the resulting adaptation protocols remain heterogeneous and lack a clearly unified formulation. To address this issue, we revisit TSF-TTA from the perspective of protocol cleanliness and propose an adaptation protocol based solely on matured ground truth, yielding a more principled setting for adaptation. Under this protocol, we further diagnose existing adapters in the frequency domain and find that their prediction corrections often exhibit limited and weakly structured spectral modifications. Motivated by this diagnosis, we propose Frequency-Aware Calibration (FAC), a lightweight calibration method that directly parameterizes prediction corrections in the frequency domain. Across diverse datasets, forecasting horizons, and source forecasters, FAC achieves competitive and consistent performance while requiring substantially fewer trainable parameters than the compared TSF-TTA adapters.
[LG-100] An Efficient Machine Learning-based Framework for Detection and Prevention of Frauds in Telecom Networks
链接: https://arxiv.org/abs/2605.17245
作者: Praveen Hegde,Mishal Shah
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Peer-reviewed and presented at 2025 International Conference on Advancement in Communication and Computing Technology (INOACC-2025); self-published by the author due to a sustained 13-month indexing delay by the organizers. Contains 7 pages and 7 figures
Abstract:Telecommunication fraud is an acute problem that leads to substantial material losses and compromises the reliability of telecom systems worldwide. Only effective and efficient detection mechanisms can help to deal with these threats, though there are certain shifts in the approaches to fraud detection. This paper evaluates the performance of AI-driven models for fraud detection in telecommunication networks using Call Detail Record (CDR) datasets. This study focuses on fraud detection in telecom networks using the Telecom CDR dataset, which contains 101,174 customer records with 17 attributes, including 8,830 fraud cases. In feature preprocessing, missing values were dealt with, followed by data scaling using Min-Max scaling and data balancing using the SMOTE technique. The dataset was trained for predictive analysis using Random Forest (RF) and XGBoost models. F1-score, ROC AUC, recall, accuracy, time, and precision were used as indicators with which to compare performance of the two models. RF recorded a high level of accuracy at 99.9% while XGBoost at 99.7%. Findings show that the suggested framework successfully detects fraud with few misclassifications. Several machine learning models were evaluated and contrasted, such as RF, XGBoost, DBSCAN, RoBERTa, and K-means. Among all the models, RF was seen to give the highest performance with an accuracy of 99.9% and precision of 99.9%, recall of 99.9% and F1-score of 99.9%, XGBoost, GNN and BERT. The findings emphasize RF as the most effective model for detecting fraudulent activities in telecom networks, ensuring robust and reliable prevention of fraud.
[LG-101] Learning in Position-Aware Multinomial Logit Bandits: From Multiplicative to General Position Effects
链接: https://arxiv.org/abs/2605.17238
作者: Xi Chen,Shibo Dai,Jiameng Lyu,Yuan Zhou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the dynamic joint assortment selection and positioning problem, where the attraction of each product depends on both its intrinsic appeal and its display position under a Multinomial Logit (MNL) choice framework. Our study ranges from the multiplicative position effects model, in which each product’s attraction is scaled by a position-specific factor, to a general position effects model assigning independent attraction parameters to every product–position pair to capture heterogeneous synergies. For both models, we design round-based learning algorithms that update decisions after every single feedback, and establish the first regret-optimal characterization. Besides, our round-based algorithms provide the prompt operations needed by modern platforms. For the multiplicative model, we develop a cross-position pairwise maximum likelihood estimator with a clipping mechanism, and prove that our algorithm P2MLE-UCB attains a regret of \tildeO(\sqrtNT) , matching the lower bound and closing the \sqrtK gap left by prior epoch-based analyses. For the general model, we establish a minimax lower bound and propose GP2-UCB with a matching upper bound. Moreover, we design an efficient subroutine for the per-round joint assortment and positioning optimization based on Dinkelbach’s method and maximum-weight bipartite matching. Numerical experiments on synthetic data and the Expedia dataset show that our algorithms consistently outperform state-of-the-art benchmarks.
[LG-102] Active Budget Allocation for Efficient Scaling Law Estimation via Surrogate-Guided Pruning ICML2026
链接: https://arxiv.org/abs/2605.17234
作者: Viktoria Schram,Markus Hiller,Daniel Beck,Trevor Cohn
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2026
Abstract:Predicting model performance at larger scales enables the design of training strategies and architectures tailored to specific performance targets. Empirical scaling law research identifies functional forms to aid this prediction task. These describe the relationship between loss and compute using a loss-compute frontier defined by learning curves. Due to the empirical nature of this approach, the computational burden is substantial, making strategic resource allocation essential - yet it remains surprisingly underexplored. In this work, we address this shortcoming by exploring the suitability of Successive Halving (SH) and SH combined with parametric and non-parametric surrogate models. In addition to enabling a more systematic allocation of a given compute budget, our findings show that SH paired with surrogate models yields a set of learning curves that includes one with a lower loss-compute value than what naive uniform allocation or an SH-only approach can obtain. Our experiments demonstrate mean relative improvements of up to 2.84% and 5.47% on real-world and synthetic learning curve datasets. This strategic resource allocation enables us to obtain accurate scaling laws at significantly reduced computational costs, saving up to 98.7% over the traditional exhaustive approach.
[LG-103] Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space
链接: https://arxiv.org/abs/2605.17232
作者: Kelvin Kan,Xingjian Li,Benjamin J. Zhang,Tuhin Sahai,Stanley Osher,Markos A. Katsoulakis
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:Discrete diffusion has become a leading framework for generative modeling in various applications including language, vision, and biology. Existing convergence theory, however, exhibits fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while bounds in total variation (TV) depend on the state space size S and become vacuous for modern language tasks, where vocabularies contain hundreds of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely free of S and applicable to both masked and uniform priors. Importantly, our theory relies only on a single standard rate-matrix regularity assumption and is compatible with time-inhomogeneous schedules. Four novel techniques drive our improvements: working in the space of observables via adjoint equations rather than directly with probability measures, a regularity analysis that yields bounds on any IPM, a coupling argument that removes S -dependence under uniform transitions, and a score-marginal cancellation technique that removes S -dependence under masked transitions. Our framework thus sharply departs from prior analyses and avoids the shortcomings of pathspace-KL and existing TV-based approaches. Beyond convergence bounds, our framework provides a versatile toolkit for further theoretical study of discrete diffusion models. Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML) Cite as: arXiv:2605.17232 [cs.LG] (or arXiv:2605.17232v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.17232 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-104] Anytime and Difficulty-Adaptive PAC-Bayes for Constrained Density-Ratio Network with Continual Learning Guarantees
链接: https://arxiv.org/abs/2605.17212
作者: Paulo Akira F. Enabe
类目: Machine Learning (cs.LG)
*备注:
Abstract:A unified framework for learning under covariate shift is presented, in which a constrained density-ratio network approximates the Radon-Nikodym derivative r^\star = dP/dQ from source Q to target P , supports an importance-weighted empirical risk, and feeds an anytime PAC-Bayes generalization certificate. A change-of-measure identity decomposes the gap between target risk and importance-weighted source risk into a ratio-bias term, controlled by the L^2(Q) closeness of the learned ratio to r^\star , and a generalization-gap term, controlled by the variability of the weighted loss. Three structural identities of a Radon-Nikodym derivative, normalization, moment matching, and a second-moment penalty controlling the effective sample size, are imposed as hard integral constraints through an augmented-Lagrangian scheme. PAC-Bayes is then instantiated on the weighted risk in a fixed-time regime that yields Bernoulli-KL bounds, a KL-regularized objective whose minimizer is the network-weighted Gibbs posterior, and a stability statement on L^2(Q) perturbations of the learned ratio, and in an anytime regime that builds a time-uniform certificate by geometric peeling across epochs. A pre-registered two-campaign protocol combining a patch test against analytic ground truth with a real-data deployment under intrinsic distribution shift validates the framework. The network produces a calibrated covariate ratio on real data, reduces the target 0/1 loss relative to unweighted empirical risk minimization and to classical direct ratio-estimation baselines, and attains the anytime certificate as the construction promises. A single pre-registered failure of the fixed-time coverage claim is recorded, with per-split coverage aligning one-to-one with the magnitude of the label shift, confirming that the covariate-only assumption is operationally tight rather than a defect of the certificate.
[LG-105] Filter-then-Verify: A Multiphase GNN and ModernBERT Framework for Social Engineering Detection in Email Networks
链接: https://arxiv.org/abs/2605.17201
作者: Barsat Khadka,Prasant Koirala,Kshitiz Neupane,Nick Rahimi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Under review at Elseiver’s Computer and security journal
Abstract:Social engineering attacks exploit human trust rather than software vulnerabilities, making them difficult to detect using conventional filters. We propose a two-stage filter-then-verify framework combining inductive Graph Neural Networks (GNNs) for structural anomaly detection with a co-attention ModernBERT model for content verification. The GNN identifies anomalous sender-receiver patterns, while BERT analyzes message context to reduce false positives. Using the Enron dataset augmented with realistic synthetic campaigns, we show that the framework achieves 86% recall in structural filtering and over 92% precision after BERT refinement, effectively detecting both external attacks and insider threats. Our results demonstrate that combining structural and content analysis allows practical, scalable detection of multi-stage social engineering attacks in email networks.
[LG-106] he Geometry of Projection Heads: Conditioning Invariance and Collapse ICML2026
链接: https://arxiv.org/abs/2605.17180
作者: Faris Chaudhry
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted at ICML 2026. 29 pages, 8 figures, 7 tables
Abstract:We develop a geometric theory of projection heads in self-supervised learning by modeling the head as a trainable Riemannian metric on the backbone representation manifold. We show that linear heads perform implicit subspace whitening, while nonlinear heads adapt local metrics to satisfy the specific topological constraints of the loss, with head depth empirically dictating this capacity. Analyzing dimensional collapse, we prove that smooth nonlinear heads natively induce negative eigenvalues in the Hessian at collapsed equilibria, making them unstable. We empirically validate this by continuously tracking the optimization geometry during training, which reveals that smooth activations like Swish can generate explicit negative curvature to escape collapse, whereas linear and ReLU heads under continuous-time gradient flow cannot, relying instead on discrete-time optimization dynamics and BatchNorm. Finally, we geometrically characterize how metric degeneracy governs the information-invariance trade-off, explaining why the head must be discarded. Evaluated across contrastive and decorrelation-based objectives on foundation models, our results demonstrate that the projection head acts as a universal geometric buffer, decoupling the semantic backbone from the rigid, destructive constraints of the pretraining objective.
[LG-107] riAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agent ic Inference Tasks
链接: https://arxiv.org/abs/2605.17170
作者: Hanzhang Shen,Haoran Wu,Yiren Zhao,Robert Mullins
类目: Machine Learning (cs.LG)
*备注:
Abstract:Agentic workloads have emerged as a major workload for LLM inference. They differ significantly from chat-only workloads, requiring long-context processing, the ability to handle multimodal inputs, and structured multi-turn interactions with tool calling capabilities. As a result, their context exhibits structure that can carry different importance along three key axes: temporal recency to the current turn, modality such as text or image tokens, and semantic role such as user queries, tool calls, observations, or reasoning. These axes capture distinct token behaviors and lead to different sensitivities to KV-cache compression. However, existing KV-cache quantization methods are typically homogeneous or exploit only heterogeneity on a single dimension, such as temporal proximity or modality, overlooking the interactions among them. To this end, we introduce TriAxialKV, a novel mixed-precision KV-cache quantization scheme that assigns each token a triaxial tag, calibrates per-tag sensitivity, and allocates INT2/INT4 bitwidths under a fixed memory budget. We implement TriAxialKV as an end-to-end serving system, comprising calibration, mixed-precision quantization and memory management, and custom fused Triton decode kernels. When using Qwen3-VL-32B-Thinking as a computer-use agent operating the OSWorld, TriAxialKV matches the accuracy of SGLang with BF16 KV cache while supporting 4.5 \times KV cache size and achieving 30% higher end-to-end throughput, when running on real GPU systems.
[LG-108] Stress-Testing Neural Network Verifiers with Provably Robust Instances
链接: https://arxiv.org/abs/2605.17153
作者: David Troxell,Yulia Alexandr,Sofia Hunt,Stephanie Lei,Guido Montúfar
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Optimization and Control (math.OC)
*备注:
Abstract:Neural network verifiers aim to provide formal guarantees on model behavior, but existing verification benchmarks are fundamentally limited by their lack of ground-truth labels. As a result, verifier evaluation relies on indirect heuristics, which prevents exact scoring and systematic study of verifier failure modes. We address this gap by introducing a reusable framework for generating verification instances whose ground-truth robustness labels are known a priori through analytic construction. Our framework led to the discovery of multiple numeric tolerance concerns and an implementation bug in popular verifiers, highlighting the need for ground-truth labels. Additionally, to systematically study verifier failure modes, we introduce the verification Difficulty Profile, a collection of estimable quantities capturing distinct sources of instance hardness. Using our framework and these profiles, we evaluate five state-of-the-art verifiers and show that different instances stress distinct aspects of the verification pipeline. We show that these results can aid the future development of verifiers as they provide actionable targets for improving numerical reliability, relaxation quality, and search behavior. Our code is publicly available: this https URL.
[LG-109] An Analytical Multiple Criteria Framework for Temporal and Dynamic Business-to-Business Customer Segmentation in Manufacturing
链接: https://arxiv.org/abs/2605.17151
作者: Muhammad Raees,Konstantinos Papangelis,Vassilis Javed Khan
类目: Machine Learning (cs.LG)
*备注:
Abstract:In sales and marketing, customer segmentation is an important tool for formulating strategies for customer treatment and supply chain management. Most segmentation implementations rely on limited criteria, such as recency, frequency, and monetary (RFM) modeling, which often fail to capture complex business interactions. In this work, we design and evaluate a dynamic multi-criteria decision-making (MCDM) method in a business-to-business (B2B) manufacturing context by 1) extending RFM to dimensions of stability and growth, 2) integrating an adaptive and analytical hierarchical process to match business objectives, and 3) evaluating multivariate time-series clustering models. We then measure customer stability, tracking between-segment transitions, and volatility over time, and apply a graph-based consensus model to further strengthen the analysis. We test the efficacy of the proposed method using a real-world manufacturing company dataset to segment more than 3,000 B2B customers, showing strong robustness to temporal shifts. The implementation enables domain experts with preferential analytics to devise their strategies, providing effective decision support for B2B customer segmentation.
[LG-110] Weighted Flow Matching and Physics-Informed Nonlinear Filtering for Parameter Estimation in Digital Twins
链接: https://arxiv.org/abs/2605.17146
作者: Yasar Yanik,Himadri Basu,Ricardo G. Sanfelice,Daniele Venturi
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 14 pages, 5 figures
Abstract:Digital twins (DTs) rely on continuous synchronization between physical systems and their virtual counterparts through online parameter estimation under uncertainty. In many practical settings, however, this task is challenged by low observability, weak excitation, nonlinear dynamics, and noisy or biased measurements. In this work, we develop a new mathematical framework that integrates Weighted Flow Matching (WFM) generative modeling with physics-informed nonlinear filtering to enhance parameter estimation in DTs. WFM relies on dynamic reweighting of training samples, which guides the generative model toward parameter regimes most informative of the evolving system state. This generative component is tightly coupled with a physics-informed filtering architecture based on the Unscented Kalman Filter (UKF), yielding a unified DT framework that combines data-driven probability transport with physically consistent state and parameter estimation. The effectiveness of the new integrated framework is demonstrated within a spacecraft DT architecture, where stable moment of inertia estimation is achieved under uncertain and noisy sensing, with significant performance improvements over established approaches such as Extended Kalman Filtering (EKF) and Ensemble Kalman Filtering (EnKF). These results highlight the potential of weighted generative modeling as a core mechanism for real-time DT synchronization in operational and mission-critical systems.
[LG-111] Differentiable Optimization Layers for Guaranteed Fairness in Deep Learning ICML
链接: https://arxiv.org/abs/2605.17118
作者: David Troxell,Noah Roemer,Guido Montúfar
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: To be published in International Conference on Machine Learning (ICML), 2026
Abstract:Differentiable optimization layers are traditionally integrated in predict-then-optimize frameworks where a neural model estimates parameters that subsequently serve as fixed inputs to downstream decision-making optimization problems. In this work, we introduce the concept of a “fairness layer”: a differentiable optimization layer appended to a model’s output layer that guarantees a chosen notion of output parity is satisfied when integrated into a neural network. Additionally, we introduce an online primal-dual inference algorithm that provides provable aggregate fairness guarantees for streaming predictions with arbitrarily small batch sizes, where traditional per-batch constraints become overly restrictive. Numerical experiments demonstrate the effectiveness of the fairness layer and associated algorithm, and theoretical analysis characterizes the layer’s differentiability and stability properties during model training and backpropagation. Our code for these experiments is publicly available on GitHub (this https URL) and our public Python package documentation can be found online: this https URL.
[LG-112] Parallel Recursive LSTM
链接: https://arxiv.org/abs/2605.17108
作者: Tristan Gaudreault,Yongyi Mao
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures. Code available at this https URL
Abstract:Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context settings. Recurrent models such as LSTMs provide explicit nonlinear state updates and strong state-tracking capabilities, yet their strictly sequential computation limits parallelism. We introduce the Parallel Recursive LSTM (PR-LSTM), a hierarchical recurrent architecture that replaces left-to-right recurrence with recursive nonlinear state composition over a balanced computation tree. Tokens are first mapped independently to latent states, which are then recursively merged by a learned gated composition block. This structure uses the reduction pattern underlying parallel scans as a fixed execution schedule, rather than assuming an associative recurrence. As a result, PR-LSTM retains nonlinear gated state representations while reducing recurrent parallel depth from linear to logarithmic. Empirically, PR-LSTM achieves strong sequence-length generalization on formal-language benchmarks, solving more tasks than standard RNN, LSTM, and Transformer baselines, while avoiding the quadratic scaling of attention. These results suggest that recurrent computation can be reorganized hierarchically to expose parallelism without restricting the transition dynamics to linear or associative forms.
[LG-113] Mechanism Learning: Prototype-Anchored Mechanism Inference for Scientific Forecasting
链接: https://arxiv.org/abs/2605.17091
作者: Qian Jiang,Liping Sun
类目: Machine Learning (cs.LG)
*备注:
Abstract:Scientific forecasting typically relies on direct state prediction, an approach that grows brittle under data scarcity, extended horizons, non-stationary dynamics, or high-dimensional complexity. While raw state trajectories are highly sensitive in these regimes, underlying local evolution rules often exhibit robust reusability. We introduce mechanism learning, a framework that forecasts future states by estimating the currently active local mechanism. Our method compresses local spatiotemporal fragments into mechanism descriptors, forming a data-driven, structured mechanism space where proximity reflects similar local evolution rules. To ground these estimates in observed data, we utilize prototype anchors, a set of representative mechanisms that sparsely cover the space of local rules. We evaluate this approach on Burgers dynamics, WeatherBench2, and Lorenz96. Empirically, the learned mechanism spaces resist collapse and maintain strong local consistency. Compared to direct prediction and other models including FNO, NODE, LSTM, and reservoir-family methods, our framework demonstrates predictive gains in fragile regimes: it significantly improves switching stability in Burgers dynamics and achieves state-of-the-art performance both under the scarce-data fixed-horizon WeatherBench2 protocol and in intermediate-complexity Lorenz96. Ablation studies and drift diagnostics confirm that these improvements are driven by finite prototype anchoring rather than sheer latent capacity. Together, these results establish mechanism learning as a principled, robust alternative to direct state prediction in forecasting complex systems.
[LG-114] aming Audio VAEs via Target-KL Regularization ICASSP2026
链接: https://arxiv.org/abs/2605.17085
作者: Prem Seetharaman,Rithesh Kumar
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at ICASSP 2026 (Barcelona, Spain, 3-8 May 2026). 5 pages, 1 figure, 3 tables
Abstract:Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a low frame rate continuous representation that is conducive for downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.
[LG-115] he Range Shrinks the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort USENIX-SECURITY2025
链接: https://arxiv.org/abs/2605.17062
作者: Aleksandr Churilov(Independent Researcher)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 12 pages, 3 figures, 4 tables. Replication of Spracklen et al. (USENIX Security 2025). Data and code: this https URL and this https URL
Abstract:Spracklen et al. (USENIX Security '25) showed that code-generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open-source models, creating an attack surface for slopsquatting – the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT-5.4-mini) – an order-of-magnitude compression of the inter-model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127 package names (109 on PyPI, 18 on npm) that all five evaluated models invent identically, constituting a model-agnostic supply-chain attack surface that no single-model study can reveal. We further document a Python-over-JavaScript hallucination asymmetry that inverts Spracklen’s 2024 finding, identify a Haiku-below-Sonnet inversion within the Anthropic family, and observe a Jaccard-similarity peak between DeepSeek V3.2 and GPT-5.4-mini (J = 0.343) suggestive of shared training-data origins.
[LG-116] Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning
链接: https://arxiv.org/abs/2605.17058
作者: Vivienne Huiling Wang,Tinghuai Wang,Joni Pajarinen
类目: Machine Learning (cs.LG)
*备注: 34 pages, 8 figures, 23 tables
Abstract:The combination of exponentially large action spaces, stochastic dynamics, and long-horizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.
[LG-117] Empirical evaluation of Time Series Foundation Models for Day-ahead and Imbalance Electricity Price Forecasting in Belgium
链接: https://arxiv.org/abs/2605.17045
作者: Chi Bui,Maria Margarida Mascarenhas,Arnaud Verstraeten,Hussain Kazmi
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in Time Series Foundation Models (TSFMs) promise zero-shot forecasting capabilities with minimal task-specific training. While these models have shown strong performance across generic benchmarks, their applicability in volatile, complex electricity markets remains underexplored. Addressing this gap, this study provides a systematic empirical evaluation of several TSFMs, specifically Chronos-2 and Chronos-Bolt (developed by Amazon), and TimesFM 2.5 (provided by Google), for forecasting Belgian day-ahead and imbalance electricity prices. For both considered markets, Chronos-2 in ARX mode produces the most accurate forecasts. Compared with the best ensemble prediction from other machine learning methods, Chronos-2’s Mean Absolute Error (MAE) is 5% lower for the day-ahead market. In contrast, the model yields 10% higher MAE predicting imbalance prices across all forecast horizons, except for the two-hour-ahead horizon. Moreover, we find that TSFMs exhibit genuine zero-shot forecasting skills but still struggle under extreme market conditions.
[LG-118] Privacy-Preserving Generation Fraud Detection for Distributed Photovoltaic Systems: A Solar Irradiance-Fused Federated Learning Framework
链接: https://arxiv.org/abs/2605.17039
作者: Xiaolu Chen,Chenghao Huang,Yanru Zhang,Hao Wang
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 15 pages
Abstract:The wide adoption of residential photovoltaic (PV) systems introduces new challenges for generation fraud detection (FD). Unlike traditional electricity theft detection, which focuses on electricity consumption-side behavior, PV generation fraud detection (PVG-FD) is complicated by the inherent intermittency and uncertainty of PV generation. The distributed nature of PV systems poses further challenges for centralized PVG-FD approaches due to scalability and privacy concerns. This paper develops a privacy-preserving distributed PVG-FD framework based on federated learning (FL). In this framework, a utility company manages multiple household communities, where each of which is equipped with a local detector. The framework integrates a novel detection model architecture with privacy-preserving global collaboration. Each community’s local model fuses PV generation and weather data via a co-attention mechanism to detect discrepancies critical for PVG-FD. The FL framework enables cross-community collaboration by aggregating model parameters and prototypes, leveraging global knowledge sharing with local refinement while preserving privacy. It also uses prototype alignment to address class imbalance by enhancing fraud sample representation. Extensive experiments on a real-world residential PV dataset validate the effectiveness of the developed method and demonstrate that it outperforms state-of-the-art FL methods across various scenarios. The results also show its scalability across varying community sizes and strong robustness to class imbalance.
[LG-119] Why Do Reasoning Models Lose Coverag e? The Role of Data and Forks in the Road
链接: https://arxiv.org/abs/2605.17026
作者: Ngoc-Hieu Nguyen,Parshin Shojaee,Phuc Minh Nguyen,Nan Zhang,Chandan K Reddy,Khoa D Doan,Rui Zhang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 13 figures
Abstract:Recent progress in large language models has led to the emergence of reasoning models, which have shown strong performance on complex tasks through specialized fine-tuning procedures. While these methods reliably improve pass@1 accuracy, prior works have observed that they show a coverage shrinkage behavior, where pass@k degrades relative to the base model. In this paper, we investigate the reasoning shrinkage arise under SFT-based post-training. We hypothesize that this behavior is driven by properties of the fine-tuning data, specifically related to decision points or “forks in the road” scenarios where model faces indecipherable patterns with multiple valid reasoning paths. To test this hypothesis, we design controlled case studies that simulate such decision-point settings, spanning indecipherable nodes in graph branching, and reasoning modes. By tracking post-training dynamics in these settings, we find that the shrinkage phenomenon is tightly correlated with the prevalence of decision-point scenarios in the training data. We also demonstrate that this shrinkage behavior can be partially mitigated through targeted data synthesis design of decision-points, and a more systematic diversity-encouraging decoding mechanism. Our findings identify data-centric factors as a key driver of shrinkage in reasoning models and highlight diversity-aware designs as an effective lever for controlling it.
[LG-120] Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning
链接: https://arxiv.org/abs/2605.16999
作者: Peng Cui,Boyao Yang,Jun Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning post-training has substantially improved the reasoning accuracy of vision-language models, yet the resulting policies remain poorly calibrated. Terminal correctness rewards provide no gradient that penalizes confident errors more than uncertain ones and no signal that ties confidence to the quality of visual evidence, a gap that becomes especially severe under corrupted or ambiguous inputs where models continue to report high confidence on incorrect answers. We introduce Ranking-Aware Calibration (RAC), a training-time framework that supervises confidence using two comparison signals that group-based RL already produces at no additional labeling cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean–corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training. We instantiate RAC on Qwen2.5-VL and InternVL-3.5 backbones and evaluate on six multimodal reasoning benchmarks under clean and corrupted inputs. Empirical results show that the ranking-aware loss substantially improves task accuracy by teaching the policy to discriminate between better and worse reasoning, while the pairwise corruption loss reduces calibration error under degraded inputs. Their combination achieves the best calibration across all tested backbones while improving accuracy in the majority of settings.
[LG-121] Decision-Aware Proximal Bridge Learning for Optimal Treatment Selection
链接: https://arxiv.org/abs/2605.16989
作者: Tomàs Garriga,Alejandro Almodóvar,Axel Brando,Gerard Sanz,Eduard Serrahima de Cambra,Juan Parras
类目: Machine Learning (cs.LG)
*备注:
Abstract:Individualized treatment selection with continuous actions requires accurate causal response estimation in decision-relevant regions, rather than uniformly over the entire action space. Estimating a global causal response surface and then choosing the treatment that maximizes it can therefore be suboptimal, since standard estimation objectives allocate modeling effort according to the observed treatment distribution rather than the regions that determine the optimal decision. While decision-aware approaches have been studied in unconfounded settings, this problem remains underexplored in proximal causal inference, where proxy variables and bridge functions enable identification under suitable assumptions even in the presence of hidden confounding. Despite recent progress, proximal methods have primarily focused on treatment-effect and potential-outcome estimation rather than treatment selection and optimal decision-making. To bridge this gap, we introduce a policy-targeted weighted bridge loss that emphasizes decision-relevant treatment regions while retaining global stabilization. We prove a regret bound showing that the proposed weighted bridge loss controls treatment-selection regret through a weighted ill-posedness constant. We instantiate the framework in decision-aware variants of several proximal bridge solvers, yielding practical algorithms that alternate between weighted bridge estimation, response-surface projection, policy update, and weight refinement. Empirically, we find that decision-aware weighting reduces regret across several bridge solvers, suggesting improved treatment selection in proximal settings.
[LG-122] Emulating the Forced Response of Climate Models with Flow Matching
链接: https://arxiv.org/abs/2605.16929
作者: Graham Clyne,Julia Kaltenborn,Peer Nowack,Claire Monteleoni,Anasatase Charantonis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Global climate models are essential tools to simulate past and potential future pathways of climate change, as well as associated climate impacts. Shared Socioeconomic Pathways (SSPs) describe a range of future scenarios of global economic and demographic development. These SSPs are intrinsically linked to changes in climate forcings, the external drivers, such as greenhouse gas and aerosol emissions, which in turn lead to the human impact on the energy balance of the Earth over time. These forcings are fundamental boundary conditions in climate models in order to gain insight into the potential climatic impacts of these changes described by each SSP. Running a climate model, however, is extremely computationally expensive, conflicting with the need for large ensembles of simulations for each model to give, e.g., more robust estimates in the presence of internal variability (the inherent, chaotic fluctuations within the climate system) and scenario uncertainty. Recent research has demonstrated the ability to capture climate model dynamics using machine learning when conditioned on forcings from different climatic scenarios. We here train a Deep Learning (DL) model on multiple SSPs and successfully generate scenarios unseen during training. Our emulator is validated against MESMER-M, a statistical emulator of land surface temperature. Our research demonstrates the capacity to generate such changing climate states in response to a variety of simultaneous climate forcings (e.g., carbon dioxide, methane, nitrous oxide, sulphate aerosols, and ozone). In particular, our ablation studies underline a need to include a range of different forcings to represent long-term atmospheric trends with a DL emulator.
[LG-123] ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery
链接: https://arxiv.org/abs/2605.16902
作者: Haofei Yu,Jiaxuan You,Peter Clark,Bodhisattwa Prasad Majumder,Kyle Richardson
类目: Machine Learning (cs.LG)
*备注: 12 pages
Abstract:Scientific artifacts such as models and datasets are foundations for research. With the rapid growth of platforms like HuggingFace, researchers now have access to a large number of artifacts. Yet, a key challenge remains: how can we automatically discover the state-of-the-art (SOTA) model for a given dataset by fully leveraging existing artifacts? We formalize this task as automatic SOTA discovery by modeling HuggingFace as an artifact graph, where nodes are models/datasets and edges represent evaluations. We propose ArtifactLinker, a two-stage framework: (1) ranking promising unobserved model–dataset links using Graph Neural Networks (GNNs) or graph-augmented Large Language Models (LLMs), and (2) verifying top-ranked links via coding experiments with LLM-based agents. We further introduce a benchmark named ArtifactBench with 14,053 artifacts and 51,337 relations to evaluate the performance of both stages. Results show that (1) graph structures between existing artifacts are effective for missing link prediction; (2) end-to-end ranking and verification with ArtifactLinker help discover potential SOTA results and research insights.
[LG-124] nsor Channel Equivariant Graph Neural Networks for Molecular Polarizability Prediction
链接: https://arxiv.org/abs/2605.16891
作者: Jean Philip Filling,Daniel Franzen,Michael Wand
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a tensor-channel equivariant graph neural network for direct prediction of molecular polarizability tensors. Building on the efficient PaiNN architecture, we augment the hidden representation with explicit symmetric rank-2 tensor channels aligned with the decomposition of polarizability into isotropic and anisotropic components. In contrast to approaches that construct tensor outputs only at readout, our model propagates tensor structure throughout message passing using geometrically motivated tensor bases. This yields a target-aligned architecture for tensor-valued molecular prediction. On optimized QM7-X geometries, the proposed model achieves lower full-tensor and anisotropic error than both a PaiNN-style readout baseline and a dielectric MACE baseline under matched training conditions and at nearly identical parameter count. In this controlled setting, it also outperforms MACE while remaining substantially faster at inference. Ablation studies show that the gain does not arise from increased capacity alone, but from the combination of explicit tensor propagation and a traceless target parameterization matched to the anisotropic part of the polarizability tensor. Among the tensor bases considered, the strongest results are obtained from interactions between learned directional features, indicating that these are particularly effective for modeling molecular polarizability. Rotational equivariance tests further confirm that all compared models are numerically equivariant, so the observed improvements are attributable to better learning of the target tensor itself. Overall, our results show that for structured tensor-valued targets, propagating target-aligned tensor features can outperform both readout-only tensor construction and a more general higher-order equivariant model in the present training setting. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.16891 [cs.LG] (or arXiv:2605.16891v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16891 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-125] SE-GA: Memory-Augmented Self-Evolution for GUI Agents ICML2026
链接: https://arxiv.org/abs/2605.16883
作者: Shilong Jin,Lanjun Wang,Zhuosheng Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026
Abstract:Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self-Evolving GUI Agent (SE-GA), a novel framework that integrates hierarchical memory structures with an iterative self-improvement mechanism. At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent’s foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight the superior generalization to dynamic environments. Open source code: this https URL
[LG-126] Atoms as Language: VQ-Atom: Semantic Discretization for Molecular Representation Learning ICML2026
链接: https://arxiv.org/abs/2605.16823
作者: Takayuki Kimura
类目: Machine Learning (cs.LG)
*备注: 7 pages, 6 figures. Submitted to ICML 2026 Workshop on Foundation Models for Life Sciences
Abstract:Molecular representation learning has become a central approach in AI-driven drug discovery, yet existing molecular tokenizations such as SMILES remain largely syntactic and do not naturally align with chemically meaningful substructures. In this work, we introduce VQ-Atom, a semantic discretization framework that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments. Using graph neural network embeddings and vector quantization, atoms are assigned to codebook entries representing chemically meaningful atomic contexts. These discrete tokens define a molecular language suitable for Transformer-based pretraining. We evaluate VQ-Atom in protein-ligand interaction prediction under a protein-cold split setting without relying on 3D structural information. Experimental results show that VQ-Atom consistently improves predictive performance compared to conventional tokenization approaches, suggesting that semantically grounded discretization can substantially enhance molecular representation learning. Our findings indicate that token design itself plays a critical role in enabling effective language modeling for chemistry. Comments: 7 pages, 6 figures. Submitted to ICML 2026 Workshop on Foundation Models for Life Sciences Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.16823 [cs.LG] (or arXiv:2605.16823v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16823 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-127] Universal Graph Backdoor Defense: A Feature-based Homophily Perspective
链接: https://arxiv.org/abs/2605.16815
作者: Mengting Pan,Fan Li,Chen Chen,Xiaoyang Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 17 pages, 6 figures
Abstract:Graph neural networks (GNNs) have achieved remarkable success in relational learning. However, their vulnerability to graph backdoor attacks (GBAs) poses a significant barrier to broader adoption in high-stakes applications. Despite recent advances in graph backdoor defense (GBD), existing methods primarily focus on subgraph-based GBAs, relying on the assumption that poisoned target nodes are explicitly connected to subgraph triggers. Our empirical results reveal that such structure-centric approaches fail to defend against emerging feature-based GBAs that preserve graph topology. Therefore, in this paper, we study a novel problem of universal graph backdoor defense. First, we investigate the shared effects of both attack types from a feature-based homophily perspective, which characterizes local feature consistency between nodes and their neighborhoods. Thorough theoretical and empirical analyses demonstrate that, regardless of trigger mechanisms, backdoors induced by GBAs exhibit lower feature-based homophily than clean nodes, indicating a discrepancy in local feature similarity. Motivated by this insight, we propose to leverage node-level local feature consistency, modeled by a neighbor-aware reconstruction loss, to distinguish backdoors from clean nodes. Then, a robust training strategy is developed to eliminate trigger effects while reducing noise induced by detection uncertainty. Extensive experiments demonstrate that our framework significantly degrades the attack success rate and maintains competitive clean accuracy under both subgraph-based and feature-based attacks.
[LG-128] Jacobian-Guided Anisotropic Noise Reshaping for Enhancing Representation Utility under Local Differential Privacy
链接: https://arxiv.org/abs/2605.16812
作者: Youngmok Ha,Viktor Schlegel,Yidan Sun,Anil Anthony Bharath
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:While Local Differential Privacy (LDP) serves as a foundational primitive for distributed data collection, its stringent noise injection requirement often leads to severe degradation in data utility. This degradation stems from the task-agnostic nature of conventional LDP mechanisms, which inject noise uniformly across all dimensions regardless of their relative importance to the downstream objective. To address this issue, we propose a novel approach that mitigates noise in task-relevant subspaces of the data representation. Our method identifies task-critical subspaces via the Jacobian matrix of the public downstream model, selectively attenuates noise along those dimensions, and reshapes the isotropic noise of standard LDP into an anisotropic distribution. This method preserves the uniform per-dimension privacy budget while heterogeneously modulating noise impact across dimensions, thereby substantially enhancing data utility. Furthermore, our approach generalizes to both linear and non-linear models and integrates seamlessly with existing mechanisms. Extensive experiments on CIFAR-10-C (Brightness corruption at the highest severity level 5) demonstrate that integrating our approach improves the utility of PrivUnit2 and PrivUnitG by approximately 20% at \epsilon=7.5 . The source code is available at \urlthis https URL.
[LG-129] Informative Graph Structure Learning
链接: https://arxiv.org/abs/2605.16809
作者: Shen Han,Zhiyao Zhou,Jiawei Chen,Sheng Zhou,Canghong Jin,Hai Lin,Da Zhong Li,Bingde Hu,Can Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The quality of graph-structured data is fundamental to the success of modern graph analysis techniques such as Graph Neural Networks (GNNs). However, real-world graph data is often suboptimal, suffering from issues such as noise and incomplete connections. Graph Structure Learning (GSL) has emerged as a promising technique that adaptively optimizes node connections. However, we observe that the effectiveness of GSL often comes at the cost of a dramatic expansion in edge count, resulting in significant storage and computational overhead. In this work, we reveal that this limitation stems from the prevalent use of similarity-based edge construction, which predominantly connects highly similar neighbors based on their embeddings, introducing substantial structure redundancy. To address this, we propose a novel Informative Graph Structure Learning method (InGSL), which jointly considers both similarity and diversity in edge construction by incorporating a mutual-information-guided learning strategy. Notably, InGSL serves as a plug-in module that can be seamlessly integrated into existing GSL frameworks. Through extensive experiments on six representative GSL methods, we demonstrate that InGSL achieves significant performance improvements at a reduced number of edges. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.16809 [cs.LG] (or arXiv:2605.16809v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16809 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-130] PULSE: Generative Phase Evolution for Non-Stationary Time Series Forecasting
链接: https://arxiv.org/abs/2605.16793
作者: Yangyou Liu,Zezhi Shao,Xinyu Chen,Hu Chen,Fei Wang,Yuankai Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting under non-stationarity faces a fundamental tension between capturing stable representations and adapting to distribution shifts. Existing methods implicitly rely on static historical assumptions, leading to a critical failure mode we term Phase Amnesia, where models become blind to the evolving global context. To resolve this, we formalize non-stationary dynamics through three physical hypotheses: wold decomposition, dynamical phase evolution, and heteroscedastic manifold generation. These principles inspire PULSE, a physics-informed, plug-and-play framework adopting a Disentangle–Evolve–Simulate design philosophy. Specifically, PULSE utilizes phase-anchored disentanglement to resolve optimization interference caused by dominant trends, employs a Phase Router to actively generate future trajectories, and introduces Statistic-Aware Mixup (SAM) to ensure robustness against out-of-distribution volatility. Empirically, PULSE enables a simple MLP backbone to achieve state-of-the-art or highly competitive performance across 12 real-world benchmarks. This validates that a correct physics-informed inductive bias is far more critical than raw architectural complexity for non-stationary forecasting. The code is available at: this https URL.
[LG-131] Lever: Speculative LLM Inference on Smartphones
链接: https://arxiv.org/abs/2605.16786
作者: Tuowei Wang,Fengzu Li,Yanfan Sun,Wei Gao,Ju Ren
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints. For drafting, it builds token trees using an I/O- and compute-aware gain-cost objective. For verification, it prunes low-value branches through early-exit prediction to reduce target-model computation. For execution, it maps speculation efficiently across mobile CPU-NPU hardware to improve utilization. Comprehensive evaluations show that Lever reduces inference latency by an average of 2.93x over baseline flash-offloaded inference and 1.50x over conventional speculative decoding, narrowing the latency gap between flash-backed and memory-resident LLM inference. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.16786 [cs.LG] (or arXiv:2605.16786v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16786 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-132] Propagation of Chaos in Contextual Flow Maps
链接: https://arxiv.org/abs/2605.16747
作者: Shi Chen,Zhengjiang Lin,Kaizhao Liu,Philippe Rigollet
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
*备注: 31 pages, 1 figure
Abstract:We develop a quantitative statistical theory of transformers in the large-context regime by adopting the abstraction of contextual flow maps (CFMs): dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks. Within this framework, the finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its underlying population, so that the context length n becomes a statistical resource. Exploiting the McKean–Vlasov structure of the dynamics and the classical machinery of propagation of chaos, we establish a forward bound controlling the deviation between the finite- and infinite-context CFMs uniformly along depth, and a backward bound controlling the deviation between the corresponding training trajectories uniformly across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate n^-1/d for general CFMs and parametric rate n^-1/2 for a restricted class of CFMs that includes transformers as a special case. The analysis rests on a new Eulerian adjoint formulation of the loss gradient and stability estimates for the resulting forward–adjoint system, both of which may be of independent interest.
[LG-133] ransformer-Based MCS Prediction for 5G Multicast-Broadcast Services (MBS)
链接: https://arxiv.org/abs/2605.16735
作者: Kasidis Arunruangsirilert,Jiro Katto
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 2026 IEEE 104th Vehicular Technology Conference (VTC2026-Fall), 6-9 September 2026, Boston, Massachusetts, USA
Abstract:The deployment of 5G Multicast-Broadcast Services (MBS) is emerging as a critical technology for spectral-efficient UHD content delivery and serving as a promising solution to modernize CATV deployment. However, unlike unicast networks that rely on RLC-AM with HARQ retransmissions, MBS broadcast operates in RLC Unacknowledged Mode (RLC-UM), where the absence of a feedback loop means packet loss is permanent and immediately impacts user QoE. Conventional link adaptation algorithms, designed for unicast, typically aggressively maximize throughput and fail in this risk-intolerant environment, resulting in severe video stalls and rebuffering. To address this, we propose a lightweight Transformer-based framework that predicts the success probability of all 28 MCS indices over an upcoming video segment horizon. Utilizing a unique commercial network dataset with 0.5 ms slot-level granularity, we train our model using a custom Asymmetric Safety Loss function that penalizes channel overestimation to prioritize link stability. Experimental results show that our approach achieves a reliability score of 86.89%, significantly outperforming standard AI baselines optimized for raw throughput (31.65%) while maintaining a safe conservative bias. Furthermore, the model is optimized for real-time applications, demonstrating an inference time of less than 0.07 ms on COTS 5G-era smartphones.
[LG-134] Isolating Nonlinear Independent Sources in fMRI with β-TCVAE Models
链接: https://arxiv.org/abs/2605.16708
作者: Qiang Li,Shujian Yu,Jesus Malo,Jingyu Liu,Tülay Adali,Vince D. Calhoun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 6 pages, 2 figures
Abstract:Learning meaningful latent representations from nonlinear fMRI data remains a fundamental challenge in neuroimaging analysis. Traditional independent component analysis, widely used due to its ability to estimate interpretable functional brain networks, relies on a linear mixing assumption for latent sources, limiting its ability to capture the inherently nonlinear and complex organization of brain dynamics. More recently, deep representation learning methods have emerged as promising alternatives for modeling nonlinear latent structure. However, many of these approaches have been evaluated primarily on simulated datasets or natural image benchmarks, with comparatively limited validation on real-world neuroimaging data such as fMRI. In this work, we are motivated by the \beta -TCVAE (Total Correlation Variational Autoencoder), a refinement of the \beta -VAE framework for learning latent representations without introducing additional hyperparameters during training. We adapt and modify this model to fMRI data for nonlinear source disentanglement, aiming to separate mixed spatial and temporal brain signals into interpretable components. We show that the \beta -TCVAE framework can recover meaningful nonlinear spatial components with biological relevance, including well-established intrinsic connectivity networks such as the default mode network. Furthermore, we evaluate the learned representations using functional network connectivity, showing that the latent structure captures coherent and interpretable brain organization patterns. This study provides a pilot investigation that bridges nonlinear representation learning and fMRI analysis.
[LG-135] On-Device Interpretable Tsetlin Machine-Based Intrusion Detection for Secure IoMT
链接: https://arxiv.org/abs/2605.16707
作者: Rahul Jaiswal,Per-Arne Andersen,Linga Reddy Cenkeramaddi,Lei Jiao,Ole-Christoffer Granmo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 8 pages, 11 figures, 6 Tables, submitted to IEEE Intelligent Conference on Intelligence and Security Informatics (ISI-2026), Cambridge, UK
Abstract:The rapid evolution of digital health technologies is redefining healthcare services worldwide. The integration of wireless communication and Internet-enabled medical devices within Internet of Medical Things (IoMT) networks enables continuous, real-time patient monitoring. However, this increased connectivity raises cybersecurity and patient safety risks due to increasingly sophisticated cyberattacks. This paper proposes a novel on-device, interpretable Tsetlin Machine ™-based Intrusion Detection System (IDS) to identify various phases of cyberattacks in IoMT environments. The TM is a rule-driven and transparent machine learning (ML) approach that represents attack patterns using propositional logic. Extensive evaluations on the MedSec-25 dataset, encompassing various phases of realistic cyberattacks, show that the proposed model outperforms ML models and state-of-the-art methods, attaining a classification performance of 97.83%. Moreover, the proposed model offers explicit explanations of its decisions to enhance transparency using feature-level contributions, class-wise vote scores, and clause activation heatmaps. Edge deployment (Raspberry Pi) further supports real-time on-device inference and intrusion detection. The combination of interpretability and high performance makes the proposed model well-suited for IoMT healthcare, where trust, reliability, safety, and timely decision-making are critical.
[LG-136] Convex Dataset Valuation for Post-Training ICML’26
链接: https://arxiv.org/abs/2605.16704
作者: Siqi Zeng,Christopher Jung,Rui Li,Zhe Kang,Ming Li,Nima Noorshams,Zhigang Wang,Fuchun Peng,Han Zhao,Xue Feng
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICML '26. 30 pages, 8 figures
Abstract:Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during post-training. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all available data, necessitating principled dataset-level selection. These constraints are increasingly shaped by dataset marketplaces, where data acquisition is governed by budgets and negotiation. We study dataset valuation as a subset selection problem during LLM post-training. Our goal is to identify and weight auxiliary datasets so as to maximize target task performance given constrained budgets. We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a scalable convex dataset-level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post-training settings and tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead. Our results position dataset valuation as a practical decision tool for post-training data selection in market-constrained large language model settings. The code is available at this https URL.
[LG-137] Your SaaS Is an Insurance Product: A Modeling Framework
链接: https://arxiv.org/abs/2605.16699
作者: Caio Gomes(Magalu)
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM); Machine Learning (stat.ML)
*备注: 23 pages, 2 figures, 7 tables. Companion code archived at DOI https://doi.org/10.5281/zenodo.20213155
Abstract:Capped-usage SaaS products – LLM subscriptions such as Claude Code and ChatGPT, cloud platforms such as Vercel and Cloudflare Workers, corporate benefit platforms, identity-verification services with liability transfer – share a structural signature with insurance products: a fixed premium decoupled from realized consumption, stochastic per-user demand with heavy-tailed severity, a non-fungible cap that resets on a fixed schedule, and a portfolio-level exposure that requires reserve adequacy under tail risk. We argue that this is not an analogy. It is the same operational problem actuarial science has been tooled for decades to address, restated with new dependent variables (tokens, bandwidth bytes, function-invocations, gym check-ins) in place of medical claims. This paper proposes a modeling framework for capped-usage SaaS pricing built from frequency-severity decomposition, premium calculation principles, and Monte Carlo reserve adequacy. We map the framework to publicly observable subscription tiers in two domains (LLM services and cloud platforms), ground it in canonical health-insurance economics (Arrow 1963; Pauly 1968; Manning et al. 1987; Brot-Goldberg et al. 2017), and demonstrate divergence from traditional unit economics through a worked example. The contribution is operational rather than theoretical: not a new theorem, but vocabulary and tools currently absent from cs.LG/stat.ML practice.
[LG-138] UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models ICML2026
链接: https://arxiv.org/abs/2605.16690
作者: Van-Tuan Tran,Hong-Hanh Nguyen-Le,Marco Ruffini,Merim Dzaferagic
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to 45.0% computational reduction on low-resource clients while improving their performance by 8.7 \times compared to existing heterogeneous LoRA-rank methods.
[LG-139] Scalable Knowledge Editing for Mixture-of-Experts LLM s via Tensor-Structured Updates
链接: https://arxiv.org/abs/2605.16686
作者: Roman Maksimov,Vladimir Aletov,Dmitry Bylinkin,Daniil Medyakov,Vladimir Solodkin,Aleksandr Beznosikov
类目: Machine Learning (cs.LG)
*备注: 17 pages, 3 architectures, 1 figure, 6 tables
Abstract:Knowledge editing (KE) provides a lightweight alternative to repeated fine-tuning of LLMs. However, most existing KE methods target dense feed-forward layers, while modern LLMs increasingly adopt Mixture-of-Experts (MoE) architectures for their superior memory footprint and inference efficiency. This mismatch leaves a growing class of production models without principled editing tools. We propose a MEMIT-like framework for knowledge editing in MoE-based LLMs. Our method exploits the tensor structure of MoE layers to formulate the editing objective faithfully at the per expert level, and applies the Woodbury matrix identity to avoid materializing or inverting the full stacked matrix of expert weights. The resulting update reduces to inversions of fixed low-rank matrices and requires no additional backward passes. Empirically, our approach matches the editing quality of strong baselines on the main KE metrics while accelerating the editing procedure by up to 6x, owing to the batched MEMIT-style formulation and the low-dimensional inversions enabled by the Woodbury identity. These results show that closed-form, parameter-modifying KE can be extended efficiently beyond dense layers, opening a path toward scalable knowledge editing in modern sparse LLM architectures.
[LG-140] Identify Then Project: Contrastive Learning of Latent Dynamics from Partial Observations with Port-Hamiltonian Structure
链接: https://arxiv.org/abs/2605.16682
作者: Peilun Li,Kaiyuan Tan,Daniel Moyer,Thomas Beckers
类目: Machine Learning (cs.LG)
*备注:
Abstract:Identifying latent state representations and dynamics is essential when direct modeling in observation space is infeasible, particularly under partial and high-dimensional observations. In such settings, representation learning and physics-aware modeling are inherently coupled. We study this problem for latent port-Hamiltonian systems, a structured class encompassing both conservative and dissipative dynamics. We propose a two-stage identify-then-project framework. First, a contrastive teacher learns continuous-time latent dynamics from partial observations. Then, a student projects the identified teacher representation and dynamics onto a port-Hamiltonian submanifold via a learned affine chart, yielding a physically consistent realization. As a conceptual counterfactual, we also consider a single-stage variant that jointly learns latent identification and port-Hamiltonian structure, but find it to be less reliable, motivating the proposed two-stage teacher-student framework. We show theoretically that affine projection is the natural bridge between the affine gauge of contrastive latent identification and the port-Hamiltonian systems. Empirically, we demonstrate that the proposed two-stage approach preserves the teacher’s dynamics while enforcing physical structure, and performs more reliably than the single-stage alternative, particularly in dissipative regimes and high-dimensional visual settings.
[LG-141] In-context learning enables continental-scale subsurface temperature prediction from sparse local observations
链接: https://arxiv.org/abs/2605.16665
作者: Daniel O’Malley,Christopher W. Johnson,Javier E. Santos,Pablo Lara,Sandro Malusà,Bharat Srikishan,John Kath,Arnab Mazumder,Mohamed Mehana,David Coblentz,Nathan DeBardeleben,Earl Lawrence,Hari Viswanathan
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
Abstract:Continental-scale knowledge of subsurface temperature is limited by the cost and sparsity of borehole measurements, but such information is essential for geothermal resource assessment and for understanding heat transport in the shallow crust. The thermal field reflects the interaction between lithology, crustal structure, radiogenic heat production, and advective fluid flow, sometimes producing sharp anomalies that are smoothed by conventional interpolation or difficult to capture with physical models. Here we introduce In-Context Earth, a transformer-based model that uses sparse local borehole observations as geological context to predict continuous temperature-at-depth fields with calibrated uncertainty. In the contiguous United States, the model achieves a mean absolute error of 4.7 °C, outperforming the physics-informed Stanford Thermal Model, a model based on AlphaEarth embeddings, the multimodal Transparent Earth model, and universal kriging, while resolving sharper thermal gradients in geothermal provinces. Its uncertainty estimates are well calibrated, with a Kolmogorov-Smirnov statistic of 2.5%. Without finetuning, the model adapts to Alberta, Australia, and the United Kingdom (UK) using only 20 local observations at inference time, maintaining high accuracy in geologically distinct test regions with a mean absolute error of 2.2 °C in Alberta, 6.2 °C in Australia, and 5.4 °C in the UK. Interpretability analyses show that the model learns internal representations of subsurface properties it never observes during training, including seismic velocities, geochemistry, and crustal structure, and uses these representations in physically consistent ways. More broadly, this work shows that in-context learning can use sparse borehole observations for continental-scale subsurface characterization, without requiring dense measurements or region-specific retraining.
[LG-142] Public-Decay Homomorphic State Space Models for Private Sequence Inference
链接: https://arxiv.org/abs/2605.16647
作者: Luis Brito
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 19 pages, 3 figures
Abstract:Fully homomorphic encryption (FHE) changes sequence-model design because rotations, encrypted products, ciphertext materialization, multiplicative depth, and bootstrapping pressure can dominate ordinary neural-network costs. This paper presents public-decay homomorphic state space models (HSSMs), recurrent/state-space blocks whose carried state is updated through ciphertext-plaintext public decay while ciphertext-ciphertext multiplication remains on a local write path. The design keeps a fixed encrypted state across the sequence. The evaluated workflow separates client-side tokenization, frozen fastText lookup, projection, clipping, encryption, decryption, and thresholding from server-side encrypted evaluation over bounded projected features. On full Rotten Tomatoes and SST-2 validation splits, the encrypted HSSM path exactly matches plaintext classifications and reaches 0.7505 and 0.7420 accuracy. Against HE-friendly polynomial attention on the same fastText workloads, HSSM matches or exceeds full-sequence task quality while running about 5x faster. Paired L40S operation-level rows show 1.34-1.62x lower latency than cached final-token polynomial attention, 30-258x lower latency than full-sequence polynomial attention, and lower logical encrypted-state footprint. A T = 16/32 comparator with encrypted public-linear input and Q/K/V projections shows projected HSSM succeeding under depth 8/ring 32768, while projected attention succeeds under depth 10/ring 65536. A matched T = 8 OpenFHE/FIDESlib trace finishes at final level 3 and noise-scale degree 2 on both backends. These results make public-decay carry a practical FHE co-design lever for encrypted sequence inference from bounded projected features.
[LG-143] he Score Kalman Filter
链接: https://arxiv.org/abs/2605.16644
作者: Kaito Iwasaki,Anthony Bloch,Taeyoung Lee,Maani Ghaffari
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 56 pages, 27 figures
Abstract:A central obstacle in nonlinear Bayesian filtering is representing the belief distribution. Moment-based filters address this by propagating polynomial moments and reconstructing a density from them. Recent work completes the predict-update loop via the maximum-entropy (MaxEnt) principle, but each step requires the partition function and its gradient, both n -dimensional integrals whose cost scales exponentially, restricting the demonstrated MaxEnt moment filtering to n \le 4 . We avoid the partition function entirely by combining score matching with Stein’s identity. In our setting, score matching reduces the density fit to a single linear solve whose coefficients are assembled directly from the propagated moments. The same parameters then drive Stein’s identity to close the moment hierarchy during prediction and to recover posterior moments after each Bayesian update, keeping the full predict-update loop free of partition function evaluation. The resulting Score Kalman Filter (SKF) reduces to the classical information-form Kalman filter as a special case and performs every step through linear algebra. On nonlinear coupled-oscillator networks, the SKF runs through n=20 and reports lower RMSE than the EKF, UKF, EnKF, and particle-filter baselines on the tested synthetic benchmarks.
[LG-144] Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders
链接: https://arxiv.org/abs/2605.16640
作者: Tomasz Steifer
类目: Machine Learning (cs.LG)
*备注: Under review at a ML conference
Abstract:We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define parity-conditioned retrieval task and show that under constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently O(1) chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.
[LG-145] MedMIX: Modality-Internal Expert Fusion for Multimodal Medical Diagnosis
链接: https://arxiv.org/abs/2605.16639
作者: Seungik Cho,Anqi Li,Wei Qiu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multimodal clinical prediction faces three challenges: multiple foundation models (FMs) with complementary strengths per modality, pervasive missing modalities at training and test time, and sample-specific variation in modality contributions. We introduce MedMIX, a multimodal framework that combines intra-modality expert fusion, learned inter-modality fusion, and training-only large–small model collaboration for robust medical prediction under incomplete modalities. Within each modality, MedMIX aggregates complementary embeddings from multiple small expert models; across modalities, it performs learned fusion over available modalities; and during training, it leverages large teacher models to improve deployed representations without additional inference cost. Across three heterogeneous benchmarks (OpenI, MIMIC-IV-MM, and MMIST-ccRCC), MedMIX achieves consistently strong performance while remaining robust under controlled missing-modality perturbations, and further demonstrates sustained robustness under cross-cohort shift on MIMIC-III. These results highlight MedMIX as a practical framework that unifies within-modality expert collaboration, sample-specific cross-modality fusion, and efficient large–small model collaboration while remaining robust to incomplete modalities.
[LG-146] Does Weight Decay Enhance Training Stability?
链接: https://arxiv.org/abs/2605.16622
作者: Marius Saether,Amir Kolic,Tomaso Poggio,Pierfrancesco Beneventano
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 24 pages, 16 figures
Abstract:In modern deep learning, weight decay is often credited with “stabilizing” training dynamics, diverging from its classical role as a static regularization penalty. We investigate a fundamental question: does weight decay stabilize training dynamics, and if so, through which mechanism? Indeed, training stability is understood through different but related notions in the literature. We consider how weight decay affects the parameter-space dynamics and loss sharpness by analyzing its effects at the \emphEdge of Stability (EoS). We show that weight decay robustly slows *progressive sharpening. Furthermore, we uncover a striking architecture-dependent phase transition. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical \frac2\eta boundary. We develop a mathematical framework that accurately models these phenomena and identify the global alignment of the parameter vector and the sharpness gradient as the mechanistic driver of the phase transition. Importantly, we show that these phenomena translate into stability in terms of search in function-space (NTK). Last, this shows that curvature thresholds obtained from convex/quadratic heuristics may not be reliable stability diagnostics under regularization.
[LG-147] SCOUT: Cyclic Causal Discovery Under Soft Interventions with Unknown Targets
链接: https://arxiv.org/abs/2605.16620
作者: Alpar Turkoglu,Muralikrishnna G. Sethuraman,Faramarz Fekri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning causal relationships between variables from data is a fundamental research area with many applications across disciplines. Most existing causal discovery algorithms rely on the assumptions that (i) the underlying system is acyclic, (ii) the exogenous noise variables are Gaussian, and (iii) the intervention targets for the data-generating experiments are known. While these assumptions simplify the analysis, they are violated in real-life systems. Most existing methods that address these issues either assume the underlying model is linear or are constrained to operate in limited interventional settings. To that end, we propose SCOUT, a novel causal discovery framework for learning nonlinear cyclic causal relationships from soft interventional data with unknown targets. Our approach maximizes the data log-likelihood to recover the graph structure, using two normalizing-flow architectures: contractive residual flows and neural spline flows. Through experiments on synthetic and real-world data, we show that SCOUT outperforms state-of-the-art methods in both causal graph recovery and unknown target recovery across various interventional and noise settings.
[LG-148] MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
链接: https://arxiv.org/abs/2605.16616
作者: Sasi Kiran Gaddipati,Diyana Muhammed,Farhana Keya,Gollam Rabby,Sören Auer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Autonomous research systems capable of generating complete scientific manuscripts have advanced rapidly, yet robust and realistic evaluation frameworks have failed to keep pace. To bridge this gap, we introduce MLReplicate, an end-to-end benchmark evaluating autonomous research systems on machine learning reproducibility. The benchmark was constructed from ICML 2025 outstanding papers reformulated into standardized input specifications and evaluated across 6 state-of-the-art research systems: AI SCIENTIST-V1, AI SCIENTIST-V2, AGENT LABORATORY, CYCLERESEARCHER, AI RESEARCHER, and TINY SCIENTIST, yielding 45 generated manuscripts, with 3 failed experiments. Outputs are assessed using a dual-protocol approach that combines automated conference-style review and structured expert human evaluation, while tracking computational cost, runtime, and the amount of required human intervention. The automated conference-style review accepted 10 out of 37 valid submissions. An additional 8 submissions were desk-rejected before review for failing to meet the minimum page threshold. In contrast to automated reviews, human reviewers consistently identified methodological flaws, hallucinated experimental results, and reproducibility failures across all systems, and 59% of accepted automated reviews contained fabricated or unsupported claims. We further find that neither token budget nor computational cost predicts output quality: the cheapest system outperforms the most resource-intensive system in human evaluation, despite a 38-fold difference in input tokens. We thus demonstrate that autonomous research workflow design matters more than the scale of compute. MLReplicate exposes a substantial gap between current autonomous research systems and genuine scientific rigor, and establishes a practical, extensible evaluation framework for systematic progress toward trustworthy AI-driven scientific discovery.
[LG-149] Learning What Evaluators Value: A Reliable Approach to Modeling Evaluator Preferences
链接: https://arxiv.org/abs/2605.16615
作者: Madeline Celi Kitch,Nihar B. Shah
类目: Machine Learning (cs.LG)
*备注:
Abstract:In many applications, human and LLM evaluators use assessments of relevant criteria to create an overall evaluation for an item or individual. For example, in admissions, committees assess candidates on attributes such as test scores, GPA, and research experience to evaluate their overall fit for the program. Another example arises in medical care where clinicians use patient reports of symptoms to consider preliminary diagnoses and assess risks. Each setting involves mapping multiple criteria to an overall evaluation – a process that reflects the evaluator’s underlying preferences. We focus on the fundamental question of learning these preferences. Many applications of this problem make specific modeling assumptions on evaluator preferences that may be substantially violated in the real world. We make the minimal assumption that the preference function is coordinate-wise non-decreasing, which is reasonable in a large number of evaluation settings. We theoretically characterize the severity of model mismatch for many common assumptions and show that it can lead to significant issues for learning evaluator preferences and other important downstream tasks. We then present an algorithm for learning evaluators’ preferences that is robust to model mismatch. We prove theoretically that our algorithm can learn any preference function without sacrificing performance when the linearity assumption holds. Evaluations of our algorithm with synthetic simulations and real-world data confirm its ability to learn preferences robustly and illustrate key aspects of LLM and human preferences. Subjects: Machine Learning (cs.LG) MSC classes: 62G08 ACMclasses: I.2.6; G.3; J.4 Cite as: arXiv:2605.16615 [cs.LG] (or arXiv:2605.16615v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16615 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-150] nsor Cookbook: Mastering Tensors through Diagrams
链接: https://arxiv.org/abs/2605.16610
作者: Beheshteh T. Rakhshan,Guillaume Rabusseau
类目: Machine Learning (cs.LG); General Literature (cs.GL)
*备注:
Abstract:High-dimensional data arise naturally in many areas of science and engineering, including machine learning, signal processing, computational physics, and statistics. Such data are often represented as tensors, multi-dimensional generalizations of matrices. While tensors provide a natural representation for multi-modal structure, their direct manipulation quickly becomes challenging as the order grows: the number of parameters increases exponentially, and algebraic expressions involving many indices become difficult to interpret and implement. Tensor networks (TNs) provide an effective framework for addressing these challenges. Originally introduced by Penrose and developed extensively in quantum physics, the graphical language of tensor networks encodes contractions as edges in a graph, reducing notational overhead and revealing structural properties obscured by index notation. Despite the central role of high-dimensional tensors in modern machine learning and numerical analysis, tensor network diagrams remain underutilized outside quantum computing, partly due to the lack of a self-contained mathematical reference accessible to a broad technical audience. This manuscript provides a self-contained guide to tensor networks and their use in tensor algebra. We present the main operations on tensors, contractions, products, and reshaping through, graphical notation, and show how classical tensor decompositions and related computations are naturally expressed in this framework. We also illustrate how tensor networks simplify the derivation of gradients and the manipulation of high-dimensional probability distributions. Throughout, we show that the diagrammatic approach yields genuinely shorter and more transparent proofs of classical identities, rank bounds, and gradient formulas that would otherwise require laborious index manipulation.
[LG-151] R2V Agent : Teaching SLMs When to Ask for Help
链接: https://arxiv.org/abs/2605.16604
作者: Raghu Vamshi Hemadri,Humaira Firdowse Mohammed,Rishabh Maheshwary,Srivatsava Daruru,Sagar Davasam,Vikas Yadav,Srinivas Sunkara,Sai Rajeswar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid-trajectory - after flaky tool calls, truncated observations, or compounding local errors - making pre-execution routing brittle. We introduce \textbfR2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step-level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well-defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy’s residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability-cost frontier: it achieves 94.3% HumanEval+ success with 0.60% LLM escalation, recovers TextWorld from 64.6% SLM-only success to 98.2% at 41.7% escalation, and reaches 93.3% TerminalBench success at 33.9% LLM calls, roughly half the heuristic-router cost.
[LG-152] fPINN-DeepONet: A Physics-Informed Operator Learning Framework for Multi-term Time-fractional Mixed Diffusion-wave Equations
链接: https://arxiv.org/abs/2605.16594
作者: Binghang Lu,Zhaopeng Hao,Christian Moya,Guang Lin
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we develop a physics-informed deep operator learning framework for solving multi-term time-fractional mixed diffusion-wave equations (TFMDWEs). We begin by deriving an L_2 approximation, which achieves first-order accuracy for the Caputo fractional derivative of order \beta \in (1,2) . Building upon this foundation, we propose the fPINN-DeepONet framework, a novel approach that integrates operator learning with the L_2 approximation to efficiently solve fractional partial differential equations (FPDEs). Our framework is successfully applied to both fixed and variable fractional-order PDEs, demonstrating the framework’s versatility and broad applicability. To evaluate the performance of the proposed model, we conduct a series of numerical experiments that involve dynamically varying fractional orders in both space and time, as well as scenarios with noisy data. These results highlight the accuracy, robustness, and efficiency of the fPINN-DeepONet framework.
[LG-153] Structure-Aware Masking for Protein Representation Learning
链接: https://arxiv.org/abs/2605.16581
作者: Thomas Walton,Ayan Goel,Amirali Aghazadeh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Masked language modeling (MLM) is the standard objective for training protein language models, typically implemented by randomly masking individual residues at a fixed rate (e.g., 15%). This practice implicitly assumes that all sequence positions contribute equally to representation learning. In downstream fitness prediction tasks, however, protein sequences are governed by three-dimensional structural dependencies and long-range residue contacts that induce strong nonlocal couplings between residues. We introduce Bucket Masking, a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for protein function. Across four downstream protein fitness prediction tasks, Bucket Masking enables up to a 14% improvement over standard random masking, excelling at predicting higher-order mutational interactions. Through controlled ablations, we show that these improvements arise from mask placement rather than span size, establishing masking as a positional inductive bias.
[LG-154] World Model-Enabled Causal Digital Twins for Semantic Communications in Physical AI Systems
链接: https://arxiv.org/abs/2605.16547
作者: Lingyi Wang,Tingyu Shui,Walid Saad,Pascal Adjakple
类目: Machine Learning (cs.LG)
*备注:
Abstract:Semantic communication has emerged as a promising paradigm for enabling goal-oriented networking. However, most existing semantic communication solutions are tailored to one-shot tasks and optimize instantaneous performance. Hence, they cannot be used to support closed-loop dynamic systems with physical artificial intelligence (AI), in which the transmitted semantics affect not only the current inference outcome but also future control actions, state evolution, and ultimately long-horizon task performance. To address this gap, this paper investigates goal-oriented semantic communications for physical AI systems with closed-loop sensing-communication-inference-control. In particular, the problem of semantic communications is formulated as a long-term return-per-bit maximization under wireless bit-budget constraints while capturing both control efficiency and communication efficiency. To solve this problem, a novel causal information value (CIV) metric is introduced to evaluate the marginal contribution of each semantic token to the expected long-term return by transmission interventions. Then, a world-model-enabled causal digital twin (WM-CDT) framework is proposed to capture the dynamics of closed-loop physical AI systems and enable counterfactual reasoning for long-horizon imagined rollouts. Based on these imagined rollouts, an actor-critic policy is trained for long-horizon agent control with high data efficiency, while the semantic token selector is trained through CIV-per-bit evaluation. Extensive simulations on an AirSim-Sionna-based unmanned aerial vehicle (UAV) navigation simulator show that the proposed WM-CDT framework achieves significant improvement in return-per-kbit and navigation success rate compared to existing reinforcement learning solutions.
[LG-155] Boundedly Rational Meta-Learning in Sequential Consumer Choice
链接: https://arxiv.org/abs/2605.16532
作者: Mehrzad Khosravi,Max Kleiman-Weiner,Hema Yoganarasimhan
类目: Machine Learning (cs.LG); General Economics (econ.GN)
*备注:
Abstract:Many consumer decisions are repeated choices under uncertainty. Standard models capture these decisions using Bayesian learning and dynamic programming: consumers update beliefs from feedback and use those beliefs to guide future choices. In many markets, however, learning does not restart when consumers enter a new context: prior experience with a brand, product, or provider can shape beliefs in later, related decisions. We study this cross-context knowledge transfer, or meta-learning, in sequential choice. We design a hierarchical laboratory task in which participants repeatedly choose among airlines across routes and observe noisy binary outcomes. Reduced-form evidence shows that participants improve not only within routes, but also across routes: they choose better airlines earlier in later routes and reduce pseudo-regret. To identify the mechanism behind this transfer, we compare human choices to a no-transfer benchmark and a fully integrated Bayesian meta-learning benchmark. In particular, we introduce a class of boundedly rational meta dynamic programming policies, BRMDP(D), that approximate full integration using a limited number of hyper-posterior draws, denoted by D. Trial-by-trial likelihood comparisons show that low-D boundedly rational meta-learning, especially BRMDP(1), fits participant behavior better than both no transfer and fully integrated Bayesian transfer. Consumers, therefore, transfer brand-level regularities across contexts, but through coarse representations of prior uncertainty. The findings imply that models of consumer learning should allow for approximate cross-context transfer, and that managerial counterfactuals based on either no-transfer or fully integrated learning can be misleading.
[LG-156] Multiscale Supervised Unbalanced Optimal Transport Flow Matching
链接: https://arxiv.org/abs/2605.16529
作者: Qiangwei Peng,Lezhi Chen,Peijie Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Unbalanced optimal transport (UOT) provides a principled framework for modeling single-cell transitions and birth-death dynamics, but its high computational cost limits scalability to large-scale datasets. Although single-cell data often contain hierarchical annotations and known transition priors, existing UOT approximations rarely exploit this multiscale structure or prior knowledge. We introduce Multiscale Supervised Unbalanced Optimal Transport Flow Matching (MUST-FM), a simulation-free framework that scales UOT by leveraging hierarchical data structure. MUST-FM further supports an optional supervised formulation that incorporates transition priors, such as cell lineages, to guide the learning of displacement fields and mass variations. Experiments show that MUST-FM reduces computational overhead while achieving robust and biologically meaningful trajectory inference, enabling dynamic modeling of atlas-scale single-cell datasets.
[LG-157] Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing
链接: https://arxiv.org/abs/2605.16520
作者: Zeji Yi,Chaoyi Pan,Guanya Shi,Guannan Qu
类目: Machine Learning (cs.LG)
*备注: 57 pages, 5 figures
Abstract:Sampling-based optimization (SBO), like cross-entropy method and evolutionary algorithms, has achieved many successes in solving non-convex problems without gradients, yet its convergence is poorly understood. In this paper, we establish a non-asymptotic convergence analysis for SBO through the lens of smoothing. Specifically, we recast SBO as gradient descent on a smoothed objective, mirroring noise-conditioned score ascent in diffusion models. Our first contribution is a landscape analysis of the smoothed objective, demonstrating how smoothing helps escape local minima and uncovering a fundamental coverage-optimality trade-off: smoothing renders the landscape more benign by enlarging the locally convex region around the global minimizer, but at the cost of introducing an optimality gap. Building on this insight, we establish non-asymptotic convergence guarantees for SBO algorithms to a neighborhood of the global minimizer. Furthermore, we propose an annealed SBO algorithm, Diffusion-Inspired Dual-Annealing (DIDA), which is provably convergent to the global optimum. We conduct extensive numerical experiments to verify our landscape results and also demonstrate the compelling performance of DIDA compared to other gradient-free optimization methods. Lastly, we discuss implications of our results for diffusion models.
[LG-158] QuChaTeR: A Hybrid Quantum-Chaotic Temporal Framework for Earthquake Prediction ICASSP2026
链接: https://arxiv.org/abs/2605.16454
作者: Emir Kaan Özdemir
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Quantum Physics (quant-ph)
*备注: Accepted at 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP 2026). This is the accepted version of the paper. The final published version will appear in the IEEE proceedings. Proc. IEEE ICASSP 2026, Barcelona, Spain, 2026
Abstract:Seismic prediction remains challenging due to the highly nonlinear and chaotic dynamics of earthquake signals. While classical deep learning models such as LSTMs and CNNs capture local temporal features, and quantum models offer richer state representations, their integration with chaos-driven mechanisms is underexplored. We introduce QuChaTeR, a hybrid architecture that combines wavelet-based preprocessing, chaotic maps, and variational quantum circuits with recurrent structures to enhance temporal feature extraction. Implemented in PyTorch and PennyLane, QuChaTeR is benchmarked against classical (LSTM, GRU, RNN, 1D-CNN, Reservoir Computing) and quantum-inspired (Quantum LSTM) baselines. On real-world seismic datasets, QuChaTeR consistently converges faster and achieves superior performance across multiple evaluation criteria. Despite promising results, scalability and quantum hardware limitations remain challenges. Overall, this work demonstrates how quantum-chaotic hybridization provides a practical pathway toward more accurate and robust earthquake prediction.
[LG-159] OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence
链接: https://arxiv.org/abs/2605.16395
作者: Jiajian Li,Jingyuan Huang,Junru Gong,Qi Wang,Xiaokang Yang,Yunbo Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project page: this https URL
Abstract:We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically-grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end-to-end differentiability throughout the entire simulation loop – spanning from explicit state transitions to visual observation generation – OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient-based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state-of-the-art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.
[LG-160] M2FedAQI: Multimodal Federated Learning for Air Quality Prediction on Heterogeneous Edge Devices
链接: https://arxiv.org/abs/2605.16375
作者: Manjil Nepal,Kimsie Phan,Tamoghna Ojha,Aritra Dutta,M Krishna Siva Prasad
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Accurate air quality prediction is essential for public health, environmental monitoring, and industrial safety. However, most existing approaches rely on centralized learning paradigms, which introduce challenges related to scalability, privacy preservation, and communication overhead in distributed Internet of Things (IoT) environments. Moreover, current federated learning (FL) based solutions predominantly utilize unimodal data, limiting their capability to capture complex environmental patterns. To address these limitations, we propose M ^2 FedAQI, a lightweight multimodal federated framework for decentralized Air Quality Index (AQI) prediction across heterogeneous edge devices. The proposed framework integrates visual and tabular modalities through a feature modulation based fusion mechanism that enables efficient cross-modal interaction while maintaining low computational overhead. M ^2 FedAQI is evaluated on two benchmark datasets, PM25Vision and TRAQID, for both classification and regression tasks under centralized and federated settings. Experimental results demonstrate that M ^2 FedAQI consistently outperforms existing approaches, achieving improvements of up to 11.0% in Accuracy, 3.53% in AUC, 12.2% in F1-score, and 18.0% in R^2 , while reducing MAE and RMSE by up to 25.4% and 20.4%, respectively, compared with the strongest baselines. Furthermore, deployment on heterogeneous edge devices demonstrates efficient resource utilization in terms of communication overhead, memory footprint, and computational cost. To enhance communication security, TLS-based authentication is incorporated to ensure secure client participation and protect the FL communication channel from unauthorized third-party access without modifying the underlying FL protocol.
[LG-161] Machine Learning-Based Pre-Test Risk Stratification for PCR-Confirmed Chlamydia Using Patient-Reported Data and Urine Biomarkers
链接: https://arxiv.org/abs/2605.16365
作者: Mehrab Mahdian,Marko Lehes,Katrin Krolov,Tamas Pardy
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
Abstract:Early identification of individuals at elevated risk of Chlamydia trachomatis infection may enable optimal use of molecular testing in resource-aware screening. We evaluate the feasibility of pre-test risk stratification (PTRS) using machine-learning models trained on routinely available, non-invasive clinical data. A curated dataset of 93 urine samples with PCR reference labels was analyzed using three feature groups: patient-reported history and symptoms, urine biomarkers from standard urinalysis, and their combination. Five supervised classifiers were evaluated using stratified 5-fold cross-validation with out-of-fold probability estimates. Performance was assessed using area under the receiver operating characteristic curve (AUC) and threshold-dependent metrics, with uncertainty quantified via bootstrap confidence intervals. Models using only patient-reported data showed moderate discrimination (AUC up to 0.72). Urine biomarker-based models demonstrated slightly lower peak discrimination but more consistent performance, with ensemble methods yielding the strongest results. Combining feature groups marginally increased the peak AUC and reduced performance variability across models, indicating improved robustness. Findings indicate that urine biomarkers provide a reliable predictive signal for PTRS that is complementary to patient-reported information, while feature integration enhances robustness. This work supports the integration of non-invasive, routinely available information for PTRS into screening workflows, including decentralized or home-based PCR contexts, to optimize testing prioritization. Subjects: Machine Learning (cs.LG); Databases (cs.DB) Cite as: arXiv:2605.16365 [cs.LG] (or arXiv:2605.16365v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16365 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Mehrab Mahdian [view email] [v1] Sat, 9 May 2026 19:21:59 UTC (510 KB)
[LG-162] ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage
链接: https://arxiv.org/abs/2605.16363
作者: Wenbo Gao,Songbai Tan,Zhongan Wang,Fei Shen,Gang Xu,Huiping Zhuang,Yunyun Yang,Ming Li,Xiaofeng Zhu
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Smartphone scams are increasingly prevalent and typically manifest as multi-stage, cross-application processes with gradually emerging intent. Effective intervention thus requires anticipating scams before the intent becomes explicit. This is inherently challenging, as decisions must rely on partial trajectories with temporally distributed evidence. In this paper, we propose \textbfORACLE Online Reasoning for Anticipating Cross-temporal Latent thrEats, the first agentic framework for early scam anticipation from \textitstreaming app-usage trajectories. To support this setting, we curate a real-world long-horizon benchmark of streaming app-usage trajectories, covering 12 scam types, spanning extended periods (15 days on average), involving diverse applications (95 apps), and interleaving normal and scam behaviors. To address fragmented evidence, we introduce a self-evolving context manager that adaptively consolidates entity-centric interactions over time, enabling more effective reconstruction of cross-temporal evidence from partial observations. To enhance sensitivity to latent early-stage signals, we propose an on-policy self-distillation scheme in which a teacher model, conditioned on summarized anti-scam reflections and clues by skills, supervises a student model without access to such reflections. This scheme thereby distills evidence-informed knowledge and improves recognition of emerging fraud patterns from partial trajectories. Experiments show that \method consistently improves early scam anticipation, yielding timely warnings while reducing false alerts in realistic streaming scenarios.
[LG-163] Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap
链接: https://arxiv.org/abs/2605.16349
作者: Feilong Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mixture-of-Experts (MoE) architectures achieve scalable capacity through sparse routing, yet the geometric structure of expert specialization remains poorly understood. We introduce a unified Jacobian-PCA-Grassmann framework for analyzing MoE layers in both function space and representation space. Across pretrained MoE Transformers (Mistral, Qwen), we find a consistent structural asymmetry: experts exhibit strong functional decorrelation (consistently low, near-zero cross-expert Jacobian alignment) while their routed representations occupy distinct but partially overlapping subspaces. This indicates that functional decorrelation and representation overlap coexist rather than coincide in MoE specialization. Controlled routing experiments further indicate that routing sparsity appears to be a key factor shaping this geometry: top-k routing induces sharper functional separation and larger subspace divergence, whereas fully soft routing yields more entangled expert structure. Together, these results suggest a geometric interpretation in which MoE layers may be viewed as implementing locally decorrelated operators over overlapping submanifolds on a shared representation manifold, and provide a general diagnostic framework for studying conditional computation in modern Transformer architectures.
[LG-164] HPC-LLM : Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support
链接: https://arxiv.org/abs/2605.16347
作者: Nourin Shahin,Izzat Alsmadi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern scientific research increasingly depends on High-Performance Computing (HPC) infrastructures, yet many researchers face significant operational barriers when interacting with cluster environments, job schedulers, GPU resources, and parallel computing frameworks. General-purpose large language models (LLMs) provide useful coding assistance but often lack the domain-specific operational knowledge required for reliable HPC support. This paper presents HPC-LLM, a retrieval augmented and domain-adapted assistant designed to support common HPC workflows including Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The proposed framework integrates automated documentation ingestion, dense retrieval, lightweight domain adaptation using QLoRA, and local inference within a modular orchestration pipeline. To support domain adaptation, we construct an HPC-oriented corpus from publicly available university HPC documentation, curated operational examples, and synthetic instruction-answer pairs generated from retrieved HPC content. The resulting dataset contains approximately 9,000 to 24,000 HPC-focused training examples spanning job scheduling, GPU computing, distributed training, storage systems, and cluster administration topics. We fine-tune Llama 3.1 8B using QLoRA and evaluate the resulting model against several open weight baselines under retrieval-augmented settings on JetStream2 infrastructure. Experimental results indicate that the adapted 8B model achieves performance comparable to substantially larger general-purpose models while operating under significantly lower GPU memory requirements and inference latency. In particular, the adapted model approaches the performance of Qwen 2.5 14B while requiring substantially fewer computational resources.
[LG-165] Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization
链接: https://arxiv.org/abs/2605.16341
作者: Tatsuhiro Nakamori,Laura Gomezjurado Gonzalez,Ganesh Talluri,Ansh Tiwari,Hideyuki Kawashima,Ioannis Mitliagkas,Guillaume Rabusseau,Hiroki Naganuma
类目: Machine Learning (cs.LG)
*备注: 24 pages, 3 figures, 11 tables
Abstract:Low-rank gradient compression reduces communication in distributed training by representing updates with rank- r factors. Dion is a recent method that approximates Muon, a spectral optimizer that orthogonalizes momentum, using one step of power iteration followed by column normalization (rescaling each column of the right factor to unit length). This makes it compatible with fully sharded data parallel training, but it converges more slowly than full-rank spectral methods. We show that this gap is geometric: column normalization does not yield the rank- r polar factor that Muon implicitly targets, so the resulting direction violates the dual-norm constraint of the low-rank spectral geometry, and the rate picks up an extra factor of \sqrtr even though the low-rank approximation of the gradient itself is accurate. The same mismatch enters the smoothness term and the error-feedback recursion in the analysis, which has a knock-on effect on empirical performance. We propose Orth-Dion, which replaces column normalization with QR orthogonalization of the right factor. Under non-Euclidean smoothness, with L_r the curvature constant along rank- r directions, Orth-Dion attains rate O(\sqrtL_r/T) , matching exact spectral methods at the same per-step communication cost as Dion. The proof removes the bounded-drift assumption common in prior error-feedback analyses via a self-consistent fixed-point argument, and uses a time-averaged contraction that only requires the error sequence to contract on average rather than at every step. Experiments on large-scale language model pre-training validate the predicted \sqrtr scaling and show that Orth-Dion closes the convergence gap to Muon at Dion’s communication cost.
[LG-166] Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
链接: https://arxiv.org/abs/2605.16339
作者: Shunchang Liu,Xin Chen,Belen Martin Urcelay,Francesco Croce
类目: Machine Learning (cs.LG)
*备注:
Abstract:Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability, producing contradictory preference assignments in response to subtle, meaning-preserving input variations. We analyze this instability at the representation level under three semantic-preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers. We attribute this instability to over-reliance on predictive yet brittle features, which we term unstable features, and isolate them via Sparse Autoencoders (SAEs) in a sparse latent space where benign and perturbed inputs activate distinctly separable patterns. Building on this separability, we propose two SAE-based instability mitigation strategies: SAE Feature Steering, which identifies and suppresses anomalously activated features at inference, and SAE Residual Correction, which learns adaptive adjustments over SAE features to restore correct preferences. Our methods substantially reduce incorrect preference assignments on harmlessness and hallucination benchmarks while preserving benign performance and general utility on other tasks, without retraining the reward model. Our code and data are available in \urlthis https URL.
[LG-167] Bi-Level Chaotic Fusion Based Graph Convolutional Network for Stock Market Prediction Interval
链接: https://arxiv.org/abs/2605.16324
作者: Eshwar Sai Kandimalla,Sravan Chowdary Kankanala,Sumana Bhimineni,Hem Sundhar Korukunda,Vivek Yelleti
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Statistical Finance (q-fin.ST)
*备注:
Abstract:Financial market forecasting is inherently uncertain, yet most deep learning approaches rely on point predictions that provide only single-value estimates without quantifying uncertainty. Such predictions are insufficient for risk-aware decision-making, as they fail to capture the range of possible outcomes and the associated confidence of this http URL problem can be solved using prediction intervals, which allow obtaining an upper and lower bound for the prediction, thus enabling uncertainty representation in the model. Yet, the current methods tend to disregard relationships between assets or cannot simultaneously ensure good calibration and sharpness of the resulting intervals in dynamically changing market regimes. In our work, we propose a spatio-temporal graph-based approach with a bi-level chaotic fusion technique to solve this problem. Our model uses separate nonlinear transformation functions to estimate the interval center and width. Additionally, a volatility-aware gating mechanism is used to make predictions dependent on the regime in which the market operates. Temporal dependencies are considered by embedding graph structures and sequentially modeling them. Training is conducted according to a Lower-Upper Bound Estimation (LUBE) objective. Our experimental results show significant improvements compared to existing baselines (LSTM, GRU, GCN, HGNN) when applied to data from 2016 to 2026 with 43 leading companies in eight sectors of the NSE. It provides the lowest Winkler score (0.0778), tightest prediction intervals (PIAW = 0.1407), and highest coverage (PICP = 96.6%), with all differences statistically significant (p 0.001) according to the Diebold-Mariano test.
[LG-168] Language Game: Talking to Non-Human Systems
链接: https://arxiv.org/abs/2605.16321
作者: Yanbo Zhang,Michael Levin
类目: Machine Learning (cs.LG)
*备注: 29 pages, 12 figures, 7 tables
Abstract:Language carries thought and coordination among humans but rarely reaches further along the spectrum of diverse intelligence. Yet non-neural systems – from gene regulatory networks and microbial consortia to fungi – are increasingly recognized as substrates of computation, decision-making and memory, making dialogue with non-human intelligence newly conceivable. Today such dialogue is attempted only by proxy: a large language model speaks on the system’s behalf, so any intelligence on display originates from the model while the system itself remains silent. Here we ask whether the system can speak in its own voice. Following Wittgenstein, who located meaning in use, we treat communication as a game played with the system. Its internal dynamics are frozen as the nonlinear core of a reinforcement-learning policy, with only linear input and output interfaces trained. Through use and reward, the system’s states and responses acquire meaning within the game, so playing becomes speaking. Because different architectures playing the same game optimize the same reward, their behaviors can all be read as pursuit of that reward; the game serves as a lingua franca across otherwise irreconcilable representations. Given a human prompt, a language model routes it to the game whose semantics best match it and designs an environmental state for which the desired action is the rational response, letting the system reply through its own behavior. Applied across diverse gene regulatory networks and reinforcement-learning tasks, the framework yields fluent dialogue without altering any system parameter, shows that well-trained agents of disparate origin converge on similar behavior, and reveals that specific GRN properties make a system easier or harder to talk with – an inductive bias of the reservoir itself. Our framework opens a new route to conversing with any dynamical system on its own terms.
[LG-169] Forecasting Medium-Horizon Alzheimers Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories
链接: https://arxiv.org/abs/2605.16319
作者: Ran Tong,Tong Wang,Lanruo Wang,Yin Ni
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: Preprint; includes appendix, 4 figures, and 6 tables
Abstract:Medium-horizon Alzheimer’s disease progression prediction is difficult because future clinical scores can remain tied to baseline severity, while biomarker histories are irregular and incompletely observed. We develop an anchor-based analysis of 24-month Clinical Dementia Rating Sum of Boxes (CDR-SB) change using harmonized Alzheimer’s Disease Neuroimaging Initiative (ADNI) tables. Each labeled sample is anchored at a mild cognitive impairment visit, uses only clinical and biomarker history observed at or before that anchor, and defines the response as CDR-SB at the future visit closest to 24 months within an 18–30 month window minus anchor CDR-SB. The analytic cohort contains 2,600 labeled anchors from 858 participants and 7,276 longitudinal rows. We propose a residual gap-aware transformer that combines a mixed-effects statistical reference with transformer-based residual learning from pre-anchor clinical and biomarker histories. The model uses participant-level random intercepts in the mixed-effects reference, observation-level triplet tokenization for irregular histories, and a learned nonnegative time-gap penalty inside self-attention. We compare the proposed model with a Bayesian-information-criterion-selected linear mixed-effects baseline, GRU-D, and STraTS under repeated participant-level train–test splits. Across five participant-level random seeds, the proposed model achieves the best mean test performance across all reported metrics, reducing MSE by 13.1% and increasing prediction–observation correlation by 26.4% relative to the mixed-effects baseline. It also improves over both GRU-D and STraTS in mean error and correlation. These results show that statistical anchoring and gap-aware residual learning provide a useful structure for medium-horizon Alzheimer’s disease progression prediction.
[LG-170] Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning
链接: https://arxiv.org/abs/2605.16318
作者: Matthew Schlegel,Volodymyr Tkachuk,Adam White,Martha White
类目: Machine Learning (cs.LG)
*备注: Published in TMLR in 2023, https: // openreview. net/ forum? id= K6g4MbAC1r .Transactions on Machine Learning Research (2023)
Abstract:Building and maintaining state to learn policies and value functions is critical for deploying reinforcement learning (RL) agents in the real world. Recurrent neural networks (RNNs) have become a key point of interest for the state-building problem, and several large-scale reinforcement learning agents incorporate recurrent networks. While RNNs have become a mainstay in many RL applications, many key design choices and implementation details responsible for performance improvements are often not reported. In this work, we discuss one axis on which RNN architectures can be (and have been) modified for use in RL. Specifically, we look at how action information can be incorporated into the state update function of a recurrent cell. We discuss several choices in using action information and empirically evaluate the resulting architectures on a set of illustrative domains. Finally, we discuss future work in developing recurrent cells and discuss challenges specific to the RL setting.
[LG-171] SignMuon: Communication-Efficient Distributed Muon Optimization
链接: https://arxiv.org/abs/2605.16311
作者: Neel Mishra,Kushagara Trivedi,Pawan Kumar
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 40 pages, 9 figures
Abstract:Distributed training of large neural networks is bottlenecked by full-precision gradient communication and by coordinatewise optimizers that ignore the matrix structure of weight tensors. We propose Sign-Muon, a 1-bit, matrix-aware optimizer that combines majority-vote sign aggregation from signSGD with the polar-step framework of Muon. Each worker forms a Muon-style direction by taking the polar factor of its momentum via a Newton–Schulz iteration, transmits only the entrywise signs, and aggregates by majority vote; an optional local polar step further enforces orthogonality at no extra communication cost. Under spectral-norm smoothness and bounded-variance stochastic gradients, the spectral-norm normalized sign step yields an \mathcalO(1/\sqrtT) nonconvex rate for an \ell_1 -based stationarity measure. With unimodal symmetric noise, majority vote across M workers cuts the stochastic term by 1/\sqrtM , matching signSGD. In the \alpha - \beta model, distributed Sign-Muon needs only one integer sum-allreduce per iteration; all orthogonalization is local, giving a 32\times bandwidth reduction over float32 ( 4\times for int8). Across 330 CIFAR-10/ResNet-50 configurations Sign-Muon attains the best validation accuracy (92.15%); its 4-GPU majority-vote variant reaches 92.02% with 37% less training time at matched effective batch. On nanoGPT, Sign-Muon achieves lower perplexity and better anytime performance than other sign-based baselines, with favorable weak-scaling up to 16 GPUs. Comments: 40 pages, 9 figures Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2605.16311 [cs.LG] (or arXiv:2605.16311v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.16311 Focus to learn more arXiv-issued DOI via DataCite
[LG-172] Mirror Descent-Type Algorithms for the Variational Inequality Problem with Functional Constraints
链接: https://arxiv.org/abs/2605.16262
作者: Mohammad S. Alkousa,Fedor S. Stonyakin,Belal A. Alashqar,Seydamet S. Ablaev
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Variational inequalities play a key role in machine learning research, such as generative adversarial networks, reinforcement learning, adversarial training, and generative models. This paper is devoted to the constrained variational inequality problems with functional constraints (inequality-type constraints). We propose some mirror descent-type algorithms that switch between productive and non-productive steps depending on the values of the functional constraints at iterations, with many different step size rules and stopping criteria. We analyze the proposed algorithms and prove their optimal convergence rate to achieve a solution with desired accuracy, for problems with bounded and monotone operators and Lipschitz convex functional constraints. In addition, we propose a modification of the proposed algorithms by considering each functional constraint in the calculation when we have a productive step, as well as the first constraint that violates the feasibility. This modification can save the running time of algorithms when we have many functional constraints. In addition, we provide an analysis of the proposed algorithms for \delta -monotone operators, allowing us to apply the proposed algorithms, as a special case, to constrained minimization problems when we do not have access to the exact information about the subgradient of the objective function. Numerical experiments that illustrate the work and performance of the proposed algorithms are also given.
[LG-173] Quantitative Linear Logic for Neuro-Symbolic Learning and Verification
链接: https://arxiv.org/abs/2605.13845
作者: Thomas Flinkow,Ekaterina Komendantskaya,Matteo Capucci,Rosemary Monahan
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: 23 pages, 2 figures, 13 tables
Abstract:Differentiable Logics are deployed in neuro-symbolic learning tasks as a way of embedding logical constraints in the training objective of neural networks. A differentiable logic consists of a syntax to write logical properties and a semantics to interpret them as real-valued functions to be folded in the loss function. A defining trade-off of the field is that between logical properties of the connectives, and analytic concerns for the semantics, with both aspects being relevant in applications. At one extreme we find fuzzy logics, that have well-established algebraic and proof-theoretic foundations, and at the other ad-hoc differentiable logics like Fischer’s DL2, conceived for deep learning applications. However, no satisfactory foundation has emerged yet. We propose a resolution to this long-standing tension via a novel logic, Quantitative Linear Logic (QLL), with foundational ambitions. Our design is driven by naturality – the idea that, since logical constraints are translated to losses, the semantics of the connectives should be pertinent operations used in ML practice (that is, sum and log-sum-exp) on additive quantities (like logits). We then judge the result on two aspects: logical adequacy – that they satisfy most of the standard logical laws of Linear Logic; and empirical effectiveness – test-time performance (as measured by adversarial attacks) is well-correlated to the actual verification of the logical constraints (as measured by off-the-shelf neural network verifiers), which makes QLL stand out among SoTA techniques.
[LG-174] SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate ICML2026
链接: https://arxiv.org/abs/2605.18745
作者: Lifu Wei,Yinuo Ren,Naichen Shi,Yiping Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Mathematical Finance (q-fin.MF); Computation (stat.CO)
*备注: accepted by ICML 2026
Abstract:Diffusion-based generative models increasingly rely on inference-time guidance, adding a drift term or reweighting mixture of experts, to improve sample quality on task-specific objectives. However, most existing techniques require repeated score or gradient evaluations, introducing bias, high computational overhead, or both. We introduce \textttURGE, Unbiased Resampling via Girsanov Estimation, a derivative-free inference-time scaling algorithm that performs path-wise importance reweighting via a Girsanov change of measure. Instead of computing gradient-based particle weights in previous work, \textttURGE attaches a simple multiplicative weight to each simulated trajectory and periodically resamples. No score, no Hessian, and no PDE evaluation is required. We establish an equivalence between path-wise and particle-wise SMC: the Girsanov path weight admits a backward conditional expectation that recovers the previous particle-level weights, guaranteeing that both schemes produce the same unbiased terminal law. Empirically, \textttURGE outperforms existing inference-time guidance baselines on synthetic tests and diffusion-model benchmarks, achieving better generation quality, while being significantly simpler to implement and fully gradient-free.
[LG-175] Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
链接: https://arxiv.org/abs/2605.18704
作者: Kenan Majewski,Marcin Żugaj
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 49 pages, 9 figures. Preprint submitted to Aerospace Science and Technology
Abstract:Unmanned Aerial Vehicles in dynamic environments face telemetry outages, structural vibrations, and regime-dependent noise that invalidate the stationary covariance assumptions of classical Kalman filters. The Sage-Husa Kalman Filter (SHKF) estimates noise statistics online, but its reliance on a static, scalar forgetting factor forces a strict compromise between steady-state stability and transient responsiveness. We introduce the N-Deep Recurrent Sage-Husa Filter (NDR-SHKF), which replaces this scalar parameter with a vector-valued memory attenuation policy learned by a hierarchical recurrent network operating on whitened innovation sequences. A bifurcated architecture routes shallow recurrent states to capture instantaneous sensor anomalies and deep states to encode sustained dynamic trends, while an auxiliary reconstruction objective prevents feature collapse. The complete filter, including recursive covariance updates, is trained end-to-end via backpropagation through time to directly minimize state estimation error. Evaluations on topologically distinct chaotic attractors demonstrate cross-domain generalization, outperforming purely data-driven baselines that diverge under out-of-distribution dynamics. Furthermore, evaluations on recorded real-world UAV flight datasets validate the framework’s practical viability, demonstrating its capacity to bridge transitions into proprioceptive dead reckoning and outperform classical adaptive estimators during sensor outages.
[LG-176] Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad ICML2026
链接: https://arxiv.org/abs/2605.18694
作者: Zijian Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2026
Abstract:Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular \mathttAdam and \mathttAdamW , often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, \mathttAdaGrad , the origin of adaptive gradient methods. We provide the first provable convergence rate for \mathttAdaGrad in non-convex optimization when the tail index p satisfies 4/3p\leq2 . Notably, this result is achieved without requiring any prior knowledge of p and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by \mathttAdaGrad . Lastly, we consider \mathttAdaGrad\text-\mathttNorm , a popular variant of \mathttAdaGrad in theoretical studies, and show an improved rate that holds for any 1p\leq2 under an extra mild assumption.
[LG-177] Can machine learning for quantum-gas experiments be explainable?
链接: https://arxiv.org/abs/2605.18689
作者: I. B. Spielman amd J. P. Zwolak
类目: Quantum Gases (cond-mat.quant-gas); Machine Learning (cs.LG); Atomic Physics (physics.atom-ph); Quantum Physics (quant-ph)
*备注:
Abstract:Virtually all aspects of many-body atomic physics are challenging: experiments are technically demanding, datasets have become enormous, and the memory and CPU requirements for classical simulation of generic quantum systems often scale exponentially with system size. Machine learning (ML) methods are already assisting in each of these areas and are poised to become transformative. Here, we focus on two specific applications of ML to cold-atom-based quantum simulators. These devices generally generate data in the form of images; we first showcase denoising of raw images and then identify solitonic waves in Bose-Einstein condensates. In both of these examples, we comment on the interplay between performance, model complexity, and interpretability.
[LG-178] PACE: Geometry-Aware Bridge Transport for Single-Cell Trajectory Inference
链接: https://arxiv.org/abs/2605.18587
作者: Chenglei Yu*,Chuanrui Wang*,Bangyan Liao,Tailin Wu
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 31 pages,12 figures
Abstract:Single-cell trajectory inference from destructive time-course snapshots is fundamentally ill-posed: neither cross-time cell correspondences nor continuous trajectories are observed, so the snapshot distributions alone do not uniquely determine the underlying dynamics. Existing optimal transport and flow-based methods typically couple cells by Euclidean proximity at observed clock times, which can misalign trajectories when development is asynchronous and cells sampled at the same experimental time occupy different latent pseudotime stages. We propose PACE, a trajectory inference framework that recovers geometry-consistent continuous transport dynamics from destructive time-course snapshots through three coupled components. First, PACE constructs a state- and time-dependent anisotropic Riemannian metric that assigns low transport cost along locally supported tangent directions while penalizing normal velocity components. Second, it alternates between refining cross-time couplings under the induced path-action cost and fitting endpoint-preserving neural bridges between adjacent snapshots. Third, it distills the learned bridge dynamics into a global continuous-time velocity field over cellular states. Across seven controlled and biological datasets covering nine held-out reconstruction experiments, PACE achieves the strongest overall reconstruction performance, reducing MMD, Wasserstein-1 distance, and Wasserstein-2 distance by 23.7% on average relative to the strongest competing baseline. PACE also improves RNA-velocity alignment by 15.4% on an embryoid body differentiation benchmark, without requiring explicit cell pairing, lineage tracing, or RNA-velocity supervision during training. Code is available at this https URL.
[LG-179] Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
链接: https://arxiv.org/abs/2605.18528
作者: Jiayu Zhang,Tianyi Lin
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 45 pages
Abstract:A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods become important because their normalized layerwise updates can not only support hyperparameter transfer across model sizes but exploit input-output matrix norm geometry. At the same time, stochastic gradient noises in deep learning are often far from sub-Gaussian and may exhibit heavy tails. These crucial observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norm, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over \mathbbR^m\times n with general norms, where the goal is to achieve an \epsilon -stationary point under p^\mathrmth -moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when \frac\max\m,n(\min\m,n)^2 is large enough, any scale-invariant first-order method with spectral norm requires \Omega(\min\m, n\epsilon^-\frac3p-2p-1) oracle calls. We prove that a batched Scion method with spectral norm achieves the matching upper bound of O(\min\m, n\epsilon^-\frac3p-2p-1) . To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to O(\min\m, n\epsilon^-\frac5p-32p-2) when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.
[LG-180] Shallow ReLUs Networks in Lp-Type and Sobolev Spaces: Approximation and Path-Norm Controlled Generalization
链接: https://arxiv.org/abs/2605.18468
作者: Weizhao Li,Fanghui Liu,Lei Shi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages, 1 figure. Authors are listed in alphabetical order and contributed equally
Abstract:We study approximation by shallow ReLU ^s networks, \sigma_s(t)=\max0,t^s , and the generalization behavior of such networks under \ell_1 path-norm control. For the L^p -type integral spaces \widetilde\mathcalF_p,\tau_d,s , 1\le p\le2 , we establish approximation bounds for shallow networks using spherical harmonic analysis. In particular, when the parameter measure is the uniform measure \tau_d and pp^*=(2d+2)/(d+3) , we obtain the rate O(m^-1/2-d(2-p)/(2d(2-p)+2p(2s+d+1))\log^3/2m) , which improves the corresponding random-feature rate. We also derive approximation rates for Sobolev spaces W^\alpha,p in the range 1\le p2 by embedding them into spectral Barron spaces. Finally, for nonparametric regression with sub-Gaussian noise, we prove minimax-optimal generalization bounds for path-norm-regularized shallow ReLU ^s networks over Barron and Sobolev spaces, with matching lower bounds up to logarithmic factors.
[LG-181] Generalized Functional ANOVA in Closed-Form: A Unified View of Additive Explanations
链接: https://arxiv.org/abs/2605.18422
作者: Baptiste Ferrere,Nicolas Bousquet,Fabrice Gamboa,Jean-Michel Loubes
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 34 pages, 23 Figures, 101 equations, 8 Tables
Abstract:The functional ANOVA, or Hoeffding decomposition, provides a principled framework for interpretability by decomposing a model prediction into main effects and higher-order interactions. For independent inputs, this classical decomposition is explicit. It is closely connected to SHAP values, generalized additive models, and orthogonal polynomial expansions, and therefore constitutes a fundamental tool for additive explainability. In the more general and realistic dependent setting, however, obtaining a tractable representation and estimating the decomposition from data remain challenging. In this work, we address this problem for continuous inputs. By combining Hilbert space methods with the generalized functional ANOVA, we build an explicit decomposition Riesz Basis allowing to easily compute the decomposition. Our formulation recovers the classical independent case and its associated orthogonal decomposition. Building on this representation, we propose a simple but mighty algorithm to estimate the decomposition from a data sample in a model-agnostic setting and we compare it empirically with several state-of-the-art explanation methods, demonstrating the power of the approach.
[LG-182] On Stability and Decomposition of Sample Quantiles under Heavy-Tailed Distributions
链接: https://arxiv.org/abs/2605.18370
作者: Choudur Lakshminarayan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 0 figures
Abstract:We study sample quantiles of distributions indexed by estimated parameters, with a on Value-at-Risk related to linear projections of financial returns that whose underlying probability law is heavy-tailed. In this setting, the projection direction and the empirical quantile threshold are estimated from the data, so the standard Bahadur representation under a fixed distribution does not separate the distinct sources of instability. A canonical starting point is Bahadur’s representation, which expresses the sample quantile through the empirical distribution function plus a remainder term \citebahadur1966. Empirical-process theory provides a usable scaffolding through the mechanics of half-spaces, symmetric differences, and Glivenko–Cantelli uniform convergence. They yield stability bounds, but absorb changes in projection direction and changes in quantile threshold into a single symmetric-difference measure. Interestingly, a global uniform-convergence requirement is imposed on what is intrinsically a local quantile-stability problem. This paper introduces a Q-Q orthogonality formulation for separating projection-direction and quantile-threshold effects. The object of interest is the difference between the empirical quantile computed using the estimated projection direction and the population quantile computed at the reference projection direction. We decompose this difference into three terms, \hat q_\alpha(\hat w)-q_\alpha(w_0)=D_1+D_2+D_3 . Here, D_1 measures the population quantile movement induced by perturbing the projection direction, D_2 measures the empirical quantile fluctuation with the projection direction held fixed, and D_3 is the Bahadur-type remainder. Comments: 0 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2605.18370 [stat.ML] (or arXiv:2605.18370v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2605.18370 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-183] Hybrid Quantum-Classical Neural Architecture Search
链接: https://arxiv.org/abs/2605.18345
作者: Alberto Marchisio,Muhammad Kashif,Nouhaila Innan,Muhammad Shafique
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Hybrid quantum-classical neural networks (HQNNs) are emerging as a practical approach for quantum machine learning in the noisy intermediate-scale quantum (NISQ) era, as they combine classical learning components with parameterized quantum circuits in an end-to-end trainable framework. However, their performance and efficiency depend strongly on architectural choices such as data encoding, circuit structure, measurement design, and the coupling between classical and quantum modules. This makes manual design increasingly difficult, especially when hardware limitations and resource constraints must also be taken into account. In this paper, we study the foundations of HQNNs and neural architecture search (NAS), discuss how NAS extends to quantum and hybrid settings, and demonstrate FLOPs-aware search (where FLOPs serve as a proxy for computational complexity), as an important hardware-aware direction for building HQNNs that are not only accurate but also computationally efficient and practically deployable.
[LG-184] Robust Player-Conditional Champion Ranking for League of Legends: Style Similarity Mastery Priors and Archetype-Constrained Discovery
链接: https://arxiv.org/abs/2605.18338
作者: Min Heo,Pranav Kadiyam,Prasun Panthi
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 11 pages, 3 figures
Abstract:Champion recommendation in multiplayer online battle arena games is usually framed informally as a problem of metagame strength, personal comfort, or global win rate. We formalize champion recommendation in League of Legends as an interpretable, player-conditional ranking problem under sparse, noisy, and non-stationary behavioral data. The proposed framework combines four information sources: a population-strength proxy, player-style similarity, direct and indirect mastery priors, and archetype-level guardrails. The method uses robust median/MAD normalization, logarithmic transforms for skewed event counts, recency-weighted player style vectors, mastery-weighted champion-pool vectors, weighted cosine similarity, rank-scaled score components, and k-means++ clustering for coarse archetype support. The implemented prototype uses a Python/Pandas modeling layer, Supabase-backed storage, and a web-facing recommendation interface. Unlike black-box supervised win-prediction systems, the proposed method returns decomposed recommendation scores that can be inspected as expected-performance proxy, fit, mastery, and archetype compatibility. A single-player case study on a 100-game history for the player identifier DIVINERAINRACCON is included as an end-to-end sanity check. The manuscript is therefore a methods and systems contribution: it specifies a reproducible, modular, and auditable champion recommender and gives a validation protocol for future large-scale evaluation through temporal train-test splits, next-champion recovery, calibration analysis, and ablation studies.
[LG-185] QLIF-CAST: Quantum Leaky-Integrate-and-Fire for Time-Series Weather Forecasting
链接: https://arxiv.org/abs/2605.18333
作者: Alberto Marchisio,Aayan Ebrahim,Nouhaila Innan,Muhammad Kashif,Muhammad Shafique
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Accurate and efficient time-series forecasting remains a challenging problem for both classical and quantum neural architectures, particularly in multivariate environmental settings. This work adapts the Quantum Leaky Integrate-and-Fire (QLIF) spiking neural network for time-series regression tasks, specifically short-term multivariate weather forecasting. We extend QLIF beyond classification and demonstrate its applicability to continuous-valued prediction problems. The QLIF-CAST model encodes neuron excitation states as single-qubit quantum superpositions, driven by Rx rotation gates and T1 relaxation decay, and is embedded within a hybrid quantum-classical recurrent architecture. We conduct two distinct evaluations. First, a controlled comparison against a parameter-matched classical LIF baseline on a multivariate weather dataset shows that QLIF-CAST achieves 15.4% lower MSE and 4.4% lower MAE, demonstrating that quantum neuronal dynamics reduce prediction error over classical equivalents. Second, a cross-domain comparative analysis with state-of-the-art quantum LSTM (QLSTM) and quantum neural network (QNN) models on air quality and wind speed benchmarks reveals that QLIF-CAST converges in up to 94% less training time, occupying a distinct position in the speed-error trade-off space. Hardware verification on IBM Marrakesh (156-qubit QPU) confirms reliable circuit execution with only 1.2% average deviation from simulation. Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG) Cite as: arXiv:2605.18333 [quant-ph] (or arXiv:2605.18333v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2605.18333 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-186] Geometric Dictionary Learning of Dynamical Systems with Optimal Transport
链接: https://arxiv.org/abs/2605.18276
作者: Thibaut Germain,Sami Chemlal,Rémi Flamary,Vladimir R. Kostic,Karim Lounici
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Learning dynamical systems through operator-theoretic representations provides a powerful framework for analyzing complex dynamics, as spectral quantities such as eigenvalues and invariant structures encode characteristic time scales and long-term behavior. However, dynamical operators are typically estimated independently for each system, preventing the discovery of shared structure across related dynamics. To address this limitation, we posit that related dynamical systems lie near a low-dimensional manifold in spectral operator space. Based on this hypothesis, we introduce DOODL (Dynamical OperatOr Dictionary Learning), a framework that learns a dictionary of characteristic spectral dynamics whose combinations approximate this manifold and yield compact, interpretable embeddings of individual systems. Beyond representation learning, DOODL enables fast and interpretable operator estimation from short and partially observed trajectories by constraining the estimation to the learned operator manifold. Experiments on metastable Langevin dynamics and turbulent plasma simulations demonstrate that DOODL scales to highly complex multiscale regimes while capturing characteristic spectral structure governing the dynamics rather than merely fitting trajectories, achieving errors one to two orders of magnitude lower than independent operator estimation methods in challenging low-data regimes.
[LG-187] Subject-Specific Analysis of Self-Initiated Attention Shifts from EEG with Controlled Internal and External Attention Conditions
链接: https://arxiv.org/abs/2605.18251
作者: Yuwen Zeng,Dengzhe Hou,Zhang Zhang,Sai Sun,Yongsong Huang,Chia-huei Tseng,Satoshi Shioiri
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Self-initiated attention shifts play a critical role in voluntary behavior but are difficult to study due to the absence of explicit temporal markers. While previous studies have examined their neural correlates, it remains unclear how multi-dimensional electroencephalography (EEG) features contribute to their characterization within an interpretable computational framework. In this study, we build on an experimental paradigm developed in our previous work, which enables controlled comparison between task-constrained self-initiated shifts and externally instructed shifts under identical visual stimulation. Within this setting, we investigate whether preparatory EEG activity can distinguish these two types of attention shifts. We adopt a machine learning-based approach and conduct two complementary analyses: (1) a performance-oriented assessment of frequency-specific topographic patterns, and (2) a model-based feature attribution analysis using SHapley Additive exPlanations (SHAP). These analyses provide a structured view of how spectral features across regions of interest contribute to model behavior. Our results demonstrate reliable within-subject classification performance, indicating that preparatory EEG activity contains subject-specific discriminative information within this paradigm. The analysis shows that higher-frequency bands and frontal regions contribute strongly to model decisions, although such contributions should be interpreted cautiously due to the potential influence of non-neural artifacts in high-frequency EEG signals. Overall, this work highlights the value of interpretable machine learning for analyzing subject-specific EEG signal patterns in a controlled experimental setting, with potential applications in personalized and asynchronous brain-machine interface systems.
[LG-188] A Unified Framework for Structured Flow Modeling: From Continuous Fields to Data-Driven Representations
链接: https://arxiv.org/abs/2605.18250
作者: Diego Casadei
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG)
*备注:
Abstract:Many dynamical systems can be described in terms of structured flows combining source/sink behavior, cyclic dynamics, and topology-constrained transport. These features arise across a wide range of domains, including physical, engineered, and data-driven systems. This work provides a unified perspective on such systems by connecting continuous formulations based on the Helmholtz-Hodge decomposition with discrete and data-driven representations. We review the recently proposed Graph Vector Field (GVF) framework, which enables a decomposition of complex dynamics into gradient, curl, and harmonic components on simplicial complexes, offering both expressivity and interpretability. We then introduce a hierarchy of alternative modeling approaches, including parametric conditional models, linear graph dynamical systems, and reduced Hodge representations, which trade expressive power for computational tractability and reduced data requirements. A key contribution of this work is a cross-domain validation strategy that leverages datasets from well-understood physical systems to verify model correctness and assess robustness independently of the target application domain. This approach enables a systematic evaluation of the trade-offs between model complexity, interpretability, and predictive performance. The resulting framework supports an iterative modeling methodology in which highly expressive models are used as diagnostic tools to identify dominant mechanisms, guiding the construction of simplified models tailored to practical constraints. This work highlights the broad applicability of structured flow modeling and provides a foundation for scalable and interpretable analysis of complex dynamical systems. Subjects: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG) Cite as: arXiv:2605.18250 [physics.data-an] (or arXiv:2605.18250v1 [physics.data-an] for this version) https://doi.org/10.48550/arXiv.2605.18250 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Diego Casadei [view email] [v1] Mon, 18 May 2026 11:52:09 UTC (29 KB)
[LG-189] Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster
链接: https://arxiv.org/abs/2605.18204
作者: Grigory Bartosh,Teodora Pandeva,Sushrut Karmalkar,Javier Zazo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Discrete diffusion models are a powerful class of generative models with strong performance across many domains. For efficiency, however, discrete diffusion typically parameterizes the generative (reverse) process with factorized distributions, which makes it difficult for the model to learn the target process in a small number of steps and necessitates a long, computationally expensive sampling procedure. To reduce the gap between the target and model distributions and enable few-step generation, we propose Forward-Learned Discrete Diffusion (FLDD), which introduces discrete diffusion with a learnable forward (noising) process. Rather than fixing a Markovian forward chain, we adopt a non-Markovian formulation with learnable marginal and posterior distributions. This allows the generative process to remain factorized while matching the target defined by the noising process. We train all parameters end-to-end under the standard variational objective. Experiments on various benchmarks show that, for a given number of sampling steps, our approach produces a higher quality samples than conventional discrete diffusion models using the same reverse parameterization.
[LG-190] Canonical Regularisation of Wide Feature-Learning Neural Networks
链接: https://arxiv.org/abs/2605.18180
作者: George Whittle,Pranav Vaidhyanathan,Juliusz Ziomek,Natalia Ares,Maike A. Osborne
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Wide neural networks in the feature-learning regime drive modern deep learning, and yet they remain far less studied than their kernel-regime counterparts. We consider a critical yet under-explored difference between these two regimes: the regulariser and prior implied by gradient flow training. This canonical regularisation property is well-studied in kernel regime networks – of all the infinite global minima, gradient flow selects exactly the vanishing ridge solution – and underpins the celebrated NN-GP correspondence, precisely allowing the modelling of noise during training. However, we prove ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. Over training, ridge distorts the inductive bias of the network, with a particular damage done to pretrained networks where the implicit prior is informative. We resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime, and crucially generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, we derive geodesic ridge from our framework, generalising ridge to the feature-learning regime. Correspondingly, we prove the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, we propose arc ridge as a minimax-robust, scalable surrogate to geodesic ridge, revealing a deep relationship between early stopping and canonical regularisation across learning regimes. Finally, we demonstrate the consequences of our theory empirically on both image processing and NLP transfer-learning problems.
[LG-191] Buffer-Parameterized Machine Learning Surrogate Models for Cross-Technology Signal Integrity Analysis and Optimization
链接: https://arxiv.org/abs/2605.18170
作者: Julian Withöft,Werner John,Emre Ecik,Ralf Brüning,Jürgen Götze
类目: ignal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 12 pages, 16 figures, 7 tables. This work has been submitted to the IEEE for possible publication
Abstract:Signal integrity (SI) analysis in printed circuit board (PCB) interconnects faces increasing complexity due to diverse integrated circuit (IC) buffer technologies, varying operating conditions, and manufacturing tolerances. Existing machine learning (ML) surrogate models for predicting SI metrics such as the inner eye contour, eye-height (EH), eye-width (EW), and transient waveform features typically rely on fixed buffer parameters, requiring costly new data generation and retraining cycles for every technology shift. This paper introduces a buffer-parameterized ML surrogate modeling methodology capable of handling cross-technology variations without retraining by treating IC buffer characteristics, e.g., clock frequency, supply voltage, rise/fall times, jitter, and internal resistors and capacitors, as dynamic model inputs alongside PCB parameters. To identify the optimal surrogate architecture for this high-dimensional space, a comprehensive benchmarking study compares tree-based methods (RFR/GBM), kernel methods (SVR/KRR), Gaussian process regression (GPR), and neural networks. The framework is subsequently validated on a complex interconnect with 44 design parameters. Results show that while anisotropic GPR excels in low-data regimes, neural networks heavily outperform other models on large datasets. Finally, the practical value of the ML surrogate models is demonstrated through a cross-technology design space exploration and optimization scenario, showcasing massive computational speedups for eye mask compliance checking compared to simulation.
[LG-192] Wasserstein bounds for denoising diffusion probabilistic models via the Föllm er process
链接: https://arxiv.org/abs/2605.18069
作者: Yuta Koike
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 45 pages
Abstract:This paper studies sampling error bounds for denoising diffusion probabilistic models (DDPMs) in the 2-Wasserstein distance. Our contributions are threefold. (i) Under general Lipschitz-type conditions on the score function and for a broad class of variance schedules, including the cosine schedule, we establish sharp upper bounds that are optimal in both the dimension and the number of steps, and recover several sharp error bounds previously obtained in the literature. (ii) We prove that the same Lipschitz-type conditions, which encompass those commonly imposed on the (learned) score, imply a logarithmic Sobolev inequality and hence a quadratic transportation cost inequality for the DDPM. As a consequence, in settings covered by existing work, an optimal Wasserstein bound, up to a logarithmic factor, follows from the recently obtained sharp error bound in the Kullback-Leibler divergence under geometric-type variance schedules. (iii) We show that for general log-concave target distributions, the optimal Wasserstein error bound remains attainable even without a quadratic transportation cost inequality for the target. Our analysis is based on viewing the DDPM sampler as a discretization of the Föllmer process rather than the conventional reverse Ornstein-Uhlenbeck process.
[LG-193] A note on connections between the Föllm er process and the denoising diffusion probabilistic model
链接: https://arxiv.org/abs/2605.18040
作者: Yuta Koike
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 32 pages
Abstract:The Föllmer process is a Brownian motion conditioned to have a pre-specified distribution at time 1. This process can be interpreted as an “augmented” time-compressed version of the reverse stochastic differential equation (SDE) for the denoising diffusion probabilistic model (DDPM). While this fact has been indirectly used to analyze DDPM sampling errors via discretization of the reverse SDE, connections between direct discretization of the Föllmer process and the DDPM sampler have not yet been fully explored. This note aims to clarify this point while surveying relevant results from existing work. We show that discretized Föllmer processes give natural hyper-parameter settings of the DDPM sampler. Moreover, this allows us to systematically recover state-of-the-art results on DDPM sampling error bounds with slight improvements.
[LG-194] Real-time Multi-instrument Autonomous Discovery of Novel Phase-change Memory Materials
链接: https://arxiv.org/abs/2605.18033
作者: Chih-Yu Lee,Haotong Liang,Ryan Kim,Austin McDannald,Carlos A Rios Ocampo,A. Gilad Kusne,Ichiro Takeuchi
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 25 pages, 5 figures
Abstract:Autonomous labs enable the integration of automated experiment execution, data analysis and decision making. The main challenge remains the integration of diverse data streams from multiple instruments, where the data is often heterogeneous and unsynchronized. The standard learning process of undetermined synthesis-process-structure-property relationships (SPSPR) usually relies on post-experiment analysis after data is fully collected, not during live experiments, and decision making is carried out independently across characterization equipment. Here, we demonstrate the Multi-instrument Autonomous Discovery (MAD) framework – combining structural property mapping and functional property optimization simultaneously in a closed-loop manner. As an example, we applied MAD to phase change memory (PCM) materials, and, in particular on the Mn-Sb-Te ternary, a previously unexplored materials system for PCM. A multi-output model is employed to merge data from x-ray diffraction (XRD) and electrical resistance measurements simultaneously through a co-regionalization kernel that models the relationship between them. The output probabilistic posterior and uncertainty quantification facilitate decision making with shared knowledge, while the goals are different across tasks. We aimed to maximize the knowledge of crystal structure distribution using non-negative matrix factorization (NMF), while in parallel, we find the composition with the maximum resistance value, an important figure of merit for PCM. Leveraging MAD, we found promising electrical PCMs and identified the SPSPR within 25 closed-loop iterations, corresponding to a seven-fold speed-up. The framework opens a new path of study in large-scale autonomous facilities, where future experiments can be run in parallel together, not independently.
[LG-195] Long-horizon prediction of three-dimensional wall-bounded turbulence with CTA-Swin-UNet and resolvent analysis
链接: https://arxiv.org/abs/2605.17888
作者: Bo Chen,Yitong Fan,Jie Yao,Weipeng Li
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 40 pages, 18 figures
Abstract:Long-horizon prediction of three-dimensional (3D) wall-bounded turbulence with machine-learning methods remains a challenging task, due to the rapid accumulation of autoregressive errors and the substantially computational cost. To address these challenges, we present a hybrid machine-learning framework, in which a channel-time-attention Swin-UNet (CTA-Swin-UNet) and a multi-time-scale fusion correction (MTFC) strategy are developed to predict the turbulent flow fields in a wall-parallel plane, with affordable computational cost. Then, 3D flow fields are reconstructed via a resolvent-based spectral linear stochastic estimation (SLSE), rooting from the predicted planar flow. Results show that the CTA-Swin-UNet outperforms the baseline models (LSTM, FNO and traditional Swin-UNet) in both single-step prediction and autoregressive rollouts, indicating the effectiveness of introducing the CTA module into the Swin-UNet architecture. At the same temporal interval, the CTA-Swin-UNet remains stable for approximately 150 rollout steps, while the baseline models fail within 20 to 50 rollout steps. After introducing the MTFC strategy, a longer horizon upto 300 steps is achieved. Using the resolvent-based SLSE reconstruction further recovers the 3D flow structures and energy spectral distributions from the predicted planar inputs, which demonstrates that the proposed framework provides an effective and computationally efficient approach for long-horizon autoregressive prediction of 3D wall-bounded turbulence.
[LG-196] Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models
链接: https://arxiv.org/abs/2605.17778
作者: Radu Lecoiu,Debarghya Mukherjee,Pragya Sur
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 103 pages, 8 figures
Abstract:Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with s spikes, s -step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that s steps are necessary for optimality: any (s-k) -step distilled estimator is strictly suboptimal for 1 \leq k \leq s . For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.
[LG-197] Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
链接: https://arxiv.org/abs/2605.17767
作者: Behrad Moniri,Hamed Hassani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent, such updates are fundamentally limited: they are approximately rank-one, capturing only a single direction, and require the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes \eta_1 \asymp N^\alpha_1 and \eta_2 \asymp N^\alpha_2 for \alpha_1, \alpha_2 \in [0,0.5) . We derive a sharp spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of these outliers is determined by the scaling parameters \alpha_1 and \alpha_2 through \lfloor \frac\alpha_21/2 - \alpha_1 \rfloor . Furthermore, by analyzing the alignment between these learned directions and the target function, we identify a qualitative gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, under the condition that \alpha_1, \alpha_2 are chosen properly. This confirms that the benefits of batch reuse, previously observed in finite-width regimes, persist in the high-dimensional linear-width limit. By characterizing these early-phase spectral transitions, our work establishes a tractable mathematical framework for studying optimization and feature learning phenomenology in modern overparameterized networks.
[LG-198] StatQAT: Statistical Quantizer Optimization for Deep Networks
链接: https://arxiv.org/abs/2605.17745
作者: Mehmet Aktukmak,Daniel Huang,Ke Ding
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Quantization is essential for reducing the computational cost and memory usage of deep neural networks, enabling efficient inference on low-precision hardware. Despite the growing adoption of uniform and floating-point quantization schemes, selecting optimal quantization parameters remains a key challenge, particularly for diverse data distributions encountered during training and inference. This work presents a novel statistical error analysis framework for uniform and floating-point quantization, providing theoretical insight into error behavior across quantization configurations. Building on this analysis, we propose iterative quantizers designed for arbitrary data distributions and analytic quantizers tailored for Gaussian-like weight distributions. These methods enable efficient, low-error quantization suitable for both activations and weights. We incorporate our quantizers into quantization-aware training and evaluate them across integer and floating-point formats. Experiments demonstrate improved accuracy and stability, highlighting the effectiveness of our approach for training low-precision neural networks.
[LG-199] Sequential Structure in Intraday Futures Data: LSTM vs Gradient Boosting on MNQ
链接: https://arxiv.org/abs/2605.17724
作者: Mathias Mesfin
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Statistical Finance (q-fin.ST)
*备注: 18 pages, 4 figures. All results based on out-of-sample walk-forward validation and permutation testing. Data: MNQ futures (2021-2025)
Abstract:This paper compares gradient boosting and long short-term memory (LSTM) architectures for intraday directional prediction in Micro E-Mini Nasdaq 100 futures (MNQ). Motivated by recent foundation-model research on financial candlestick data, including the Kronos architecture, we test whether five-minute OHLCV bar sequences contain exploitable sequential predictive structure at the scale of a single instrument dataset. Using 944 trading days from 2021-2025, four model configurations are evaluated under strict expanding-window walk-forward validation across three out-of-sample periods. The target variable is whether the session close exceeds the 10:30 AM open by more than ten points. No configuration produces statistically significant out-of-sample accuracy above the 51.8% base rate. Combined OOS accuracies range from 50.00% to 50.89% across gradient boosting variants, while the LSTM achieves 50.59%. Permutation tests yield p-values of 0.135 for the best gradient boosting model and 0.515 for the LSTM, indicating no statistically significant predictive edge. Feature importance instability across walk-forward folds suggests noise fitting rather than stable structural signal capture. The results indicate that four years of single-instrument five-minute OHLCV data are insufficient for reliable sequential ML-based intraday forecasting. The primary contribution is a documented evaluation of a Kronos-inspired architecture on a constrained real-world dataset, providing an empirical lower bound on data scale requirements for sequential financial ML.
[LG-200] How does feature learning reshape the function space?
链接: https://arxiv.org/abs/2605.17718
作者: João Lobo,Bruno Loureiro,Long Tran-Than,Fanghui Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 59 pages, 1 figure
Abstract:Feature learning is widely regarded as the key mechanism distinguishing neural networks from fixed-kernel methods, yet its impact on the induced function space remains poorly understood. In this work, we precisely characterize how the function space spanned by the features of a two-layer neural network evolves during gradient descent training. We prove that, in the high-dimensional proportional regime, after a large gradient step the post-update feature distribution is well approximated by a target-dependent spiked Gaussian covariance. This induces a data-adaptive kernel that reshapes the function space and modifies its spectral structure. Our analysis reveals that feature learning can be interpreted as a distributional transformation in either parameter space or input space, equivalently as the introduction of a target-dependent kernel. In particular, it selectively amplifies eigenvalues aligned with the target direction and mixes leading eigenfunctions, coupling the top radial mode with a target-aligned quadratic harmonic. Overall, our results provide a precise function-space perspective on early-stage feature learning: rather than just rescaling a fixed kernel, gradient descent induces a data-adaptive deformation that preferentially enhances directions aligned with the signal in the data.
[LG-201] Online Conformal Prediction for Non-Exchangeable Panel Data
链接: https://arxiv.org/abs/2605.17705
作者: Daohong Tu,Kay Giesecke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 34 pages, 5 figures
Abstract:Panel data, in which multiple units are repeatedly observed over time, arise throughout science and engineering. Quantifying predictive uncertainty in such settings is challenging because conformal prediction, while distribution-free and model-agnostic, classically relies on exchangeability assumptions that fail under temporal dependence and unit heterogeneity. We propose a simple online conformal framework for non-exchangeable panel data. The method exploits a key feature of online panel prediction: when a forecast is required for one unit, contemporaneous outcomes from related units may already be observed and can serve as a calibration panel. At each round, prediction sets are formed using currently observed calibration units together with two adaptive quantities: history-based similarity weights that emphasize calibration units resembling the target, and an adaptive miscoverage level that is updated whenever target feedback is revealed. This two-state design yields a stepwise coverage bound and a long-run coverage guarantee. Empirically, across synthetic and real panel data sets, the method improves coverage on the worst-covered target units through adaptive interval-width allocation rather than uniform inflation. The two states are complementary: similarity weights protect coverage when target feedback is sparse, while the adaptive level further improves coverage as feedback accumulates.
[LG-202] On Gaussian approximation for entropy-regularized Q-learning with function approximation
链接: https://arxiv.org/abs/2605.17678
作者: Artemy Rubtsov,Rahul Singh,Eric Moulines,Alexey Naumov,Sergey Samsonov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak–Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and a polynomial stepsize k^-\omega , \omega \in (1/2,1) . Assuming that the sequence of observed triples (s_k,a_k,s_k+1)_k \geq 0 forms a uniformly geometrically ergodic Markov chain, and under suitable regularity conditions for the projected soft Bellman equation, we establish a Gaussian approximation bound in the convex distance with rate of order n^-1/4 , up to polylogarithmic factors in n , where n is the number of samples used by the algorithm. To obtain this result, we combine a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term. Finally, we derive high-order moment bounds for the algorithm’s last iterate, which might be of independent interest.
[LG-203] ML-based Fast Simulation of FARICH Responses
链接: https://arxiv.org/abs/2605.17635
作者: Foma Shipilov,Alexander Barnyakov,Vladimir Bobrovnikov,Artem Ivanov,Sergey Kononov,Fedor Ratnikov
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*备注: to be published in 7th International Workshop on Future Tau Charm Facilities (FTCF2025) proceedings
Abstract:A fast simulation of the detector response is a vital task in high-energy physics (HEP). Traditional Monte-Carlo methods form the backbone of modern particle physics simulation software but are computationally expensive. We present a machine-learning-based approach to fast simulation of the Focusing Aerogel Ring Imaging Cherenkov (FARICH) detector response. Given a particle track and momentum, the goal is to generate realistic samples of photon hits on the detector matrix. We propose a conditional Generative Adversarial Network (cGAN) with a lightweight convolutional architecture that reproduces the projected detector response conditioned on particle parameters. We compare the cGAN against a linear statistical baseline using metrics applied to probability maps and to the reconstructed velocity distributions. The cGAN produces realistic samples and provides a significant speed-up over Monte-Carlo simulation.
[LG-204] Longwang: Zero-Shot Global Spatiotemporal Precipitation Downscaling with a Latent Generative Prior
链接: https://arxiv.org/abs/2605.17603
作者: Yue Wang,Daniele Visioni
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:High-resolution precipitation information is essential for climate impact assessment, yet global climate models remain too coarse to resolve key small-scale processes. Existing machine learning downscaling methods often require paired low- and high-resolution data for supervised learning, are tied to fixed regions or scale factors during inference, and can be computationally expensive to train and run in physical space. Here we introduce Longwang, a zero-shot latent generative framework for global spatiotemporal precipitation downscaling. Longwang learns a context-conditioned latent generative prior and combines it with a physically informed observation operator through posterior sampling, enabling daily O(10 km) precipitation fields to be generated from monthly O(100 km) inputs. On ERA5 reanalysis, Longwang outperforms standard posterior sampling with an unconditional generative prior in reconstructing fine-scale spatial patterns, preserving temporal coherence, and recovering extreme precipitation intensities. The framework further generalizes to historical climate simulations and future climate projections under substantial distribution shift.
[LG-205] opological Data Analysis combined with Machine Learning for Predicting Permeability of Porous Media
链接: https://arxiv.org/abs/2605.17581
作者: Ebru Dagdelen,Catherin Neena Lalu,Aakash Karlekar,Manav Arora,Matthew Illingworth,Jonathan Jaquette,Linda Cummings,Lou Kondic
类目: oft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注:
Abstract:Flow in porous media is difficult to address using standard analytical or numerical methods due to its complexity. However, since synthetic representations of porous media are easy to produce and data from physical experiments are becoming more widely available, the problem is well-suited to studies that include machine learning (ML) techniques. We discuss a number of features that can be extracted from such data, and their utility as input variables into a standard ML algorithm. These features include structural measures describing the geometry of the porous media, topological measures describing the connectivity, and network measures obtained by modeling the porous media as simplified pore networks. These features enable the prediction of the permeability of the considered (synthetic) porous materials using ML techniques that also leverage the separately computed exact permeability (ground truth). Comparing results obtained using different input variables helps develop a better understanding of the utility of various measures for predicting permeability based on the porous media structure. We show, in particular, that topological data analysis (TDA) provides a useful set of features that can be easily combined with ML to yield meaningful results.
[LG-206] Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling
链接: https://arxiv.org/abs/2605.17546
作者: Tianyue Yang,Sandro Tacchella,Xiao Xue
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: 19 pages, 8 figures
Abstract:Understanding galaxy morphology evolution across cosmic time requires models that can generate realistic galaxy populations conditioned on redshift. In this work, we study efficient redshift-conditioned generative modeling for astrophysical image synthesis using diffusion models and pixel-MeanFlow. We first review the connections between score-based diffusion models, Flow Matching, one-step generative models, and modern diffusion samplers. We then evaluate DDPM, DDIM, DEIS-AB2, DPM++2M, and one-step pixel-MeanFlow on the GalaxiesML-64 dataset using morphology-based metrics, including ellipticity, semi-major axis, Sérsic index, and isophotal area. Our results show a clear accuracy-efficiency trade-off: standard DDPM sampling achieves the best distributional fidelity but requires high computational cost, while second-order samplers substantially improve efficiency over DDIM. Pixel-MeanFlow enables single-step generation and achieves competitive performance on several morphology statistics, though it remains weaker than many-step DDPM for fine-grained structure. Our results demonstrate that one-step generative models can recover key galaxy morphology statistics at orders-of-magnitude lower computational cost, opening a path toward efficient conditional simulators for large cosmological surveys and simulation-based scientific inference.
[LG-207] Causal Anomaly Detection for Lithium-Ion Battery Degradation
链接: https://arxiv.org/abs/2605.17334
作者: Dieter W. Heermann,Hagen Heermann
类目: Materials Science (cond-mat.mtrl-sci); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Reliable early detection of lithium-ion battery degradation requires health indicators that are physically interpretable and computable from routine cycler telemetry without access to the degradation region. We introduce \textscCausalHealth, a framework that applies causal graph discovery and k -nearest-neighbour transfer entropy to per-cycle voltage, current, temperature, and resistance time series, and organises twelve resulting anomaly scores into three signal-class bundles (Magnitude-shift, Predictive-residual, Complexity-entropy) – with Isolation Forest reported separately as it falls below the bundle reliability threshold – to characterise detection sensitivity across ten commissioning fractions (5–30,%). The Magnitude-shift class achieves 100,% detection across all seven tested cells spanning LFP (MIT–Stanford MATR) and LCO (NASA PCoE, CALCE CS2) chemistries, with a lead time of up to 402 cycles before conventional capacity-threshold failure on gradual-fade cells. A Reliability-Weighted Master Health Index (RWMHI) – a cross-bundle fusion of five high-reliability detectors weighted by inverse coefficient of variation – improves lead time by 15–52 cycles over the class median on long-lived cells while maintaining 100,% detection. Validation against electrochemical impedance spectroscopy on an NMC prismatic cell provides independent physical grounding: transfer entropy \mathrmTE(R !\to! V) correlates with charge-transfer resistance R_\mathrmct (pooled r = +0.990 ; temperature-controlled partial r = +0.898 ), and an Arrhenius analysis of both quantities yields an activation energy consistent with published NMC charge-transfer kinetics. These results are evaluated on seven cells across three benchmark datasets.
[LG-208] FEG-Pro: Forecast-Error Growth Profiling for Finite-Horizon Instability Analysis of Nonlinear Time Series
链接: https://arxiv.org/abs/2605.17282
作者: Andrei Velichko,N’Gbo N’Gbo,Bruno Carpentieri,Mudassir Shams
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 31 pages, 9 figures, 43 references
Abstract:Estimating the largest Lyapunov exponent from a scalar time series is difficult when the governing equations, tangent dynamics, and full state vector are unavailable. We propose FEG-Pro, a forecast-error growth profiling framework for nonlinear scalar time series. The method constructs autocorrelation-guided sparse histories, performs distance-weighted k-nearest-neighbor multi-horizon forecasting, and analyzes the logarithmic growth of geometrically averaged forecast errors. Its primary output is the finite-horizon forecast-error growth slope, lambda_FEG. When the error-growth curve supports a quasi-linear regime, this slope can be compared with reference largest Lyapunov exponents as an estimate of the dominant instability rate. The same pipeline also extracts the formal fit-selection regime, curvature, residual roughness after quadratic detrending, monotonicity, and forecast-error distribution entropy (FEDE) from signed multi-horizon errors. These secondary descriptors are intended not only as diagnostic controls for the slope, but also as candidate machine-learning features for nonlinear signal analysis, because they encode profile geometry and distributional uncertainty not captured by lambda_FEG alone. We evaluate the method on chaotic maps, Mackey-Glass delay dynamics, and scalar Lorenz-63 observables with known or reference exponents. Full-record experiments show good agreement in quasi-linear cases and meaningful curve-shape information in curved or weak profiles. A dyadic length-halving experiment on representative logistic, Mackey-Glass, and Lorenz records shows that residual roughness and mean FEDE often change monotonically and remain interpretable as record length decreases, even when the slope becomes biased or highly variable. The results support treating forecast-error growth as a structured profile and feature-generation framework rather than a single-number estimator.
[LG-209] A Hybrid Gaussian Process Regression Framework for Stable Volatility-Covariance Estimation: Evidence from Global Equity Indices CCL
链接: https://arxiv.org/abs/2605.17275
作者: Ujjwala Vadrevu
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG)
*备注: Working paper. Replication code available at: this https URL
Abstract:Accurate forecasting of the Volatility-Covariance Matrix (VCV) is central to regulatory capital adequacy processes such as the Internal Capital Adequacy Assessment Process (ICAAP) and the Comprehensive Capital Analysis and Review (CCAR). Traditional econometric models, including GARCH-family and Exponentially Weighted Moving Average (EWMA) approaches, suffer from parametric rigidity, distributional assumptions, and numerical instability under stress, leading to systematic underestimation of tail risk. This paper proposes and validates a novel Hybrid Gaussian Process Regression-Historical Simulation (GPR-HS) framework for estimating Value-at-Risk (VaR) and Expected Shortfall (ES) across a diversified portfolio of seven major global equity indices. The framework decouples the VCV estimation problem: individual asset volatilities are modelled dynamically using Univariate GPR with a Matern 5/2 kernel, while inter-asset correlations are estimated via stable historical covariance. A key methodological contribution is the Aggressive Noise Initialization (ANI) strategy, which sets the initial White Noise kernel variance equal to the empirical variance of the training returns, ensuring Gram matrix positive-definiteness, regularization, and conservative, regulatory-compliant forecasts. Evaluated using an expanding window forward-chaining cross-validation scheme over June 2020 -June 2025, the GPR-HS framework achieves regulatory compliance in the majority of test splits; including a 100% ES pass rate at the portfolio level, while outperforming the static Historical VaR benchmark in 71.4% of univariate cases by Quadratic Loss and 100% of cases by violation count.
[LG-210] Scalable Bi-causal Optimal Transport via KL Relaxation and Policy Gradients
链接: https://arxiv.org/abs/2605.17271
作者: Haoyang Cao,Jesse Hoekstra,Renyuan Xu,Yumin Xu,Ruixun Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Bi-causal optimal transport (OT) is a natural framework for comparing and coupling stochastic processes under nonanticipative information constraints, with important applications in robust finance, sequential uncertainty quantification, and multistage stochastic optimization. In particular, a learned bi-causal coupling naturally serves as a simulator for generating joint sample paths that respect both prescribed marginal laws and the underlying information flow. Its practical use, however, is limited by the computational difficulty of enforcing bi-causal coupling constraints over path space, especially for continuous distributions and long horizons. We develop a scalable stochastic-optimization framework for computing bi-causal OT couplings under general marginals. Our approach introduces a Kullback–Leibler (KL)-penalized relaxation that replaces hard marginal constraints with tractable divergence penalties while preserving the recursive structure of the problem. We establish dynamic programming principles for both the original and relaxed formulations, prove that the relaxed problem converges to the original bi-causal OT problem as the penalty grows, and derive explicit policy-gradient representations for the relaxed objective. Building on these results, we propose a practical policy-gradient algorithm with unbiased mini-batch estimators, variance reduction, and nonasymptotic regret guarantees. Numerical experiments show that the method accurately captures marginal laws and temporal dependence, and performs well in applications including robust subhedging and time series statistical downscaling. These results provide a scalable computational approach to bi-causal OT and broaden its applicability in settings where nonanticipative information constraints are essential.
[LG-211] Maximum Likelihood Decoding of Quantum Error Correction Codes
链接: https://arxiv.org/abs/2605.17230
作者: Hanyan Cao,Ge Yan,Yuxuan Du,Feng Pan
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: An invited topical review. Comments are welcome
Abstract:Quantum error correction (QEC) is indispensable for realizing fault-tolerant quantum computation, yet its effectiveness hinges critically on the classical decoding algorithm that interprets noisy syndrome measurements. Among all possible decoding strategies, maximum likelihood decoding (MLD) is provably optimal, since it identifies the logical group with largest likelihood by summing over all possible errors within logical class consistent with the observed syndrome. Despite its optimality, MLD is computationally intractable in general (#P-hard), motivating a rich landscape of exact and approximate algorithms. In this topical review, we provide a unified perspective on MLD by surveying recent advances through three complementary lenses: statistical mechanics, tensor networks, and artificial intelligence. From the statistical mechanics viewpoint, the MLD problem maps onto evaluating partition functions of disordered spin models, enabling exact solutions for certain codes and noise models as well as threshold estimation via phase-transition analysis. From the tensor network perspective, approximate contraction of tensor networks on the code’s factor graph yields decoders that closely approach MLD accuracy with polynomial computational cost. From the artificial intelligence perspective, neural-network-based decoders, including autoregressive generative models and recurrent transformers, learn to approximate the MLD distribution from data, achieving high accuracy with the parallelism afforded by modern hardware accelerators. We discuss the connections among these three approaches, review their application to both simulated and experimental quantum hardware, and outline open challenges including real-time decoding, scalability to large code distances, and generalization to high-rate quantum low-density parity-check codes.
[LG-212] oward Near-Real-Time Marine Oil Spill Detection in SAR Imagery using Quantum-Assisted SVM
链接: https://arxiv.org/abs/2605.17217
作者: Joseph Strauss,Jyotsna Sharma
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Marine oil spills require rapid detection to mitigate severe ecological and economic damage. While satellite-based Synthetic Aperture Radar (SAR) provides essential all-weather monitoring, analyzing this data remains challenging. Deep learning models often require massive datasets and incur high latency. To address this, a pixel-wise quantum-assisted Support Vector Machine (QSVM) bagging ensemble is developed. Quantum annealing is leveraged to optimize the support vectors of individual weak SVMs on small data subsets, which are then classically aggregated. The approach is evaluated on Sentinel-1 imagery using both quantum simulation and physical quantum annealing hardware. The quantum-assisted pipeline achieved performance comparable to a rigorous classical baseline, yielding an Intersection-over-Union (IoU) of 0.60 and a balanced accuracy of 0.89. Complementary experiments with gate-based quantum computing demonstrated similar segmentation accuracy, although the annealing approach offered superior inference efficiency. Generalization was further assessed on independent oil spill imagery from the Strait of Hormuz, demonstrating the potential transferability of the trained pipeline to geographically distinct spill events. These results establish the feasibility of quantum-assisted, segmentation pipelines for near-real-time environmental monitoring. Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG) Cite as: arXiv:2605.17217 [quant-ph] (or arXiv:2605.17217v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2605.17217 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-213] Sample efficient inductive matrix completion with noise and inexact side information
链接: https://arxiv.org/abs/2605.17189
作者: Yuepeng Yang,Cong Ma
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Low-rank matrix completion is a widely studied problem with many variants. Inductive matrix completion (IMC) incorporates row and column side information to significantly narrow the search space. Prior work falls into two regimes: methods that exploit this structure to achieve reduced sample complexity but only in noiseless settings, and methods that handle noise but require sample complexity matching the ambient matrix dimension, forfeiting the sample efficiency that side information should provide. In this paper, we close this gap by studying noisy IMC with a nonconvex projected gradient descent algorithm with spectral initialization. Our main technical contribution is establishing a regularity condition for the IMC loss function that holds at the reduced sample complexity determined by the effective problem size, scaling with the side information dimension a rather than the ambient dimension n. This directly yields linear convergence and an estimation error that both depend only on the effective problem size rather than the ambient matrix dimension. We further extend our analysis to the inexact side information setting, demonstrating that the reduced sample complexity is maintained and the estimation error is order-optimal with respect to the inexactness of the side information. Extensive simulations and real-world experiments on the MovieLens dataset validate our theoretical findings.
[LG-214] High-dimensional Limit of SGD for Diagonal Linear Networks
链接: https://arxiv.org/abs/2605.17177
作者: Begoña García Malaxechebarría,Courtney Paquette,Maryam Fazel,Dmitriy Drusvyatskiy
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 91 pages, 5 figures
Abstract:Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.
[LG-215] Sparse Mamba Decoder for Quantum Error Correction: Efficient Defect-Centric Processing of Surface Code Syndromes
链接: https://arxiv.org/abs/2605.17156
作者: Samira Sayedsalehi,Nader Bagherzadeh,Maxim Shcherbakov,Jean-Luc Gaudiot
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 22 pages, 7 figures, 10 tables. Neural decoder for surface code quantum error correction. Submitted to Quantum
Abstract:Quantum error correction (QEC) is essential for building fault-tolerant quantum computers, requiring decoders that are simultaneously accurate, fast, and scalable. Most state-of-the-art neural decoders achieve high accuracy but process the full dense syndrome array of size O(d^2 R) regardless of the actual error rate, where d is the code distance and R is the number of measurement rounds. At physically relevant error rates (p ~ 0.1%), fewer than 5% of syndrome entries contain active detection events – yet existing decoders process the entire syndrome volume. We introduce the Sparse Mamba Decoder (SMD), a defect-centric neural decoder that processes only the k active detection events using a 13-dimensional feature representation per defect and a Mamba state-space backbone, achieving O(k) complexity. Across depolarizing, uniform circuit-level, SI1000, and Google Sycamore experimental benchmarks, SMD reduces the MWPM logical error rate by up to 49% at d \le 5 under SI1000 noise, runs 95-467x faster than the Tesseract near-MLD decoder and 232-463x faster than Belief Matching, and maintains nearly constant latency (24-57 us) across d = 3-9 under uniform circuit-level noise. On the Sycamore experimental dataset, the SMD ensemble matches or slightly surpasses the dense Mamba decoder of Varbanov et al. All results are obtained on commodity NVIDIA GPUs with 7.5-16M parameters, without specialized accelerators.
[LG-216] Multi-task Linear Regression without Eigenvalue Lower Bounds: Adaptivity Robustness and Safety ICML2026
链接: https://arxiv.org/abs/2605.17126
作者: Seok-Jin Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted at ICML 2026
Abstract:We study the multi-task linear regression problem in the presence of contaminated tasks. We address the setting where the unknown parameters of a majority of tasks are close in the \ell_2 -norm, while a fraction of tasks are arbitrary outliers. Existing theoretical frameworks for this problem rely heavily on the assumption that the empirical second moment of each task has a minimum eigenvalue bounded away from zero (order \Omega(1) ). Crucially, this assumption fails in many high-dimensional scenarios, rendering prior guarantees vacuous. To overcome this limitation, we propose an estimator based on matrix-weighted norm regularization. We also introduce a relative balancedness condition, quantified by a balancedness constant, that compares each task’s second moment with the average inlier geometry and relaxes the need for taskwise second-moment lower bounds. In favorable regimes with moderate balancedness, our prediction MSE bounds match the rate of Duan and Wang (2023) under substantially weaker spectral assumptions; the resulting task-overall MSE is minimax optimal up to logarithmic factors. Furthermore, we demonstrate that our estimator enjoys a safety guarantee: when the relevant balancedness constant is large or infinite, or when tasks are unrelated, the method performs no worse than independent task learning. Consequently, our methodology achieves simultaneous adaptivity to task similarity, robustness to outliers, and safety outside favorable transfer regimes.
[LG-217] Diffusion-Based Stochastic Operator Networks for Uncertainty Quantification in Stochastic Partial Differential Equations
链接: https://arxiv.org/abs/2605.17107
作者: Phuoc-Toan Huynh,Richard Archibald,Feng Bao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注:
Abstract:We introduce a novel framework for uncertainty quantification of solution operators associated with stochastic partial differential equations (SPDEs). Although SPDEs play a central role in modeling complex physical systems under uncertainty, their practical use typically requires specifying the magnitude and structure of model uncertainties that are often unknown and difficult to infer from noisy measurements. To address this challenge, we develop a stochastic operator-learning framework that learns directly from noisy data and outputs both a mean solution field and a quantification of uncertainty. The proposed method, namely the Stochastic Operator Network (SON), is constructed by combining the structure of the Deep Operator Network (DeepONet) with Stochastic Neural Networks (SNNs) to model stochasticity and enable probabilistic prediction. The training procedure is carried out by minimizing a Hamiltonian-type loss and optimizing the resulting objective using the Stochastic Maximum Principle. Numerical experiments on benchmark SPDEs under multiple uncertainty sources demonstrate the accuracy and robustness of the proposed method in capturing solution structure and quantifying predictive uncertainty.
[LG-218] mathcalO(n) alternative to Quantum Fourier Transform with efficient neural net classical post-processing
链接: https://arxiv.org/abs/2605.16998
作者: Kaiming Bian,Zujin Wen,Oscar Dahlsten
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:The Quantum Fourier Transform (QFT) is required by hidden subgroup problem (HSP) algorithms, including Shor’s algorithm for factoring. The circuit depth of the QFT remains challenging for near-term hardware. To find shallower alternatives we identify two properties that are exploited by the QFT to enable HSP. Firstly, the shift invariance of the QFT allows for the removal of a random overall shift. Secondly, the QFT retains information about the hidden subgroup generator accessible in the measurement outcomes. We quantify that information via the discrete Fisher information. We construct a family of shallow circuits using Hadamards and controlled-Phase gates, HP- L circuits, that we prove preserve shift invariance. Numerical analysis shows these circuits retain exponentially growing Fisher information. The \mathcalO(n) HP- 1 can replace the \mathcalO(n^2) QFT in Shor’s algorithm, as demonstrated numerically, with an efficient neural network implementing classical post-processing.
[LG-219] CAST: Causal Anchored Simplex Transport for Distribution-Valued Time Series
链接: https://arxiv.org/abs/2605.16919
作者: Jiecheng Lu,Jieqi Di,Runhua Wu,Yuwei Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Many decision-facing stochastic systems are observed through aggregate distributions rather than scalar trajectories: queue occupancies, mobility shares, public-health mixtures, generation-source shares, ecological compositions, and air-quality severity profiles all live on the probability simplex and evolve over time. We study causal (online) forecasting for these distribution-valued time series and argue that the transition operator itself should be structured around the simplex. We introduce CAST (Causal Anchored Simplex Transport), a successor-local operator that (i) retrieves empirical successors from causal context, (ii) stabilizes them with a persistence anchor, and (iii) applies a bounded local stochastic transport on ordered supports; every stage preserves the simplex by construction. We identify a structural failure mode, latent transition-kernel aliasing, where similar observed distributions evolve differently under different contextual regimes, and prove that any forecaster depending only on an aliased summary incurs an irreducible weighted Jensen-Shannon excess-risk lower bound, while the CAST hypothesis class contains the regime-aware Bayes successor; for ordered supports an additional Pinsker separation holds whenever the transported successor lies outside the no-transport anchor hull. On eleven public and simulated benchmarks spanning ecology, energy, diet, mortality, employment, air quality, severe weather, mobility, and G/G/1, G_t/G/1 queue occupancy, CAST attains the best average rank on both one-step KL (1.27) and autoregressive rollout JSD (1.91), winning 8/11 sections on each metric against a broad statistical, compositional, recurrent, convolutional, and Transformer baseline set, and top-2 on all 11 sections for offline KL. Component ablations and a controlled synthetic aliasing experiment corroborate the theory.
[LG-220] A Fourier perspective on the learning dynamics of neural networks: from sample complexities to mechanistic insights
链接: https://arxiv.org/abs/2605.16913
作者: Fabiola Ricci,Claudia Merger,Sebastian Goldt
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:Neural networks trained with gradient-based methods exhibit a strong simplicity bias: they learn simpler statistical features of their data before moving to more complex features. Previous analyses of this phenomenon have largely focused on settings with (quasi-)isotropic inputs. In this work, we study the simplicity bias from a Fourier perspective, which allows us to include two key features of natural images in the analysis: approximate translation-invariance and power-law spectra. We first show experimentally that simple neural networks trained on image classification tasks first rely on amplitude information – related to pair-wise correlations between pixels – before exploiting phase information, which encodes edges and higher-order correlations. In view of this, we introduce a synthetic data model for translation-invariant inputs that allows precise control over amplitudes and phases while remaining tractable. We rigorously establish that for isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online stochastic gradient descent (SGD) cannot distinguish the structured inputs from noise within n \ll N^3 steps, but needs at least n \gg N^3 \log^2N steps. In contrast, we show both experimentally and theoretically that power-law spectra can dramatically accelerate the speed of learning phase information, even if the spectra do not help with classification. Simulations with two-layer networks trained on textures and with deep convolutional networks on ImageNet and CIFAR100 confirm this non-trivial interaction between amplitudes and phases, providing mechanistic insights into how deep neural networks can learn natural image distributions efficiently.
[LG-221] HYVINT: Intensity-Driven Hypergraph Generation with Variational Representations
链接: https://arxiv.org/abs/2605.16836
作者: Xinyi Hong,Shuntuo Xu,Zhou Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Hypergraphs provide a principled framework for modeling polyadic interactions, with applications in recommendation systems, social networks, and molecular modeling. Hypergraph generation remains challenging because incidence structures are discrete, sparse, and governed by heterogeneous higher-order interactions. Existing generators often rely on implicit latent spaces or continuous incidence decoders, which provide limited mechanistic interpretation of how node-hyperedge incidences arise. To address these limitations, we propose HYVINT, an intensity-driven hypergraph generative framework. Our key innovations are twofold: (i) we develop an intensity-driven incidence formation mechanism for hypergraphs that links latent interaction strength to binary incidence, and (ii) we derive a tractable lower-bound variational estimator for learning latent representations. We provide generation error bounds with asymptotic convergence rates and empirically show that HYVINT achieves strong fidelity while maintaining substantial novelty and diversity on synthetic and real-world hypergraphs.
[LG-222] Statistical Unlearning of Distributions: A Hypothesis Testing Approach
链接: https://arxiv.org/abs/2605.16645
作者: Aaradhya Pandey,Sanjeev Kulkarni
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Comments welcome
Abstract:Machine learning systems increasingly face requirements to forget not only individual data points, but entire domains of information, such as toxic language, copyrighted corpora, or demographic biases. This raises a fundamental dilemma of statistical-computational tradeoffs: removing all samples from an unwanted domain may be computationally prohibitive, while randomly removing a subset may not provide distribution-level statistical guarantees. We propose a statistical framework for distributional unlearning, in which domains are modeled as probability distributions, and the goal is to remove a carefully chosen subset of samples that reduces the effect of an unwanted distribution while preserving performance on a desired one. We formalize this using a hypothesis test of the edited data with the desired and unwanted domains, leading to an interpretable and robust criterion for selecting samples to remove. Within this statistical framework, we characterize the fundamental region of the allowable edited data distributions and the removal-preservation Pareto frontier for a broad class of distribution families. This includes parametric families such as shifted Gaussians of arbitrary dimension, a one-dimensional location family with log-concave noise, and the one-dimensional Poisson family. It also includes nonparametric families such as the Gaussian white noise model, a canonical model for nonparametric regression. We prove composition rules that describe how distributional unlearning behaves across multimodal unwanted domains, and introduce a central-limit behavior for the removal-preservation baselines when composing a large number of such families. Finally, we provide finite sample guarantees by providing Pareto frontiers for some selection algorithms, and observe an information-computation gap.
[LG-223] StAD: Stein Amortized Divergence for Fast Likelihoods with Diffusion and Flow
链接: https://arxiv.org/abs/2605.16486
作者: Gurjeet Jagwani,Stephen Thorp,Sinan Deger,Hiranya Peiris
类目: Machine Learning (stat.ML); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 24 pages, 10 figures
Abstract:Diffusion and flow-based models are ubiquitously used for generative modelling and density estimation. They admit a deterministic probability flow ordinary differential equation (PF-ODE), analogous to continuous normalizing flows (CNFs), which describes the transport of the probability mass. Obtaining the likelihood from these models is of interest to many workflows, especially Bayesian analysis, and requires solving the trace of the Jacobian to compute the divergence of the learned PF-ODE, which is either \mathcalO(D^2) to compute exactly or \mathcalO(D) with a noisy estimate. We introduce StAD, a new distillation method to predict and learn the divergence of the PF-ODE using the Langevin-Stein operator without ever computing the Jacobian. We show that our method is competitive with the Hutchinson and Hutch++ on CIFAR-10, ImageNet and other density estimation tasks, consistently improving the variance and speed of the likelihood predictions compared to the Hutchinson. We additionally show our method will generalize to a varied class of generative models, and show that under some regularity conditions these learned vector fields can be made to satisfy the Stein class.
[LG-224] Dimension-Uniform Discretization Analysis of Preconditioned Annealed Langevin Dynamics for Multimodal Gaussian Mixtures
链接: https://arxiv.org/abs/2605.16473
作者: Lorenzo Baldassari,Josselin Garnier,Knut Solna,Maarten V. de Hoop
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*备注:
Abstract:Obtaining stable diffusion-based samplers in high- and infinite-dimensional settings is challenging because errors can accumulate across high-frequency coordinates and make the dynamics unstable under refinement of the finite-dimensional approximation of the underlying function-space problem. Discretization is a typical source of such errors, and preconditioning with a suitable spectral decay is one way to control their accumulation. In this paper, we study this problem for preconditioned annealed Langevin dynamics (ALD) applied to Gaussian mixtures. We first show that Euler-Maruyama (EM) discretization, by treating the stiff linear part of the annealed score with a forward Euler step, imposes a stability constraint coupling the preconditioner with the annealed covariance scale. Together with the conditions ensuring dimension-uniform control of the annealed dynamics, this constraint forces the initial smoothed law to remain uniformly close to the target across dimensions. We then consider an exponential-integrator scheme that integrates the stiff linear part of the annealed score exactly. Under explicit spectral summability conditions coupling the smoothing covariance, the component covariance spectra, and the preconditioner, we prove a dimension-uniform Kullback-Leibler (KL) bound for this scheme. This bound can be made arbitrarily small, uniformly in dimension, by allowing enough time for annealing and then refining the time mesh accordingly. Importantly, these conditions allow regimes in which the KL divergence between the target and the initial smoothed law diverges with dimension, showing that the restrictions imposed by EM are scheme-dependent rather than intrinsic to ALD.
附件下载


